Date: Thu, 2 May 2002 13:48:58 -0700
Reply-To: Dale McLerran <stringplayer_2@YAHOO.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Dale McLerran <stringplayer_2@YAHOO.COM>
Subject: Re: Random Sample Selection
In-Reply-To: <OF85170C65.8DA5ADE6-ON88256BAC.007FBEEE@rtp.epa.gov>
Content-Type: text/plain; charset=us-ascii
--- "David L. Cassell" <Cassell.David@EPAMAIL.EPA.GOV> wrote:
> "Smith, Curtis, Mr, DCAA" <Curtis.Smith@DCAA.MIL> replied:
> > Well, I take my hat off to Dale for proving my code worked.
>
> > . . .
> > that the selection process is random. My problem - I need to take
> far
> > more tries in order to get a random result.
>
> Not quite. Your problem is that when you look at one or two or eight
> tries, it just doesn't *look* random. You don't need to do more
> iterations. Just trust in the force, Luke.
>
> David
In an offlist communication from Curtis, he told me a little more
about why he believed that the process was not necessarily random.
Employing the macroized code which I posted yesterday which produced
a distribution of player frequencies of being selected, Curtis
admitted that while the distribution in 1000 trials appeared to be
uniform (everyone having an approximately equal chance of being
selected), that there were long lags in the selection process which
he did not think should occur. Specifically, the number of draws
that were required to get the first win for player 3 in a 9 player
game was 64 when he ran my code. Now, when the probability of a
win is constant, then the number of losses before the first win has
the geometric distribution. The probability function of the
geometric distribution is p*((1-p)**x), where x is the number of
losses before the first win. If Curtis had picked player 3 as the
only player that he was going to track, then the probability of
observing 63 losses before the first win for player 3 would be
p(63) = (1/9)*((8/9)**63) = 0.00006655
That would go against expectation. Because Curtis most likely
selected the player with the most losses before the first win, the
geometric distribution is not really correct. In picking the worst
case among 9 players, we have biased the process toward more losses,
so the probability of observing 63 losses before the first win will
not be as small as we have computed above. However, let us assume
for the moment that Curtis did indeed select player 3 only and count
the number of losses before the first win. Then 63 losses would be
an extremely rare event if the probability of selecting player 3
was constant from one trial to the next. If that is true, then we
may be observing a long run tendency for approximately equal selection
probability, but not a constant selection probability.
This does appear to be indeed the case when repeatedly keying off
of clock time. Since the clock time is a nonrandom process, the
seed stream over a limited time span may be correlated, producing
higher probabilities of selection of certain players during a given
time span. To investigate this, I revised the code from yesterday
to do two different types of sampling: 1) always initialize the
seed stream with wall clock time, and 2) start the seed stream with
wall clock time, but update the seed at the end of each sampling
process so that after the first draw wall clock time is no longer
used. I perform 10000 draws under each scenario. I then observe
whether the overall distribution of wins conforms to constant
probability for each player (1/9 in a 9 person game). I also test
for each player whether the number of losses before the next win is
drawn from the geometric distribution. (Since I examine all
players, these tests are not independent. However, assuming once
again that we previously specified player i, then the distribution
of number of losses would be geometric.)
Below is the code. I have macroized the number of players as well
as the number of times the game is played. If you run this code,
you are likely to see that the number of losses before the next time
that player i wins is not drawn from the geometric distribution when
wall clock is repeatedly employed to initialize the seed stream.
When the seed stream is updated from one game to the next, then
the number of losses is drawn from a geometric distribution. The
take home message is that the uniform random number generator does
appear to function well as long as you do not build into your process
a correlation of starting seed values. When the starting seed values
are correlated, then the process will no longer be random. Since
all other random number generators employ the uniform random number
generator to achieve the desired distribution, all other random
number generators are probably functioning adequately as well.
Of course, there are other criteria by which we might judge the
adequacy of a random number generator. We have not looked to see
whether there is correlation between wins for player i and wins for
player j. That could also be assessed through use of the geometric
distribution. Conditional on observing a win for player i, what is
the distribution of number of losses before player j wins? I have
spent enough time already, and am satisfied that the uniform PRNG is
behaving well as long as I don't build in failure for the PRNG through
correlated seed streams. But, if anyone wants to take on the challenge
of validating the uniform random number generator further, it would
certainly be welcome.
Here is my code:
/* Macro GEOM will test whether each players distribution of number of
misses */
/* before a hit is drawn from the geometric distribution.
*/
%macro geom_dist(data=);
data %do i=1 %to &nplayers; geom&i(keep=geom&i) %end;;
retain last_p1-last_p&nplayers 0;
set &data;
array geom {&nplayers};
array last_p {&nplayers};
geom{player} = _n_ - last_p{player} - 1;
%do i=1 %to &nplayers;
if player=&i then do;
output geom&i;
last_p{&i}=_n_;
end;
%end;
run;
%do i=1 %to &nplayers;
title "Test whether the number of misses before the next hit for
player &i";
title2 "is drawn from the geometric distribution with binomial
probability 1/&nplayers";
proc freq data=geom&i;
tables geom&i / testp=(%do j=1 %to &Ngroups; &&prob&j %end;);
format geom&i groups.;
run;
%end;
%mend;
*-----------------------------------------------------------------------------*
| SELECT WINNER.SAS
|
*-----------------------------------------------------------------------------*;
%macro player_dist(nplayers=9, ntrials=10000);
OPTIONS PAGESIZE=55 LINESIZE=132 PAGENO=1;
%LET MSS=1;
%LET MSD=0;
data player;
do player=1 to &nplayers;
output;
end;
run;
options nonotes;
/* Obtain statistics needed to test for geometric distribution */
/* of number of misses between hits on i-th player. */
data _null_;
call symput("prob", put(1/&nplayers, best16.));
run;
data _null_;
cumfreq=0;
freq=0;
p = 1/&nplayers;
notp = (&nplayers-1)/&nplayers;
start=0;
nout=0;
remaining=&ntrials;
start=0;
do i=0 to 100 while(remaining>0);
freq = freq + &ntrials*p*(notp**i);
remaining = remaining - &ntrials*p*(notp**i);
if freq/&nplayers>25 then do;
nout+1;
if remaining/&nplayers<25 then do;
freq = freq + remaining;
remaining = 0;
call symput("end"||put(nout, 3. -L), "HIGH");
call symput("Ngroups", put(nout, 3. -L));
end;
else do;
call symput("end"||put(nout, 3. -L), put(i, 3. -L));
end;
call symput("start"||put(nout, 3. -L), put(start, 3. -L));
call symput("freq"||put(nout, 3. -L), put(freq, best32. -L));
prob = freq / &ntrials;
call symput("prob"||put(nout, 3. -L), put(prob, best32. -L));
freq=0;
start = i+1;
end;
end;
run;
proc format;
value groups
%do i=1 %to &Ngroups;
%if &i<10 %then %let j=0&i;
%else %let j=&i;
&&start&i - &&end&i = "&j: &&start&i - &&end&i"
%end;
;
run;
/* First select winners repeatedly employing clock time seed */
proc datasets;
delete winners_cum;
quit;
%do i=1 %to &ntrials;
DATA WINNER;
RETAIN K &MSS N;
IF _N_=1 THEN N=TOTAL;
SET PLAYER NOBS=TOTAL;
IF RANUNI(&MSD)<=K/N THEN
DO;
OUTPUT;
K=K-1;
END;
N=N-1;
IF K=0 THEN STOP;
RUN;
proc append base=winners_cum data=winner;
quit;
%end;
/* Generate statistics on randomness of the process for repeated wall
clock seed */
options notes;
title 'Distribution of player frequencies of winning';
title2 'Seed is always based on wall clock time';
proc freq data=winners_cum;
tables player / testp=(%do i=1 %to &nplayers; &prob %end;);
run;
%geom_dist(data=winners_cum)
/* Now select just first winner employing clock time seed. */
/* Update the seed with each instance of the selection process. */
proc datasets;
delete winners_cum2;
quit;
%do i=1 %to &ntrials;
DATA WINNER;
RETAIN K &MSS N seed &msd;
IF _N_=1 THEN N=TOTAL;
SET PLAYER NOBS=TOTAL;
call ranuni(seed,ranval);
IF ranval<=K/N THEN
DO;
OUTPUT;
K=K-1;
END;
N=N-1;
IF K=0 THEN do;
call symput("MSD", put(seed, 16. -L));
STOP;
end;
RUN;
proc append base=winners_cum2 data=winner;
quit;
%end;
options notes;
title 'Distribution of player frequencies of winning';
title2 'Initial seed only employs wall clock time';
proc freq data=winners_cum2;
tables player / testp=(%do i=1 %to &nplayers; &prob %end;);
run;
%geom_dist(data=winners_cum2)
%mend;
%player_dist(nplayers=9, ntrials=10000)
Dale
=====
---------------------------------------
Dale McLerran
Fred Hutchinson Cancer Research Center
mailto: dmclerra@fhcrc.org
Ph: (206) 667-2926
Fax: (206) 667-5977
---------------------------------------
__________________________________________________
Do You Yahoo!?
Yahoo! Health - your guide to health and wellness
http://health.yahoo.com