```Date: Thu, 2 May 2002 13:48:58 -0700 Reply-To: Dale McLerran Sender: "SAS(r) Discussion" From: Dale McLerran Subject: Re: Random Sample Selection In-Reply-To: Content-Type: text/plain; charset=us-ascii --- "David L. Cassell" wrote: > "Smith, Curtis, Mr, DCAA" replied: > > Well, I take my hat off to Dale for proving my code worked. > > > . . . > > that the selection process is random. My problem - I need to take > far > > more tries in order to get a random result. > > Not quite. Your problem is that when you look at one or two or eight > tries, it just doesn't *look* random. You don't need to do more > iterations. Just trust in the force, Luke. > > David In an offlist communication from Curtis, he told me a little more about why he believed that the process was not necessarily random. Employing the macroized code which I posted yesterday which produced a distribution of player frequencies of being selected, Curtis admitted that while the distribution in 1000 trials appeared to be uniform (everyone having an approximately equal chance of being selected), that there were long lags in the selection process which he did not think should occur. Specifically, the number of draws that were required to get the first win for player 3 in a 9 player game was 64 when he ran my code. Now, when the probability of a win is constant, then the number of losses before the first win has the geometric distribution. The probability function of the geometric distribution is p*((1-p)**x), where x is the number of losses before the first win. If Curtis had picked player 3 as the only player that he was going to track, then the probability of observing 63 losses before the first win for player 3 would be p(63) = (1/9)*((8/9)**63) = 0.00006655 That would go against expectation. Because Curtis most likely selected the player with the most losses before the first win, the geometric distribution is not really correct. In picking the worst case among 9 players, we have biased the process toward more losses, so the probability of observing 63 losses before the first win will not be as small as we have computed above. However, let us assume for the moment that Curtis did indeed select player 3 only and count the number of losses before the first win. Then 63 losses would be an extremely rare event if the probability of selecting player 3 was constant from one trial to the next. If that is true, then we may be observing a long run tendency for approximately equal selection probability, but not a constant selection probability. This does appear to be indeed the case when repeatedly keying off of clock time. Since the clock time is a nonrandom process, the seed stream over a limited time span may be correlated, producing higher probabilities of selection of certain players during a given time span. To investigate this, I revised the code from yesterday to do two different types of sampling: 1) always initialize the seed stream with wall clock time, and 2) start the seed stream with wall clock time, but update the seed at the end of each sampling process so that after the first draw wall clock time is no longer used. I perform 10000 draws under each scenario. I then observe whether the overall distribution of wins conforms to constant probability for each player (1/9 in a 9 person game). I also test for each player whether the number of losses before the next win is drawn from the geometric distribution. (Since I examine all players, these tests are not independent. However, assuming once again that we previously specified player i, then the distribution of number of losses would be geometric.) Below is the code. I have macroized the number of players as well as the number of times the game is played. If you run this code, you are likely to see that the number of losses before the next time that player i wins is not drawn from the geometric distribution when wall clock is repeatedly employed to initialize the seed stream. When the seed stream is updated from one game to the next, then the number of losses is drawn from a geometric distribution. The take home message is that the uniform random number generator does appear to function well as long as you do not build into your process a correlation of starting seed values. When the starting seed values are correlated, then the process will no longer be random. Since all other random number generators employ the uniform random number generator to achieve the desired distribution, all other random number generators are probably functioning adequately as well. Of course, there are other criteria by which we might judge the adequacy of a random number generator. We have not looked to see whether there is correlation between wins for player i and wins for player j. That could also be assessed through use of the geometric distribution. Conditional on observing a win for player i, what is the distribution of number of losses before player j wins? I have spent enough time already, and am satisfied that the uniform PRNG is behaving well as long as I don't build in failure for the PRNG through correlated seed streams. But, if anyone wants to take on the challenge of validating the uniform random number generator further, it would certainly be welcome. Here is my code: /* Macro GEOM will test whether each players distribution of number of misses */ /* before a hit is drawn from the geometric distribution. */ %macro geom_dist(data=); data %do i=1 %to &nplayers; geom&i(keep=geom&i) %end;; retain last_p1-last_p&nplayers 0; set &data; array geom {&nplayers}; array last_p {&nplayers}; geom{player} = _n_ - last_p{player} - 1; %do i=1 %to &nplayers; if player=&i then do; output geom&i; last_p{&i}=_n_; end; %end; run; %do i=1 %to &nplayers; title "Test whether the number of misses before the next hit for player &i"; title2 "is drawn from the geometric distribution with binomial probability 1/&nplayers"; proc freq data=geom&i; tables geom&i / testp=(%do j=1 %to &Ngroups; &&prob&j %end;); format geom&i groups.; run; %end; %mend; *-----------------------------------------------------------------------------* | SELECT WINNER.SAS | *-----------------------------------------------------------------------------*; %macro player_dist(nplayers=9, ntrials=10000); OPTIONS PAGESIZE=55 LINESIZE=132 PAGENO=1; %LET MSS=1; %LET MSD=0; data player; do player=1 to &nplayers; output; end; run; options nonotes; /* Obtain statistics needed to test for geometric distribution */ /* of number of misses between hits on i-th player. */ data _null_; call symput("prob", put(1/&nplayers, best16.)); run; data _null_; cumfreq=0; freq=0; p = 1/&nplayers; notp = (&nplayers-1)/&nplayers; start=0; nout=0; remaining=&ntrials; start=0; do i=0 to 100 while(remaining>0); freq = freq + &ntrials*p*(notp**i); remaining = remaining - &ntrials*p*(notp**i); if freq/&nplayers>25 then do; nout+1; if remaining/&nplayers<25 then do; freq = freq + remaining; remaining = 0; call symput("end"||put(nout, 3. -L), "HIGH"); call symput("Ngroups", put(nout, 3. -L)); end; else do; call symput("end"||put(nout, 3. -L), put(i, 3. -L)); end; call symput("start"||put(nout, 3. -L), put(start, 3. -L)); call symput("freq"||put(nout, 3. -L), put(freq, best32. -L)); prob = freq / &ntrials; call symput("prob"||put(nout, 3. -L), put(prob, best32. -L)); freq=0; start = i+1; end; end; run; proc format; value groups %do i=1 %to &Ngroups; %if &i<10 %then %let j=0&i; %else %let j=&i; &&start&i - &&end&i = "&j: &&start&i - &&end&i" %end; ; run; /* First select winners repeatedly employing clock time seed */ proc datasets; delete winners_cum; quit; %do i=1 %to &ntrials; DATA WINNER; RETAIN K &MSS N; IF _N_=1 THEN N=TOTAL; SET PLAYER NOBS=TOTAL; IF RANUNI(&MSD)<=K/N THEN DO; OUTPUT; K=K-1; END; N=N-1; IF K=0 THEN STOP; RUN; proc append base=winners_cum data=winner; quit; %end; /* Generate statistics on randomness of the process for repeated wall clock seed */ options notes; title 'Distribution of player frequencies of winning'; title2 'Seed is always based on wall clock time'; proc freq data=winners_cum; tables player / testp=(%do i=1 %to &nplayers; &prob %end;); run; %geom_dist(data=winners_cum) /* Now select just first winner employing clock time seed. */ /* Update the seed with each instance of the selection process. */ proc datasets; delete winners_cum2; quit; %do i=1 %to &ntrials; DATA WINNER; RETAIN K &MSS N seed &msd; IF _N_=1 THEN N=TOTAL; SET PLAYER NOBS=TOTAL; call ranuni(seed,ranval); IF ranval<=K/N THEN DO; OUTPUT; K=K-1; END; N=N-1; IF K=0 THEN do; call symput("MSD", put(seed, 16. -L)); STOP; end; RUN; proc append base=winners_cum2 data=winner; quit; %end; options notes; title 'Distribution of player frequencies of winning'; title2 'Initial seed only employs wall clock time'; proc freq data=winners_cum2; tables player / testp=(%do i=1 %to &nplayers; &prob %end;); run; %geom_dist(data=winners_cum2) %mend; %player_dist(nplayers=9, ntrials=10000) Dale ===== --------------------------------------- Dale McLerran Fred Hutchinson Cancer Research Center mailto: dmclerra@fhcrc.org Ph: (206) 667-2926 Fax: (206) 667-5977 --------------------------------------- __________________________________________________ Do You Yahoo!? Yahoo! Health - your guide to health and wellness http://health.yahoo.com ```

Back to: Top of message | Previous page | Main SAS-L page