```Date: Thu, 16 Sep 1999 13:22:46 +0100 Reply-To: John Whittington Sender: "SAS(r) Discussion" From: John Whittington Subject: Re: Random Selection: Anything More Elegant Comments: To: F a b r i z i o In-Reply-To: <199909161107.MAA19259@mail-relay.power.net.uk> Content-Type: text/plain; charset="us-ascii" At 12:54 16/09/99 +0200, F a b r i z i o wrote: >Robert Virgile wrote in message <37DF1C1D.E7B2DBFC@mediaone.net>... >>.... >>data just20; >>set balls nobs=totalobs; >>retain needed 20; >>if _n_ > 1 then totalobs = totalobs - 1; >>testnum = ranuni(0); >>if testnum < needed/totalobs; >>needed = needed - 1; >>run; > >IMHO this algorithm does not guarantee you will always get exactly 20 balls. >I even think an exact random sample cannot be reached in a single pass. >Chances are you run out of balls (i.e. finish the datastep) before 20 >balls were selected. You may then end up with too few balls. Fabrizio, I think there are a couple of minor errors in Robert's (untested) code - but the following realisation of the algorithm in question *does* work as desired: data just20 (drop = needed totalobs); retain needed 20 n; if _n_=1 then n = total; set one nobs = total; if ranuni(0) le needed/totalobs then do; output; needed = needed - 1; end; totalobs = totalobs - 1 ; if needed = 0 then stop ; run ; I agree that the algorithm is highly couter-intuitive - I simply did not believe it could work when I first met it, and therefore had to convince myself that it really did work! Since I painstaking typed it out in ASCII a while back for someone else, thought you might possible like the following 'mathematical' explanation of why it works ..... ... The definition of 'random' obviously means that the probability of selection of *any* of the elements has to be the same, and equal to k/n. Consider the very first item considered. There is a k/n probability of it being selected into the sample. When it comes to the next item, n will *always* have been decremented by the algorithm to (n-1) and there are two possibilities as regards k: a)... there is a k/n probability that the first item was selected, in which case the algorithm will have decremented k to (k-1). In this case, the probability of the second item being selected during this second step is therefore ((k-1)/(n-1)) and thus the overall probability of this having happened is thus: (k/n) * ((k-1)/(n-1)) .... (1) b)... there is a (1 -(k/n)) probability that the first item was *not* selected, in which case the algorithm will *not* have decremented k. In this case, the probability of the second item being selected during this second step is therefore (k/(n-1)) and thus the overall probability of this having happened is thus: (1-(k/n)) * (k/(n-1)) .... (2) Adding (1) and (2) we get the total probability of the second item being selected into the sample, namely: [ (k/n) * ((k-1)/(n-1)) ] + [ (1-(k/n)) * (k/(n-1)) ] ( k ) ( k-1 ) ( k )( k ) = ( - ) (-----) + ( 1 - - )(-----) ( n ) ( n-1 ) ( n )( n-1 ) ( k ) ( k-1 ) ( n - k )( k ) = ( - ) (-----) + ( ----- )(-----) ( n ) ( n-1 ) ( n )( n-1 ) ( k-1 ) ( k ) = k (------ ) + (n-k)( ------ ) ( n(n-1) ) ( n(n-1) ) k(k-1) + k(n-k) = --------------- n(n-1) k^2 - k + kn - k^2 = ------------------ n(n-1) k(n-1) = ------ n(n-1) k = --- n Hence the probability of the second item, using this algorithm is also k/n. The same argument can obviously used for any stage during the selection process, so that the probability of *any* particular item beings elected is k/n - hence, random. I hope I'm explained that clearly enough - algebra via ASCII is not too clever! Regards, John ---------------------------------------------------------------- Dr John Whittington, Voice: +44 (0) 1296 730225 Mediscience Services Fax: +44 (0) 1296 738893 Twyford Manor, Twyford, E-mail: medisci@powernet.com Buckingham MK18 4EL, UK mediscience@compuserve.com ---------------------------------------------------------------- ```

Back to: Top of message | Previous page | Main SAS-L page