Date: Thu, 16 Sep 1999 13:22:46 +0100
Reply-To: John Whittington <medisci@POWERNET.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: John Whittington <medisci@POWERNET.COM>
Subject: Re: Random Selection: Anything More Elegant
In-Reply-To: <199909161107.MAA19259@mail-relay.power.net.uk>
Content-Type: text/plain; charset="us-ascii"
At 12:54 16/09/99 +0200, F a b r i z i o wrote:
>Robert Virgile wrote in message <37DF1C1D.E7B2DBFC@mediaone.net>...
>>....
>>data just20;
>>set balls nobs=totalobs;
>>retain needed 20;
>>if _n_ > 1 then totalobs = totalobs - 1;
>>testnum = ranuni(0);
>>if testnum < needed/totalobs;
>>needed = needed - 1;
>>run;
>
>IMHO this algorithm does not guarantee you will always get exactly 20 balls.
>I even think an exact random sample cannot be reached in a single pass.
>Chances are you run out of balls (i.e. finish the datastep) before 20
>balls were selected. You may then end up with too few balls.
Fabrizio, I think there are a couple of minor errors in Robert's (untested)
code - but the following realisation of the algorithm in question *does*
work as desired:
data just20 (drop = needed totalobs);
retain needed 20 n;
if _n_=1 then n = total;
set one nobs = total;
if ranuni(0) le needed/totalobs then do;
output;
needed = needed - 1;
end;
totalobs = totalobs - 1 ;
if needed = 0 then stop ;
run ;
I agree that the algorithm is highly couter-intuitive - I simply did not
believe it could work when I first met it, and therefore had to convince
myself that it really did work! Since I painstaking typed it out in ASCII
a while back for someone else, thought you might possible like the
following 'mathematical' explanation of why it works .....
... The definition of 'random' obviously means that the probability of
selection of *any* of the elements has to be the same, and equal to k/n.
Consider the very first item considered. There is a k/n probability of it
being selected into the sample. When it comes to the next item, n will
*always* have been decremented by the algorithm to (n-1) and there are two
possibilities as regards k:
a)... there is a k/n probability that the first item was selected, in which
case the algorithm will have decremented k to (k-1). In this case, the
probability of the second item being selected during this second step is
therefore ((k-1)/(n-1)) and thus the overall probability of this having
happened is thus:
(k/n) * ((k-1)/(n-1)) .... (1)
b)... there is a (1 -(k/n)) probability that the first item was *not*
selected, in which case the algorithm will *not* have decremented k. In
this case, the probability of the second item being selected during this
second step is therefore (k/(n-1)) and thus the overall probability of this
having happened is thus:
(1-(k/n)) * (k/(n-1)) .... (2)
Adding (1) and (2) we get the total probability of the second item being
selected into the sample, namely:
[ (k/n) * ((k-1)/(n-1)) ] + [ (1-(k/n)) * (k/(n-1)) ]
( k ) ( k-1 ) ( k )( k )
= ( - ) (-----) + ( 1 - - )(-----)
( n ) ( n-1 ) ( n )( n-1 )
( k ) ( k-1 ) ( n - k )( k )
= ( - ) (-----) + ( ----- )(-----)
( n ) ( n-1 ) ( n )( n-1 )
( k-1 ) ( k )
= k (------ ) + (n-k)( ------ )
( n(n-1) ) ( n(n-1) )
k(k-1) + k(n-k)
= ---------------
n(n-1)
k^2 - k + kn - k^2
= ------------------
n(n-1)
k(n-1)
= ------
n(n-1)
k
= ---
n
Hence the probability of the second item, using this algorithm is also k/n.
The same argument can obviously used for any stage during the selection
process, so that the probability of *any* particular item beings elected is
k/n - hence, random.
I hope I'm explained that clearly enough - algebra via ASCII is not too
clever!
Regards,
John
----------------------------------------------------------------
Dr John Whittington, Voice: +44 (0) 1296 730225
Mediscience Services Fax: +44 (0) 1296 738893
Twyford Manor, Twyford, E-mail: medisci@powernet.com
Buckingham MK18 4EL, UK mediscience@compuserve.com
----------------------------------------------------------------