Date: Wed, 31 Mar 1999 13:07:20 +0000
Reply-To: John Whittington
Sender: "SAS(r) Discussion"
From: John Whittington
Subject: Re: Randomly selecting records
To: KarlGerber@aol.com
cc: LINCK@ssb.rochester.edu
Content-Type: text/plain; charset="us-ascii"

At 21:54 30/03/99 EST, KarlGerber@aol.com wrote:

>You're right John, the distribution of random variable is irrelevant provided
>the size of file to be sorted does not exceed the number of unique values in
>the distribution. But lets consider an extreme situation of using dichotomous
>variable as a random number generator:
> ....
>The probability of selection as "first.firm" depends here on the original
>order of data, so selection is no longer random.

Karl - Well, yes, I had 'taken it for granted' that we were talking about continuous distributions! As you say, for the selection to be truly random (unrelated to the original order of the data), every observation has to be allocated a unique random value. In the real world, with machine precision being what it is, unless one is dealing with an extremely large dataset (in which case this method for obtaining a random sample is probably very unwise, anyway), the chances of 'ties' using any computer-derived continuous random function are pretty small. However, if that is a concern, the risk of any ties occurring is clearly at its least with a uniform distribution (which is what virtually all of us would use for this purpose) - since the values of the random variate are then 'maximally spread out'.
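To make the point about ties concrete, here is a minimal sketch in Python rather than SAS (purely as a language-neutral illustration; the function names, the seed and the toy list of ten records are all hypothetical). It tags each record with a random key, sorts on the key and keeps the first n, once with a uniform key and once with a dichotomous key. It assumes a stable sort, so that tied keys stay in their original order.

```python
import random

def sample_by_uniform_key(records, n, rng):
    """Attach a uniform(0,1) key to each record, sort on it, keep the first n."""
    keyed = [(rng.random(), rec) for rec in records]
    keyed.sort(key=lambda pair: pair[0])  # stable sort: ties keep original order
    return [rec for _, rec in keyed[:n]]

def sample_by_dichotomous_key(records, n, rng):
    """Same method, but the key is only 0 or 1, so ties are unavoidable."""
    keyed = [(rng.randint(0, 1), rec) for rec in records]
    keyed.sort(key=lambda pair: pair[0])
    return [rec for _, rec in keyed[:n]]

rng = random.Random(1999)      # arbitrary seed, for repeatability only
records = list(range(10))      # record 0 is first in the original order
trials = 2000

# How often does the record that was originally first end up selected first?
first_uniform = [sample_by_uniform_key(records, 1, rng)[0] for _ in range(trials)]
first_dichot = [sample_by_dichotomous_key(records, 1, rng)[0] for _ in range(trials)]
share_uniform = first_uniform.count(0) / trials
share_dichot = first_dichot.count(0) / trials
```

With the uniform key the originally-first record should be picked in roughly one trial in ten, as for any other record. With the dichotomous key it is picked whenever its own key is 0, so in about half the trials, which is exactly the dependence on the original order that Karl describes.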

My real problem with what you originally wrote was your implication that the distribution chosen for the random 'sort' variable was in some way related to the nature of the data. If you recall, you wrote:

>If your data has other than normal distribution
>select any of a dozen random number functions
>that matches your distribution

Whilst, as above, there are some extreme cases (enormous data sets) in which (because of the finite precision of a PRNG) there could be an argument for choosing a particular random variable distribution, the best choice is always going to be 'uniform', regardless of the distribution of the data.
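For anyone wanting to put a rough number on the 'enormous data sets' caveat, the usual birthday-problem approximation can be sketched as follows. This is a Python illustration, not SAS; the function name is hypothetical, and it assumes the generator returns each of m distinct values with equal probability, which is the uniform case. How large m is for a given PRNG is something to check in its documentation rather than take from this sketch.

```python
import math

def prob_any_tie(n, m):
    """Approximate chance that at least two of n keys coincide when each key
    is drawn uniformly from m equally likely values (birthday approximation):
        P(tie) ~= 1 - exp(-n * (n - 1) / (2 * m))
    """
    return -math.expm1(-n * (n - 1) / (2 * m))

# Hypothetical example: keys drawn from a generator with about 2**31 values.
m_assumed = 2 ** 31
risk_small_file = prob_any_tie(1_000, m_assumed)
risk_large_file = prob_any_tie(1_000_000, m_assumed)
```

For a fixed number of distinct values m, a non-uniform distribution concentrates probability on some values and so can only increase the chance of a tie, which is the sense in which uniform is the safest choice.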

Kind Regards,

John

----------------------------------------------------------------
Dr John Whittington,       Voice:    +44 (0) 1296 730225
Mediscience Services       Fax:      +44 (0) 1296 738893
Twyford Manor, Twyford,    E-mail:   medisci@powernet.com
Buckingham  MK18 4EL, UK             mediscience@compuserve.com
----------------------------------------------------------------
