John Franklin <jfranklin@QCOTTON.COM.AU> helpfully replied:
> For those of you without SAS/STAT, just use the POINT= option on the
> SET statement.
>
> here is a random sample without replacement
>
>
> data work.subset(drop = obsleft sampsize);
> /* set sample size to the number of sample records required */
> sampsize = 10;
> obsleft = totobs;
> do while(sampsize>0);
> pickit + 1;
> if ranuni(0) < sampsize/obsleft then do;
> set largedataset point = pickit
> nobs = totobs;
> output;
> sampsize = sampsize - 1;
> end;
> obsleft = obsleft - 1;
> end;
> stop;
> run;
>
> Simple, easy, and uses only Base SAS.
And it's extremely efficient for small sample sizes out of large
data sets.
However, I have a couple of comments.
[1] When I use an approach like this, I always use a fixed seed that
I choose first. (RANUNI(0) seeds from the system clock, so every run
gives a different sample.) There's nothing like the joy of generating
a sample for a client which cannot be reproduced... because the client
will need the sample to be reproduced as soon as you turn around. I
would alter the code like so:
data work.subset(drop = obsleft sampsize seed);
   /* set sample size to the number of sample records required */
   sampsize = 10;
   /* set a random seed */
   seed = 4958674;
   /* compute the sampling weight and inclusion probability */
   /* (totobs is filled in at compile time by NOBS= below)   */
   SampleWeight = totobs / sampsize;
   InclProb = 1 / SampleWeight;
   /* now back to John's code */
   obsleft = totobs;
   do while(sampsize > 0);
      pickit + 1;
      if ranuni(seed) < sampsize/obsleft then do;
         set largedataset point = pickit nobs = totobs;
         output;
         sampsize = sampsize - 1;
      end;
      obsleft = obsleft - 1;
   end;
   stop;
run;
[2] I have unfortunately found that some users simply don't
believe this works. If you can't walk them through an induction
proof of the algorithm, you're left with taking another approach
to deal with the Pointy-Haired Boss. Oddly enough, the same
users will believe whatever comes out of a black box called
PROC SURVEYSELECT just because it's a SAS proc.
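For anyone who does want to walk a user through it, here is a sketch of the induction argument. The notation is mine, not John's: p(k; n, N) is the probability that the k-th remaining record gets selected when n records are still needed (sampsize) and N records remain (obsleft). The claim is that p(k; n, N) = n/N for every k.

```latex
% p(k; n, N): probability that the k-th remaining record is selected
% when n records are still needed and N records remain.
% Claim: p(k; n, N) = n/N for 1 <= k <= N and 0 <= n <= N.
\begin{align*}
p(1; n, N) &= \frac{n}{N}
  && \text{(the test ranuni(seed) $<$ sampsize/obsleft)} \\
p(k; n, N) &= \frac{n}{N}\, p(k-1;\, n-1,\, N-1)
            + \Bigl(1 - \frac{n}{N}\Bigr)\, p(k-1;\, n,\, N-1)
  && (k \ge 2) \\
 &= \frac{n}{N}\cdot\frac{n-1}{N-1} + \frac{N-n}{N}\cdot\frac{n}{N-1}
  && \text{(induction hypothesis on $N-1$)} \\
 &= \frac{n\,(N-1)}{N\,(N-1)} = \frac{n}{N}.
\end{align*}
```

The second line just conditions on whether the first remaining record was taken. When n = N the second term has coefficient zero, so it drops out. The same bookkeeping shows that any particular set of n records comes up with probability n!(N-n)!/N!: the acceptances contribute the numerators n, n-1, ..., 1, the rejections contribute N-n, N-n-1, ..., 1, and the denominators run N, N-1, ..., 1. So every possible sample is equally likely, which is what simple random sampling without replacement means.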
[3] While this is a nifty technique for simple random sampling
without replacement from a data set which can be addressed like
this (POINTOBS=1) with a known accessible NOBS and no observations
marked for deletion (DELOBS=0), some of these things can sneak up
and bite you. As I've found out. So keep an eye out for potential
problems. [If DELOBS>0, then the actual number of usable records
is less than NOBS and you have to make sure that any selected
record is not one of the ones marked for deletion. If you don't
have SAS/FSP, then this is probably not an issue for you.] If you
need any extension (something like sampling with weights or ...)
then this doesn't extend easily.
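One way to look for these problems before sampling is to ask SAS about the data set with the ATTRN function. This is only a sketch: it assumes the same LARGEDATASET name as the code above, and the messages are my own wording.

```sas
/* Sketch: check that LARGEDATASET is safe for POINT= sampling.   */
/* The data set name comes from the code above; the messages are  */
/* illustrative only.                                             */
data _null_;
   dsid = open('largedataset');
   if dsid = 0 then do;
      put 'ERROR: could not open LARGEDATASET.';
      stop;
   end;
   nobs  = attrn(dsid, 'NOBS');   /* physical observations, deleted ones included */
   nlobs = attrn(dsid, 'NLOBS');  /* logical observations, deleted ones excluded  */
   ndel  = attrn(dsid, 'NDEL');   /* observations marked for deletion             */
   arand = attrn(dsid, 'ARAND');  /* 1 if the engine supports random access       */
   rc = close(dsid);
   put nobs= nlobs= ndel= arand=;
   if arand ne 1 then
      put 'WARNING: no random access, so POINT= will not work.';
   if ndel > 0 then
      put 'WARNING: deleted observations present, so NOBS overstates the frame.';
run;
```

If NDEL comes back greater than zero, the simplest fix is probably to copy the data set first (a plain DATA step with SET drops the deleted records) and sample from the copy.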
[4] Since the point is usually to get a real probability sample,
code which also generates the sampling weights and inclusion
probabilities is a good thing. I threw that in too. Yes, I know,
they're fixed values for the sample data set, but if you don't
keep them in the sample data set, how will you keep track of them?
PROC SURVEYSELECT would at least print them out for you in the
list output.
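For comparison, here is roughly what the black-box version looks like if you do have SAS/STAT. It is a sketch, not a drop-in replacement: the output name WORK.SUBSET_SS is made up, and the same seed will not give the same records as the DATA step above because the two use different random number streams.

```sas
/* Sketch: the same design with PROC SURVEYSELECT (needs SAS/STAT). */
/* WORK.SUBSET_SS is a hypothetical output name.                    */
proc surveyselect data=largedataset out=work.subset_ss
                  method=srs     /* simple random sampling without replacement */
                  sampsize=10
                  seed=4958674
                  stats;         /* keep SelectionProb and SamplingWeight in OUT= */
run;
```

The STATS option is what puts the selection probability and sampling weight on each output record for an equal-probability design like this one, so they travel with the sample the same way SampleWeight and InclProb do above.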
[5] Now that we're going to all the trouble of taking a probability
sample from a sample frame, we still have the task of forcing people
to analyze the data correctly. PROC MEANS and PROC UNIVARIATE
don't get the variance computations right, because they know nothing
about the sample design (no finite population correction, for a start).
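If SAS/STAT is available, a design-based analysis of the sample might look something like the sketch below. The variable INCOME and the frame size of 100000 are placeholders I made up for illustration. SampleWeight is the variable created in the DATA step above.

```sas
/* Sketch: design-based mean and standard error (needs SAS/STAT).  */
/* FRAME_SIZE and INCOME are hypothetical: substitute the real     */
/* number of records in the frame and a real analysis variable.    */
%let frame_size = 100000;

proc surveymeans data=work.subset total=&frame_size mean stderr clm;
   var income;
   weight SampleWeight;
run;
```

TOTAL= tells the procedure how big the frame is, so it can apply the finite population correction that PROC MEANS leaves out.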
Just my $0.02,
David
--
David Cassell, CSC
Cassell.David@epa.gov
Senior computing specialist
mathematical statistician