LISTSERV at the University of Georgia
Date:   Wed, 1 Jun 2005 11:24:03 +1000
Reply-To:   John Franklin <jfranklin@QCOTTON.COM.AU>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   John Franklin <jfranklin@QCOTTON.COM.AU>
Subject:   Re: select certain number of records
Content-Type:   text/plain; charset=US-ASCII

Geez David... Take a pill.

I was just showing a simple sample without replacement for those without SAS/STAT. You can make the sample extraction as complicated as you like.

john

>>> "David L. Cassell" <cassell.david@EPAMAIL.EPA.GOV> 1/06/2005 11:16:32 am >>>

John Franklin <jfranklin@QCOTTON.COM.AU> helpfully replied:

> For those of you without SAS/STAT just use the point = option on the set
> statement
>
> here is a random sample without replacement
>
> data work.subset(drop = obsleft sampsize);
>   /* set sample size to the number of sample records required */
>   sampsize = 10;
>   obsleft = totobs;
>   do while(sampsize>0);
>     pickit + 1;
>     if ranuni(0) < sampsize/obsleft then do;
>       set largedataset point = pickit
>           nobs = totobs;
>       output;
>       sampsize = sampsize - 1;
>     end;
>     obsleft = obsleft - 1;
>   end;
>   stop;
> run;
>
> Simple, easy, and uses only Base SAS.

And it's extremely efficient for small sample sizes out of large data sets.

However, I have a couple comments.

[1] When I use an approach like this, I always use a fixed seed that I choose first. There's nothing like the joy of generating a sample for a client which cannot be reproduced, because the client will need the sample to be reproduced as soon as you turn around. I would alter the code like so:

    data work.subset(drop = obsleft sampsize seed);
      /* set sample size to the number of sample records required */
      sampsize = 10;
      /* set a random seed */
      seed = 4958674;
      /* compute the sampling weight and inclusion probability */
      SampleWeight = totobs / sampsize;
      InclProb = 1 / sampleWeight ;
      /* now back to John's code */
      obsleft = totobs;
      do while(sampsize>0);
        pickit + 1;
        if ranuni(seed) < sampsize/obsleft then do;
          set largedataset point = pickit nobs = totobs;
          output;
          sampsize = sampsize - 1;
        end;
        obsleft = obsleft - 1;
      end;
      stop;
    run;

[2] I have unfortunately found that some users simply don't believe this works. If you can't walk them through an induction proof of the algorithm, you're left with taking another approach to deal with the Pointy-Haired Boss. Oddly enough, the same users will believe whatever comes out of a black box called PROC SURVEYSELECT just because it's a SAS proc.
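For anyone who does want to walk a sceptic through it, here is one way the induction argument can be sketched. The symbols N (for totobs), n (for sampsize) and p(n, N) are notation introduced here, not in the original posts, and the sketch only covers the inclusion probability of a single record, not the stronger statement about whole samples.

```latex
Let $N$ be the number of records (totobs) and $n$ the sample size (sampsize),
with $0 \le n \le N$. Let $p(n, N)$ be the probability that a given record
ends up in the sample. Claim: $p(n, N) = n/N$.

Base case ($N = 1$): the single record is taken when $n = 1$ and skipped
when $n = 0$, so $p(n, 1) = n/1$.

Inductive step: assume the claim holds for $N - 1$ records. The first record
is taken with probability $n/N$ by construction. For any later record,
condition on what happened to the first one. If the first record was taken,
the loop continues as the same algorithm on $N - 1$ records with $n - 1$
still needed; otherwise it continues on $N - 1$ records with $n$ still
needed. Hence
\[
  p(n, N)
  = \frac{n}{N} \cdot \frac{n - 1}{N - 1}
  + \frac{N - n}{N} \cdot \frac{n}{N - 1}
  = \frac{n \, (n - 1 + N - n)}{N (N - 1)}
  = \frac{n}{N} .
\]
```

With sampsize = 10 this says every record has inclusion probability 10/totobs, which is the InclProb value computed in the code above.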

[3] While this is a nifty technique for simple random sampling without replacement from a data set which can be addressed like this (POINTOBS=1) with a known accessible NOBS and no observations marked for deletion (DELOBS=0), some of these things can sneak up and bite you. As I've found out. So keep an eye out for potential problems. [If DELOBS>0, then the actual number of usable records is less than NOBS and you have to make sure that any selected record is not one of the ones marked for deletion. If you don't have SAS/FSP, then this is probably not an issue for you.] If you need any extension (something like sampling with weights or ...) then this doesn't extend easily.
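One way to check those conditions before sampling is to query the data set's attributes with the OPEN and ATTRN functions. This is only a sketch: it assumes the data set is called largedataset as in the code above, and the variable names on the left-hand side are made up for illustration.

```sas
/* Sketch: report the attributes that POINT= sampling relies on. */
data _null_;
  dsid = open('largedataset');          /* data set name taken from the thread */
  if dsid > 0 then do;
    nobs  = attrn(dsid, 'NOBS');        /* physical observations, including deleted ones */
    ndel  = attrn(dsid, 'NDEL');        /* observations marked for deletion */
    nlobs = attrn(dsid, 'NLOBS');       /* logical observations = NOBS - NDEL */
    arand = attrn(dsid, 'ARAND');       /* 1 if random (POINT=) access is supported */
    put nobs= ndel= nlobs= arand=;
    if ndel > 0 then put 'WARNING: deleted observations present; NOBS overstates the frame.';
    if arand ne 1 then put 'WARNING: data set does not support random access.';
    rc = close(dsid);
  end;
  else put 'ERROR: could not open the data set.';
run;
```

If NDEL comes back greater than zero, the usable frame size is NLOBS rather than NOBS, so the sampsize/obsleft arithmetic in the DATA step would need to be revisited.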

[4] Since the point is usually to get a real probability sample, code which also generates the sampling weights and inclusion probabilities is a good thing. I threw that in too. Yes, I know, they're fixed values for the sample data set, but if you don't keep them in the sample data set, how will you keep track of them? PROC SURVEYSELECT would at least print them out for you in the list output.
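For comparison, for those who do have SAS/STAT, the PROC SURVEYSELECT route might look roughly like the sketch below. It reuses the data set name, sample size and seed from the code above. Because the procedure uses its own selection algorithm, the same seed should not be expected to pick the same records as the DATA step version.

```sas
/* Sketch: simple random sample without replacement via SAS/STAT. */
proc surveyselect data=largedataset   /* sampling frame, name from the thread */
                  out=work.subset     /* sample data set */
                  method=srs          /* simple random sampling, no replacement */
                  sampsize=10         /* number of records required */
                  seed=4958674        /* fixed seed so the sample can be reproduced */
                  stats;              /* keep SelectionProb and SamplingWeight in OUT= */
run;
```

The STATS option is what puts the selection probability and sampling weight into the output data set for equal-probability methods such as SRS; without it they appear only in the printed summary.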

[5] Now that we're going to all the trouble of taking a probability sample from a sample frame, we still have the task of forcing people to analyze the data correctly. PROC MEANS and PROC UNIVARIATE don't get the variance computations right.
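The usual design-based alternative is PROC SURVEYMEANS, which is also part of SAS/STAT. The following is a sketch only: y is a hypothetical analysis variable, the population size of 50000 is a placeholder for the real totobs, and work.subset is assumed to carry the SampleWeight variable created by the DATA step above.

```sas
/* Sketch: design-based mean and standard error for the sample. */
%let totobs = 50000;                  /* placeholder: replace with the real frame size */

proc surveymeans data=work.subset
                 total=&totobs        /* population size, enables the finite population correction */
                 mean stderr clm;     /* estimate, standard error, confidence limits */
  var y;                              /* hypothetical analysis variable */
  weight SampleWeight;                /* sampling weight kept in the sample data set */
run;
```

For a simple random sample the point estimate of the mean should agree with PROC MEANS; the difference shows up in the standard error, where TOTAL= brings in the finite population correction.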

Just my $0.02,
David
--
David Cassell, CSC
Cassell.David@epa.gov
Senior computing specialist
mathematical statistician

