Date: Tue, 8 Nov 2005 21:52:01 -0500
Reply-To: Paul Walker <walker.627@OSU.EDU>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Paul Walker <walker.627@OSU.EDU>
Subject: Random Sample on Dataset Subject to Where Statement
For purposes of creating summary statistics about every variable in a
particular 'large' dataset, I want to first take a simple random sample of
records to use in the calculations. My usual method for generating such a
sample is through the use of direct access to rows using the point= option
in the set statement. However, this method falls apart when I also want
to let the user of my application specify a where statement AT THE SAME
TIME.
The problem is: take a simple random sample of 5,000 records from dataset
A (500,000 records in total), restricted by some where statement, without
knowing in advance whether the subset selected by the where clause will
have more or fewer than 5,000 records (the chosen sample size).
My current way of dealing with this is to (1) create dataset B, which is
dataset A subject to the where clause, (2) check whether B contains more
or fewer than 5,000 records, and (3) if B contains more than 5,000 records,
use my usual simple random sample program to sample B down to 5,000
records. This is extremely inefficient, but I don't know a better way...
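
For concreteness, here is a minimal sketch of those three steps. It is an
illustration only, not my actual program: the names work.A, work.B,
work.sample, the macro variables user_where and k, and the variable x in
the example condition are all placeholders, and the sampling step uses a
sequential k/n selection test as a stand-in for the point= program.

```sas
/* Placeholders: user_where stands for whatever condition the user supplies. */
%let user_where = %str(x > 0);
%let k = 5000;

/* (1) Materialize B = A subject to the where clause. */
data work.B;
  set work.A;
  where &user_where;
run;

/* (2)+(3) Sample B down to &k rows without replacement.            */
/* Each row is kept with probability (rows still needed) /          */
/* (rows still unread). If B has &k rows or fewer, that ratio is    */
/* always >= 1, so every row is kept and no separate size check is  */
/* required.                                                        */
data work.sample(drop=_need _left);
  retain _need &k _left;
  set work.B nobs=_nobs;          /* _nobs = number of rows in B */
  if _n_ = 1 then _left = _nobs;
  if ranuni(0) * _left < _need then do;
    output;
    _need = _need - 1;
  end;
  _left = _left - 1;
run;
```

The cost I am complaining about is in step (1): B has to be written out
in full before any sampling can start.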
So, does anyone know a better way? Note that the sampling must be done
WITHOUT replacement.
Final note: I tested PROC SURVEYSELECT, based on other SAS-L postings, and
found it to be very slow.