Date: Wed, 9 Nov 2005 19:43:28 -0800
Reply-To: David L Cassell <davidlcassell@MSN.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: David L Cassell <davidlcassell@MSN.COM>
Subject: Re: Random Sample on Dataset Subject to Where Statement
>For purposes of creating summary statistics about every variable in a
>particular 'large' dataset, I want to first take a simple random sample of
>records to use in the calculations. My usual method for generating such a
>sample is through the use of direct access to rows using the point= option
>in the set statement. However, this method falls apart when I want to AT
>THE SAME TIME allow the user of my application to specify a where
>statement.
First, I don't think 500,000 records is a large data set. You ought to be
able to do summary stats on the whole data set. I would look to see what summary
stats are slowing things down, and see if *that* can be optimized. I would
check whether the entire data set fits in available RAM, and if so use the
SASFILE statement if you are going to be making a lot of passes through the
data set for multiple WHERE clauses or multiple user requests. I would think
you may have one or two 'summary stats' which are clogging up the pipes
here, since most of what we call 'summary stats' can be computed in a single
pass through the data set. Are you computing upper or lower quantiles?
Iterative descriptive stats like, say, biweight?
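
Just to make the SASFILE idea concrete, here's a minimal sketch (the
library, member name, and WHERE clauses are all made up):

   sasfile mylib.big load;          /* pin the data set in memory */

   proc means data=mylib.big n mean std min max;
      where region = 'WEST';        /* first user request */
   run;

   proc means data=mylib.big n mean std min max;
      where region = 'EAST';        /* second request, no re-read */
   run;

   sasfile mylib.big close;         /* release the memory */

Every pass after the LOAD reads from memory instead of disk, which is
where the savings show up once users start firing off request after
request.
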
>The problem is: take a random sample of 5,000 records from dataset A which
>contains 500,000 records subject to some where statement, without prior
>knowledge about whether the dataset subject to the where clause will have
>more or less than 5,000 records (the chosen sample size).
As you already know, this may have drawbacks. You are *also* introducing
random noise due to the sampling process (unless the WHERE clause returns
<=5000 records). Are you sure that SRS without replacement is optimal
for your data? Are there factors which could be taken into account to
improve the accuracy? Could stratified sampling or control sampling give you any
real benefits? Just asking - inquiring minds want to know.
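
If stratification does look worth a try, the syntax is cheap to test.
A rough sketch, with an invented stratum variable:

   proc sort data=a out=a_srt;
      by region;                    /* STRATA requires sorted input */
   run;

   proc surveyselect data=a_srt out=samp
        method=srs samprate=0.01 seed=12345;
      strata region;                /* 1% SRS within each stratum */
   run;
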
>My current way of dealing with this is to (1) create dataset B which is
>dataset A subject to the where clause, (2) check if B contains more or
>less than 5,000 records, and (3) if B contains more than 5,000 records
>then use my usual simple random sample program to sample B down to 5,000
>records. This is extremely inefficient but I don't know a better way...
Using PROC SURVEYSELECT loses its speed advantage for the same reason:
the proc has to scan the whole resultant subset to figure out the 'population' size
before doing the standard one-pass through the data set.
Unfortunately, I don't think this is inefficient. I say 'unfortunately' because
I don't have an improvement.
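
For the record, the three steps you describe come out to something like
this (the data set names, the WHERE clause, and the macro are all
hypothetical):

   data b;                          /* (1) apply the user's WHERE */
      set a;
      where x > 0;
   run;

   data _null_;                     /* (2) grab B's size */
      if 0 then set b nobs=n;       /* nobs= is set at compile time */
      call symputx('nb', n);
      stop;
   run;

   %macro maybe_sample;             /* (3) sample only if needed */
      %if &nb > 5000 %then %do;
         proc surveyselect data=b out=samp
              method=srs sampsize=5000 seed=12345;
         run;
      %end;
      %else %do;
         data samp;                 /* subset already <= 5000: keep all */
            set b;
         run;
      %end;
   %mend maybe_sample;
   %maybe_sample

And that WHERE-then-count step is exactly the cost I don't see how to
get rid of.
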
>So, does anyone know a better way??? Note that sampling WITHOUT
>replacement must be used.
I think you're stuck checking the subset for size, as long as you have this
model. It may be that there are alternative solutions if you're willing to go
back and analyze what is really costing you the time when you're doing the
summary stats, and you would be willing to re-design from there.
>Final note: I tested proc SURVEYSELECT based on other SAS-L postings and
>found it to be very slow.
I'm not sure what you mean here. Do you mean that using a WHERE clause
in PROC SURVEYSELECT and therefore making it do a couple passes through
the whole data set is slower than using POINT= *after* already doing the
subset? Do you mean that reading the whole data set once is slower than
using POINT= to peek at 1% of the data in a sequential manner to minimize
disk read? Well yeah, I'd agree with both of these.
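
And just so we're comparing the same things, here's the usual shape of
the POINT= trick (a sketch, not your code; the data set name and sample
size are invented). It picks row numbers in ascending order, so the
direct reads stay roughly sequential on disk:

   data sample;
      retain k 5000;                      /* draws still needed */
      do ptr = 1 to ntot;
         /* take row PTR with probability (draws left)/(rows left);
            this yields an exact SRS of 5,000 without replacement */
         if ranuni(12345) < k / (ntot - ptr + 1) then do;
            set a point=ptr nobs=ntot;    /* direct-access read */
            output;
            k = k - 1;
            if k = 0 then leave;          /* got all 5,000 */
         end;
      end;
      stop;                               /* POINT= never sees EOF */
      drop k;
   run;

The catch, as you found, is that POINT= can't honor a WHERE clause, so
the subset has to be materialized first -- and that's the pass you were
hoping to avoid.
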
David L. Cassell
3115 NW Norwood Pl.
Corvallis OR 97330