LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (November 2005, week 2)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Wed, 9 Nov 2005 19:43:28 -0800
Reply-To:     David L Cassell <davidlcassell@MSN.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         David L Cassell <davidlcassell@MSN.COM>
Subject:      Re: Random Sample on Dataset Subject to Where Statement
In-Reply-To:  <200511090252.jA90Y4d2004390@malibu.cc.uga.edu>
Content-Type: text/plain; format=flowed

walker.627@OSU.EDU wrote: >For purposes of creating summary statistics about every variable in a >particular 'large' dataset, I want to first take a simple random sample of >records to use in the calculations. My usual method for generating such a >sample is through the use of direct access to rows using the point= option >in the set statement. However, this method falls apart when I want to AT >THE SAME TIME allow the user of my application to specify a where >statement.

First, I don't think 500,000 records is a large data set. You ought to be able to do summary stats on the whole data set. I would look to see what summary stats are slowing things down, and see if *that* can be optimized. I would also check whether the entire data set fits in available RAM, and if so use the SASFILE statement if you are going to be making a lot of passes through the data set for multiple WHERE clauses or multiple user requests. I would think that you may have one or two 'summary stats' which are clogging up the pipes here, since most of what we call 'summary stats' can be computed in a single pass through the data set. Are you computing upper or lower quantiles? Iterative descriptive stats like, say, biweight?

>The problem is: take a random sample of 5,000 records from dataset A which >contains 500,000 records subject to some where statement, without prior >knowledge about whether the dataset subject to the where clause will have >more or less than 5,000 records (the chosen sample size).

As you already know, this may have drawbacks. You are *also* introducing random noise due to the sampling process (unless the WHERE clause returns <=5000 records). Are you sure that SRS without replacement is optimal for your data? Are there factors which could be taken into account to improve the accuracy? Could stratified sampling or control sampling give you any real benefits? Just asking - inquiring minds want to know.

>My current way of dealing with this is to (1) create dataset B which is >dataset A subject to the where clause, (2) check if B contains more or >less than 5,000 records, and (3) if B contains more than 5,000 records >then use my usual simple random sample program to sample B down to 5,000 >records. This is extremely inefficient but I don't know a better way...

Using PROC SURVEYSELECT loses its speed advantage for the same reason. the proc has to scan the whole resultant to figure out the 'population' size before doing the standard one-pass through the data set.

Unfortunately, I don't think this is inefficient. I say 'unfortunately' because I don't have an improvement.

>So, does anyone know a better way??? Note that sampling WITHOUT >replacement must be used.

I think you're stuck checking the subset for size, as long as you have this process model. It may be that there are alternative solutions if you're willing to step back and analyze what is really costing you the time when you're doing the summary stats, and you would be willing to re-design from there.

>Final note: I tested proc SURVEYSELECT based on other SAS-L postings and >found it to be very slow.

I'm not sure what you mean here. Do you mean that using a WHERE clause in PROC SURVEYSELECT and therefore making it do a couple passes through the whole data set is slower than using POINT= *after* already doing the subset? Do you mean that reading the whole data set once is slower than using POINT= to peek at 1% of the data in a sequential manner to minimize disk read? Well yeah, I'd agree with both of these.

David -- David L. Cassell mathematical statistician Design Pathways 3115 NW Norwood Pl. Corvallis OR 97330

_________________________________________________________________ Is your PC infected? Get a FREE online computer virus scan from McAfeeŽ Security. http://clinic.mcafee.com/clinic/ibuy/campaign.asp?cid=3963


Back to: Top of message | Previous page | Main SAS-L page