|Date: ||Mon, 18 Sep 2006 22:01:31 -0700|
|Reply-To: ||David L Cassell <davidlcassell@MSN.COM>|
|Sender: ||"SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>|
|From: ||David L Cassell <davidlcassell@MSN.COM>|
|Subject: ||Re: option obs= and SQL|
|Content-Type: ||text/plain; format=flowed|
> >-----Original Message-----
> >From: Rickards, Clinton (GE Money) [mailto:firstname.lastname@example.org]
> >Sent: Friday, September 15, 2006 12:34 PM
> >To: Pardee, Roy; SAS-L@LISTSERV.UGA.EDU
> >Subject: RE: Re: option obs= and SQL
> >Thanks, Roy.
> >Our filter was objecting to the Google group but the URL below gave me
> >enough to search the archive (which the filter does not object to).
> >Interesting thread. It looks like there really is no way to do
> >everything we want: control the number of obs read from a physical file.
> >Thanks for your help...
> >Yeah. I think the best ideas I've heard for it are Ian's last word on
> >that thread (namely, create a separate subset test table with a
> >WHERE-less select * and then run your test query on that).
> >I can also recall (I think) David Cassell advocating a PROC SURVEYSELECT
> >call to come up w/a closer-to-representative subset of the full table.
> >I've never tried that, so can't comment on how fast it can rip through
> >the full table...
>Yeah, that would probably have been me. I'm the cause of most
>problems around here. :-)
>The use of the first n records for any kind of stat analysis is usually
>A Bad Thing. The use of the first n records for debugging is likely
>to cause you to miss interesting data features. But going through
>the entire data set to get n records takes longer.
>As such, it takes PROC SURVEYSELECT about as much time as one
>pass through the data set with a DATA step.
>David L. Cassell
>3115 NW Norwood Pl.
>Corvallis OR 97330
>I agree with you that selecting the first _n_ obs is not a very good test
>statistically. The purpose of the test was to get my code to execute so I
>could correct variable name issues, invalid function calls, and the like.
Yep, that's why I pointed out that using PROC SURVEYSELECT will take
a lot longer than picking out the first 5000 of millions of records.
But I find that later testing stages usually need more than the first K
records, just to catch all the nuances (and nuisances) of the data.
Fencepost errors, errors of scale, quirks of the data... They all need
to be checked out thoroughly, as you already know.
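
To make the trade-off concrete, here is a rough sketch in Python rather than SAS. It is not how PROC SURVEYSELECT works internally; it only contrasts "stop after the first k records" with a simple random sample drawn in one full pass (reservoir sampling). The function names, the made-up state list, and the seed are all assumptions for illustration.

```python
import random
from itertools import islice

def first_k(records, k):
    """Take the first k records and stop reading (cheap, like an OBS= limit)."""
    return list(islice(records, k))

def reservoir_sample(records, k, seed=None):
    """Simple random sample of k records in one full pass (reservoir sampling)."""
    rng = random.Random(seed)
    sample = []
    for i, rec in enumerate(records):
        if i < k:
            sample.append(rec)       # fill the reservoir with the first k records
        else:
            j = rng.randint(0, i)    # inclusive bounds: 0 <= j <= i
            if j < k:
                sample[j] = rec      # replace a slot with probability k / (i + 1)
    return sample

# Hypothetical data: a file sorted by state, so the early records look alike.
states = ["AK"] * 6000 + ["NE"] * 2000 + ["OR"] * 2000

head = first_k(iter(states), 5000)
samp = reservoir_sample(iter(states), 5000, seed=2006)

print(sorted(set(head)))   # only the states present in the first 5000 rows
print(sorted(set(samp)))   # states present in the random sample
```

With data sorted like this, the first 5000 records contain a single state, so a state-checking macro would never see Oregon or Nebraska. The random sample has to read every record to get there, which is the extra cost mentioned above.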
As Ron would point out, you have to make sure your data has the
states Oregon and Nebraska to make sure that your macro that checks
the state values doesn't have a subtle bug. :-) :-)
David L. Cassell
3115 NW Norwood Pl.
Corvallis OR 97330