LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (May 2006, week 4)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:   Mon, 22 May 2006 20:31:25 +0000
Reply-To:   toby dunn <tobydunn@HOTMAIL.COM>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   toby dunn <tobydunn@HOTMAIL.COM>
Subject:   Re: Finding observation # associated with a value
Comments:   To: ya.huang@AMYLIN.COM
In-Reply-To:   <200605221953.k4MJq6Zr024931@mailgw.cc.uga.edu>
Content-Type:   text/plain; format=flowed

YA ,

That presupposes that one knows what value to set the firstobs at. I would care to say in most cases (like the one the poster presented ) the user has no idea where it will start and it makes the code require human intervention with every run using a different data set. If you compensate for that lack of functionality in the firstobs case I bet the two will equal out. And it would make more sense to use the where statement.

Toby Dunn

From: Ya Huang <ya.huang@AMYLIN.COM> Reply-To: Ya Huang <ya.huang@AMYLIN.COM> To: SAS-L@LISTSERV.UGA.EDU Subject: Re: Finding observation # associated with a value Date: Mon, 22 May 2006 15:53:31 -0400

I got the impression that the original poster already know how to use where options to subset the data. He just wanted to know if there is a way to get the firstobs without reading all the data. His assumption is that firstobs= option is more efficient than where= option. To confirm that firstobs method is faster, I did the following simple test:

1 data a; 2 do i=1 to 1000000; 3 output; 4 end; 5 run;

NOTE: The data set WORK.A has 1000000 observations and 1 variables. NOTE: DATA statement used: real time 0.17 seconds cpu time 0.17 seconds

6 7 data c; 8 set a (firstobs=500000); 9 run;

NOTE: There were 500001 observations read from the data set WORK.A. NOTE: The data set WORK.C has 500001 observations and 1 variables. NOTE: DATA statement used: real time 0.20 seconds cpu time 0.18 seconds

10 11 data d; 12 set a (where=(i >=500000)); 13 run;

NOTE: There were 500001 observations read from the data set WORK.A. WHERE i>=500000; NOTE: The data set WORK.D has 500001 observations and 1 variables. NOTE: DATA statement used: real time 0.29 seconds cpu time 0.29 seconds

As you can see, the firstobs= options took 0.18 seconds cpu time, while the where= options took 0.29 seconds.

So it seems that if we can somehow get the firstobs without reading all the data, it can really save some time. I myself don't think it is possible. Anyone not think so?

Ya

On Mon, 22 May 2006 12:10:39 -0700, David L Cassell <davidlcassell@MSN.COM> wrote:

>paulvonhippel@YAHOO.COM wrote: >>I have a large data set that looks something like this: >> >>X Y >>garbage 2 >>garbage 1 >>garbage 3 >>... ... >>usable 4 >>usable 16 >>usable 31 >>... ... >> >>I'd like to know which observation is the first one with X="usable" so >>that I don't have to waste time reading the observations with >>X="garbage". For example, if the first 10,000 lines are garbage, I can >>save time by using the subset mydata (firstobs=10001). >> >>Any suggestions most appreciated. > >One simple approach you might try is the use of the WHERE clause. > > >data new; > set old(where=(X="usable")); > <more code here> > . > . > . > > >This will really help if you have garbage lines interspersed with >usable lines. And it will permit you to cut the data down on >a wide variaety of criteria, so SAS can skip over the dreck. > >HTH, >David >-- >David L. Cassell >mathematical statistician >Design Pathways >3115 NW Norwood Pl. >Corvallis OR 97330 > >_________________________________________________________________ >Donít just search. Find. Check out the new MSN Search! >http://search.msn.click-url.com/go/onm00200636ave/direct/01/


Back to: Top of message | Previous page | Main SAS-L page