|
YA ,
That presupposes that one knows what value to set the firstobs at. I would
care to say in most cases (like the one the poster presented ) the user has
no idea where it will start and it makes the code require human intervention
with every run using a different data set. If you compensate for that lack
of functionality in the firstobs case I bet the two will equal out. And it
would make more sense to use the where statement.
Toby Dunn
From: Ya Huang <ya.huang@AMYLIN.COM>
Reply-To: Ya Huang <ya.huang@AMYLIN.COM>
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: Finding observation # associated with a value
Date: Mon, 22 May 2006 15:53:31 -0400
I got the impression that the original poster already know
how to use where options to subset the data. He just wanted to know
if there is a way to get the firstobs without reading all the
data. His assumption is that firstobs= option is more efficient than
where= option. To confirm that firstobs method is faster, I did the
following simple test:
1 data a;
2 do i=1 to 1000000;
3 output;
4 end;
5 run;
NOTE: The data set WORK.A has 1000000 observations and 1 variables.
NOTE: DATA statement used:
real time 0.17 seconds
cpu time 0.17 seconds
6
7 data c;
8 set a (firstobs=500000);
9 run;
NOTE: There were 500001 observations read from the data set WORK.A.
NOTE: The data set WORK.C has 500001 observations and 1 variables.
NOTE: DATA statement used:
real time 0.20 seconds
cpu time 0.18 seconds
10
11 data d;
12 set a (where=(i >=500000));
13 run;
NOTE: There were 500001 observations read from the data set WORK.A.
WHERE i>=500000;
NOTE: The data set WORK.D has 500001 observations and 1 variables.
NOTE: DATA statement used:
real time 0.29 seconds
cpu time 0.29 seconds
As you can see, the firstobs= options took 0.18 seconds cpu time,
while the where= options took 0.29 seconds.
So it seems that if we can somehow get the firstobs without reading
all the data, it can really save some time. I myself don't think
it is possible. Anyone not think so?
Ya
On Mon, 22 May 2006 12:10:39 -0700, David L Cassell <davidlcassell@MSN.COM>
wrote:
>paulvonhippel@YAHOO.COM wrote:
>>I have a large data set that looks something like this:
>>
>>X Y
>>garbage 2
>>garbage 1
>>garbage 3
>>... ...
>>usable 4
>>usable 16
>>usable 31
>>... ...
>>
>>I'd like to know which observation is the first one with X="usable" so
>>that I don't have to waste time reading the observations with
>>X="garbage". For example, if the first 10,000 lines are garbage, I can
>>save time by using the subset mydata (firstobs=10001).
>>
>>Any suggestions most appreciated.
>
>One simple approach you might try is the use of the WHERE clause.
>
>
>data new;
> set old(where=(X="usable"));
> <more code here>
> .
> .
> .
>
>
>This will really help if you have garbage lines interspersed with
>usable lines. And it will permit you to cut the data down on
>a wide variaety of criteria, so SAS can skip over the dreck.
>
>HTH,
>David
>--
>David L. Cassell
>mathematical statistician
>Design Pathways
>3115 NW Norwood Pl.
>Corvallis OR 97330
>
>_________________________________________________________________
>Don’t just search. Find. Check out the new MSN Search!
>http://search.msn.click-url.com/go/onm00200636ave/direct/01/
|