Date: Thu, 11 Jul 2002 10:53:36 -0400
Reply-To: "Dorfman, Paul" <Paul.Dorfman@BCBSFL.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Dorfman, Paul" <Paul.Dorfman@BCBSFL.COM>
Subject: Re: A datastep question...
Content-Type: text/plain; charset=iso-8859-1
Benjamen,
Let me add to what others have already said (i.e. drop the unneeded variable
on the output) that, although it seems logical to drop as many unneeded
variables on the input to enhance performance, it does not achieve all that
much - contrary to expectations. For example, look at these steps run on the
mainframe (to gather EXCPs):
2 data a ;
3 array a (1000) ;
4 do a1 = 1 to 1e4 ;
5 output ;
6 end ;
7 run ;
NOTE: The data set USER.A has 10000 observations and 1000 variables.
8 data b ;
9 set a ( keep = a1 ) ;
10 run ;
NOTE: There were 10000 observations read from the data set USER.A.
NOTE: The data set USER.B has 10000 observations and 1 variables.
NOTE: The DATA statement used the following resources:
CPU time - 00:00:00.07
Elapsed time - 00:00:05.85
EXCP count - 258
Task memory - 3561K (99K data, 3462K program)
11 data c ( keep = a1 ) ;
12 set a ;
13 run ;
NOTE: There were 10000 observations read from the data set USER.A.
NOTE: The data set USER.C has 10000 observations and 1 variables.
NOTE: The DATA statement used the following resources:
CPU time - 00:00:00.15
Elapsed time - 00:00:05.81
EXCP count - 257
Task memory - 3691K (229K data, 3462K program)
Apparently, despite carrying all 999 extra variables in memory, the step
where the ballast is drop on the output runs practically as fast as the one
where they are (logically) dropped on the input. The explanation is in the
EXCP count - they account for the slowest part of the job, and above, they
are for all purposes are the same. EXCPs show how much labor it takes to
move data from the disk to the buffer, while moving them from the buffer to
operating memory is much easier. When you keep 1 variable out of 1000 on the
input, it (unfortunately) does not cause SAS to move only that part of the
record to the buffer - this operation speaks the language of whole data
pages regardless of KEEPs or DROPs.
The moral is, if you can drop the ballast on the input, it is the most
logical and programmatically correct thing to do, and by all means do so -
it is nonsensical to pollute operating memory and the compiler symbol table
with extra stuff (it takes more memory, too), but if you really need to drop
it on the way out, it does not exact a high performance toll.
Kind regards,
====================
Paul M. Dorfman
Jacksonville, FL
====================
> From: Witness [mailto:bmeyer67@CALVIN.EDU]
>
> Well, not necessarily a datastep, but...
>
> I have been starting to use WHERE clauses in set statements
> to limit my
> data instead of using the subsetting-IF. However, I still find that I
> need to have the COLUMN that I am using in the WHERE clause
> to in order
> to run the step.
>
> Example:
>
> DATA InputData;
> SET DB.Input(KEEP=ID1 ID2 ID3 WHERE=(ID3=&X);
> RUN;
>
> If I needed ID3 later on, then I wouldn't mind having to have
> it in the
> returned columns. However, I don't and it takes up space. (I remove it
> at the next time that I subset my data, such as a PROC MEANS
> procedure.)
>
> Now, if I were to implement this SQL, I don't need to have the column
> ID3 there.
>
> Example:
>
> PROC SQL;
> CREATE TABLE InputData AS
> SELECT ID1,ID2 FROM DB.Input WHERE ID3=&X;
> QUIT;
>
> Is there some way that with the DATA step's (or PROCs if it were to be
> done there) that I could drop a column but use it in the WHERE?
>
> Note: This is not a question of SAS vs. SQL. But a question of SAS in
> how to do it, or why can't it be done? If I were worried about space,
> the SQL may be more appealing even though it takes a little longer to
> run, at least in my test scenario from one of my scripts.
> (Difference in
> time is 8.57 seconds real time, and 9.98 seconds CPU time. That also
> includes a sort, and the data step is done through a PROC SORT
> statement.)
>
> Thanks,
>
> Benjamen R. Meyer
>
Blue Cross Blue Shield of Florida, Inc., and its subsidiary and
affiliate companies are not responsible for errors or omissions in this e-mail message. Any personal comments made in this e-mail do not reflect the views of Blue Cross Blue Shield of Florida, Inc.
|