LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous (more recent) messageNext (less recent) messagePrevious (more recent) in topicNext (less recent) in topicPrevious (more recent) by same authorNext (less recent) by same authorPrevious page (July 2002, week 2)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Thu, 11 Jul 2002 10:53:36 -0400
Reply-To:     "Dorfman, Paul" <Paul.Dorfman@BCBSFL.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         "Dorfman, Paul" <Paul.Dorfman@BCBSFL.COM>
Subject:      Re: A datastep question...
Comments: To: Witness <bmeyer67@CALVIN.EDU>
Content-Type: text/plain; charset=iso-8859-1

Benjamen,

Let me add to what others have already said (i.e. drop the unneeded variable on the output) that, although it seems logical to drop as many unneeded variables on the input to enhance performance, it does not achieve all that much - contrary to expectations. For example, look at these steps run on the mainframe (to gather EXCPs):

2 data a ; 3 array a (1000) ; 4 do a1 = 1 to 1e4 ; 5 output ; 6 end ; 7 run ;

NOTE: The data set USER.A has 10000 observations and 1000 variables.

8 data b ; 9 set a ( keep = a1 ) ; 10 run ;

NOTE: There were 10000 observations read from the data set USER.A. NOTE: The data set USER.B has 10000 observations and 1 variables. NOTE: The DATA statement used the following resources: CPU time - 00:00:00.07 Elapsed time - 00:00:05.85 EXCP count - 258 Task memory - 3561K (99K data, 3462K program)

11 data c ( keep = a1 ) ; 12 set a ; 13 run ;

NOTE: There were 10000 observations read from the data set USER.A. NOTE: The data set USER.C has 10000 observations and 1 variables. NOTE: The DATA statement used the following resources: CPU time - 00:00:00.15 Elapsed time - 00:00:05.81 EXCP count - 257 Task memory - 3691K (229K data, 3462K program)

Apparently, despite carrying all 999 extra variables in memory, the step where the ballast is drop on the output runs practically as fast as the one where they are (logically) dropped on the input. The explanation is in the EXCP count - they account for the slowest part of the job, and above, they are for all purposes are the same. EXCPs show how much labor it takes to move data from the disk to the buffer, while moving them from the buffer to operating memory is much easier. When you keep 1 variable out of 1000 on the input, it (unfortunately) does not cause SAS to move only that part of the record to the buffer - this operation speaks the language of whole data pages regardless of KEEPs or DROPs.

The moral is, if you can drop the ballast on the input, it is the most logical and programmatically correct thing to do, and by all means do so - it is nonsensical to pollute operating memory and the compiler symbol table with extra stuff (it takes more memory, too), but if you really need to drop it on the way out, it does not exact a high performance toll.

Kind regards, ==================== Paul M. Dorfman Jacksonville, FL ====================

> From: Witness [mailto:bmeyer67@CALVIN.EDU] > > Well, not necessarily a datastep, but... > > I have been starting to use WHERE clauses in set statements > to limit my > data instead of using the subsetting-IF. However, I still find that I > need to have the COLUMN that I am using in the WHERE clause > to in order > to run the step. > > Example: > > DATA InputData; > SET DB.Input(KEEP=ID1 ID2 ID3 WHERE=(ID3=&X); > RUN; > > If I needed ID3 later on, then I wouldn't mind having to have > it in the > returned columns. However, I don't and it takes up space. (I remove it > at the next time that I subset my data, such as a PROC MEANS > procedure.) > > Now, if I were to implement this SQL, I don't need to have the column > ID3 there. > > Example: > > PROC SQL; > CREATE TABLE InputData AS > SELECT ID1,ID2 FROM DB.Input WHERE ID3=&X; > QUIT; > > Is there some way that with the DATA step's (or PROCs if it were to be > done there) that I could drop a column but use it in the WHERE? > > Note: This is not a question of SAS vs. SQL. But a question of SAS in > how to do it, or why can't it be done? If I were worried about space, > the SQL may be more appealing even though it takes a little longer to > run, at least in my test scenario from one of my scripts. > (Difference in > time is 8.57 seconds real time, and 9.98 seconds CPU time. That also > includes a sort, and the data step is done through a PROC SORT > statement.) > > Thanks, > > Benjamen R. Meyer >

Blue Cross Blue Shield of Florida, Inc., and its subsidiary and affiliate companies are not responsible for errors or omissions in this e-mail message. Any personal comments made in this e-mail do not reflect the views of Blue Cross Blue Shield of Florida, Inc.


Back to: Top of message | Previous page | Main SAS-L page