LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous (more recent) messageNext (less recent) messagePrevious (more recent) in topicNext (less recent) in topicPrevious (more recent) by same authorNext (less recent) by same authorPrevious page (September 1996, week 4)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:   Mon, 23 Sep 1996 20:22:39 EDT
Reply-To:   whitloi1@WESTATPO.WESTAT.COM
Sender:   "SAS(r) Discussion" <SAS-L@UGA.CC.UGA.EDU>
From:   Ian Whitlock <whitloi1@WESTATPO.WESTAT.COM>
Subject:   Re[2]: SQL vs. DATA step
Comments:   To: Peter Lund <PeterL@OFM.WA.GOV>, johnw@mag-net.co.uk

Subject: SQL vs. DATA step in a macro involving several steps Summary: Draw conclusions on realistic testing Respondent: Ian Whitlock <whitloi1@westat.com>

Dr John Whittington <johnw@mag-net.co.uk> tested an SQL step and a DATA step on one million records with one variable to conclude the DATA step is much faster.

Peter Lund <PeterL@OFM.WA.GOV> rewrote the SQL code to show the DATA step is only a little faster on John's simple DATA step.

Since I offered the code that raised the question and Roger Deangelis <deangel@HORIZSYS.COM> offered the Observations macro, I tried testing the macros on more realistic data and found

1) I am missing a semicolon in my code that macro facility cheerfully added.

2) The data steps and macro arrays win on 100,000 records with 40 variables, but even Peter's estimate overrates the DATA step code.

435 436 data w ; 437 retain c1 - c20 ' ' 438 n1 - n20 .; 439 440 do i = 1 to 100000 ; 441 output ; 442 end ; 443 run ;

NOTE: The data set WORK.W has 100000 observations and 41 variables. NOTE: The DATA statement used 41.79 seconds.

444 445 options nosource nonotes ; 19:43:09 NOTE: DROPVAR removed the following variables from w C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 C13 C14 C15 C16 C17 C18 C19 C20 N1 N2 N3 N4 N5 N6 N7 N8 N9 N10 N11 N12 N13 N14 N15 N16 N17 N18 N19 N20 to create q 19:44:30 19:44:30

NOTE: Macro DROPVAR removed 40 variable(s) from data set.

19:45:39

I tested my macro with the SQL code first because it had the flexibility to prevent destruction of the test data.

Ian Whitlock

______________________________ Reply Separator _________________________________ Subject: Re: SQL vs. DATA step Author: Peter Lund <PeterL@OFM.WA.GOV> at internet-e-mail Date: 9/23/96 6:33 PM

John (and the rest of SAS-L): As a frequent user of Proc SQL I thought I'd add my two cents (or the British equivalent) worth here. There is often a difference in the performance of SQL and datastep code, but not nearly as significant as is often portrayed. I tried to replicate your example, using a 1 million record, 4 variable dataset and the same simple data step that you used. But, I rewrote the SQL and got significantly different results:

418 data _vars2 (keep=total2) ; 419 set one end = eof ; 420 if b = 2 then total2 + 1 ; 421 if eof ; 422 run ;

NOTE: The data set WORK._VARS2 has 1 observations and 1 variables. NOTE: The DATA statement used 45.1 seconds.

429 proc sql; 430 create table _vars as 431 select count(*) as total2 432 from one 433 where b eq 2; NOTE: Table WORK._VARS created, with 1 rows and 1 columns.

434 quit; NOTE: The PROCEDURE SQL used 56.85 seconds.

Two things are evident here. One, John's machine is faster than mine. Second, like many other things in SAS, there are many ways to do the same thing in SQL. I think this is a good comparison of the simplest datastep and the simplest SQL to do the same task. The datastep is still faster, but not as much as before.

I've gotten to the point where my first reaction to a task is in terms of SQL. It is often a short reaction and I'm on to data step/proc code, but SQL is a very powerful tool. You'd be surprised how often I've found it handy to merge datasets on disparate variables - something impossible to do with a merge, but simple with SQL.

My two cents are up.

Pete Lund WA State Office of Financial Management peterl@ofm.wa.gov

---------- From: John Whittington Subject: SQL vs. DATA step (was: Dropping Variabl Date: Monday, September 23, 1996 2:25PM

On Mon, 23 Sep 1996, Ian Whitlock <whitloi1@WESTATPO.WESTAT.COM> responded to Bill Shannon's question about dropping variables for which all values were 0. Ian observed:

> There is an article in Observations 1996 Q3 on dropping variables with > missing values. This is the same problem only the condition is "= 0". > You could modify the macro presented there or use the one below. > > I wrote the macro DROPVAR in reaction to the older macro techniques > demonstrated in the article......

I have always tended to steer clear of Proc SQL, except for those data manipulations (e.g. Cartesian merges) for which it offers a much simpler approach than DATA step processing, but more recently have been comparing the two approaches (when applicable) more critically.

One of Ian's 'reactions to the older macro techniques' consisted of using Proc SQL instead of a DATA step to determine the number of observations for which the value of a variable was zero. I have compared the two approaches, in skeletal form, on a test dataset of 1 million observations (x 4 variables), using Win SAS 6.11, with the following resultant log:

176 proc sql ; 177 create table _vars as 178 select 179 sum ( b=2 ) as total 180 from one ; NOTE: Table WORK._VARS created, with 1 rows and 1 columns.

181 quit ; NOTE: The PROCEDURE SQL used 2 minutes 0.89 seconds.

182 183 data _vars2 (keep=total2) ; 184 set one end = eof ; 185 if b = 2 then total2 + 1 ; 186 if eof ; 187 run ;

NOTE: The data set WORK._VARS2 has 1 observations and 1 variables. NOTE: The DATA statement used 36.25 seconds.

I could obvioulsy equally well have put the 'answer' into a macro variable, rather than into a 1x1 dataset.

It can be seen that the execution time for Proc SQL was about 3.3 times longer than that for the DATA step approach. This is consistent with my general view that, where both approaches can be used with a similar degree of programming effort, the DATA step approach will usually result in a considerably faster execution, particularly with larger datasets.

Maybe it is sometimes good to be 'reactionary' :-)

John

----------------------------------------------------------- Dr John Whittington, Voice: +44 1296 730225 Mediscience Services Fax: +44 1296 738893 Twyford Manor, Twyford, E-mail: johnw@mag-net.co.uk Buckingham MK18 4EL, UK CompuServe: 100517,3677 -----------------------------------------------------------


Back to: Top of message | Previous page | Main SAS-L page