|
Subject: SQL vs. DATA step in a macro involving several steps
Summary: Draw conclusions on realistic testing
Respondent: Ian Whitlock <whitloi1@westat.com>
Dr John Whittington <johnw@mag-net.co.uk> tested an SQL step and a DATA
step on one million records with one variable to conclude the DATA step
is much faster.
Peter Lund <PeterL@OFM.WA.GOV> rewrote the SQL code to show the DATA
step is only a little faster on John's simple DATA step.
Since I offered the code that raised the question and Roger Deangelis
<deangel@HORIZSYS.COM> offered the Observations macro, I tried testing
the macros on more realistic data and found
1) I am missing a semicolon in my code that macro facility
cheerfully added.
2) The data steps and macro arrays win on 100,000 records with 40
variables, but even Peter's estimate overrates the DATA step
code.
435
436 data w ;
437 retain c1 - c20 ' '
438 n1 - n20 .;
439
440 do i = 1 to 100000 ;
441 output ;
442 end ;
443 run ;
NOTE: The data set WORK.W has 100000 observations and 41 variables.
NOTE: The DATA statement used 41.79 seconds.
444
445 options nosource nonotes ;
19:43:09
NOTE: DROPVAR removed the following variables from w
C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 C13 C14 C15 C16 C17 C18 C19 C20 N1
N2 N3 N4 N5 N6
N7 N8 N9 N10 N11 N12 N13 N14 N15 N16 N17 N18 N19 N20
to create q
19:44:30
19:44:30
NOTE: Macro DROPVAR removed 40 variable(s) from data set.
19:45:39
I tested my macro with the SQL code first because it had the
flexibility to prevent destruction of the test data.
Ian Whitlock
______________________________ Reply Separator _________________________________
Subject: Re: SQL vs. DATA step
Author: Peter Lund <PeterL@OFM.WA.GOV> at internet-e-mail
Date: 9/23/96 6:33 PM
John (and the rest of SAS-L):
As a frequent user of Proc SQL I thought I'd add my two cents (or the
British equivalent) worth here. There is often a difference in the
performance of SQL and datastep code, but not nearly as significant as is
often portrayed. I tried to replicate your example, using a 1 million
record, 4 variable dataset and the same simple data step that you used.
But, I rewrote the SQL and got significantly different results:
418 data _vars2 (keep=total2) ;
419 set one end = eof ;
420 if b = 2 then total2 + 1 ;
421 if eof ;
422 run ;
NOTE: The data set WORK._VARS2 has 1 observations and 1 variables.
NOTE: The DATA statement used 45.1 seconds.
429 proc sql;
430 create table _vars as
431 select count(*) as total2
432 from one
433 where b eq 2;
NOTE: Table WORK._VARS created, with 1 rows and 1 columns.
434 quit;
NOTE: The PROCEDURE SQL used 56.85 seconds.
Two things are evident here. One, John's machine is faster than mine.
Second, like many other things in SAS, there are many ways to do the same
thing in SQL. I think this is a good comparison of the simplest datastep
and the simplest SQL to do the same task. The datastep is still faster, but
not as much as before.
I've gotten to the point where my first reaction to a task is in terms of
SQL. It is often a short reaction and I'm on to data step/proc code, but
SQL is a very powerful tool. You'd be surprised how often I've found it
handy to merge datasets on disparate variables - something impossible to do
with a merge, but simple with SQL.
My two cents are up.
Pete Lund
WA State Office of Financial Management
peterl@ofm.wa.gov
----------
From: John Whittington
Subject: SQL vs. DATA step (was: Dropping Variabl
Date: Monday, September 23, 1996 2:25PM
On Mon, 23 Sep 1996, Ian Whitlock <whitloi1@WESTATPO.WESTAT.COM> responded
to Bill Shannon's question about dropping variables for which all values
were 0. Ian observed:
> There is an article in Observations 1996 Q3 on dropping variables with
> missing values. This is the same problem only the condition is "= 0".
> You could modify the macro presented there or use the one below.
>
> I wrote the macro DROPVAR in reaction to the older macro techniques
> demonstrated in the article......
I have always tended to steer clear of Proc SQL, except for those data
manipulations (e.g. Cartesian merges) for which it offers a much simpler
approach than DATA step processing, but more recently have been comparing
the two approaches (when applicable) more critically.
One of Ian's 'reactions to the older macro techniques' consisted of using
Proc SQL instead of a DATA step to determine the number of observations for
which the value of a variable was zero. I have compared the two approaches,
in skeletal form, on a test dataset of 1 million observations (x 4
variables), using Win SAS 6.11, with the following resultant log:
176 proc sql ;
177 create table _vars as
178 select
179 sum ( b=2 ) as total
180 from one ;
NOTE: Table WORK._VARS created, with 1 rows and 1 columns.
181 quit ;
NOTE: The PROCEDURE SQL used 2 minutes 0.89 seconds.
182
183 data _vars2 (keep=total2) ;
184 set one end = eof ;
185 if b = 2 then total2 + 1 ;
186 if eof ;
187 run ;
NOTE: The data set WORK._VARS2 has 1 observations and 1 variables.
NOTE: The DATA statement used 36.25 seconds.
I could obvioulsy equally well have put the 'answer' into a macro variable,
rather than into a 1x1 dataset.
It can be seen that the execution time for Proc SQL was about 3.3 times
longer than that for the DATA step approach. This is consistent with my
general view that, where both approaches can be used with a similar degree
of programming effort, the DATA step approach will usually result in a
considerably faster execution, particularly with larger datasets.
Maybe it is sometimes good to be 'reactionary' :-)
John
-----------------------------------------------------------
Dr John Whittington, Voice: +44 1296 730225
Mediscience Services Fax: +44 1296 738893
Twyford Manor, Twyford, E-mail: johnw@mag-net.co.uk
Buckingham MK18 4EL, UK CompuServe: 100517,3677
-----------------------------------------------------------
|