LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (February 2004, week 4)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:   Thu, 26 Feb 2004 09:09:27 -0500
Reply-To:   Howard Schreier <Howard_Schreier@ITA.DOC.GOV>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   Howard Schreier <Howard_Schreier@ITA.DOC.GOV>
Subject:   Re: Data Step vs. Proc sql

Correction: In my earlier post, I meant to say that SQL "only reads the rows it needs from B".

Now here's a test where there is no index and PROC SQL and the DATA step MERGE should be doing more or less the same thing; that is, sequential matching:

data b; do b1=1 to 5000000; x = b1; b2=ranuni(1); b3=ranuni(1); b4=ranuni(1); b5=ranuni(1); b6=ranuni(1); output; end; run;

proc datasets library=work; modify b (sortedby = x); quit; run;

data a; do a1=1 to 500; x=ceil(5000000*ranuni(1)); a2=ranuni(1); a3='XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'; output; end; run;

proc sort data=a; by x; run;

proc sql; create table d as select * from a as a, b as b where a.x=b.x; quit;

data c; merge b a (in=k); by x; if k; run; quit;

proc compare data=c compare=d; run;

On my system (obviously much less powerful than rdporto1's), the SQL step took 3 seconds and the DATA step took 8 seconds.

These results depend very much on the values of X in the small table (A). I generated the data so that these values would fall across the whole range of X values found in B. If the X values in A are all at the low end, SQL is smart enough to stop processing after A is exhausted, whereas the DATA step keeps reading from B.

Incidental points: Compression was counterproductive in this case, so I removed that option. I used the same seed value in all of the RANUNI calls, because in fact within a DATA step, the pseudorandom number functions maintain only one stream.

On Thu, 26 Feb 2004 06:13:02 -0500, ben.powell@CLA.CO.UK wrote:

>(And you should use quit not run to terminate proc datasets). >So which is faster? > >On Wed, 25 Feb 2004 20:27:20 -0500, Howard Schreier ><Howard_Schreier@ITA.DOC.GOV> wrote: > >>You have unintentionally stacked the deck. >> >>PROC SQL exploits the index and uses a direct access strategy. It only >>reads the rows it needs from A. >> >>The DATA step uses the index to retrieve the rows in order, but it still >>reads all 5 million of them. >> >>On Wed, 25 Feb 2004 18:57:21 -0300, rdporto1 <rdporto1@TERRA.COM.BR> wrote: >> >>>Ok. That's the code and some log. >>>Sincerely, I don't know why this is happening... >>> >>>data b (compress=yes reuse=yes); >>>do b1=1 to 5000000; >>>if b1 in (10 20 30) then x=4; >>>else x=5000000-b1; >>>b2=ranuni(13); >>>b3=ranuni(14); >>>b4=ranuni(15); >>>b5=ranuni(16); >>>b6=ranuni(17); >>>output; >>>end; >>>run; >>> >>>proc datasets library=work; >>>modify b; >>>index create x; >>>run; >>> >>>data a; >>>do a1=1 to 500; >>>x=500-a1; >>>a2=ranuni(87); >>>a3='XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX'; >>>output; >>>end; >>>run; >>> >>>proc sort data=a; >>>by x; >>>run; >>> >>>proc sql; >>>create table d as >>>select * >>>from a as a, b as b >>>where a.x=b.x; >>>quit; >>> >>>data c; >>>merge b a (in=k); >>>by x; >>>if k; >>>run; >>>quit; >>> >>>proc compare data=c compare=d; >>>run; >>> >>>Some log... >>> >>>359 proc sql; >>>360 create table d as >>>361 select * >>>362 from a as a, b as b >>>363 where a.x=b.x; >>>WARNING: Variable x already exists on file WORK.D. >>>NOTE: Table WORK.D created, with 503 rows and 10 columns. >>> >>>364 quit; >>>NOTE: PROCEDURE SQL used: >>> real time 0.01 seconds >>> cpu time 0.01 seconds >>> >>> >>>365 data c; >>>366 merge b a (in=k); >>>367 by x; >>>368 if k; >>>369 run; >>> >>>NOTE: There were 5000000 observations read from the data set WORK.B. >>>NOTE: There were 500 observations read from the data set WORK.A. >>>NOTE: The data set WORK.C has 503 observations and 10 variables. >>>NOTE: DATA statement used: >>> real time 9.71 seconds >>> cpu time 9.65 seconds


Back to: Top of message | Previous page | Main SAS-L page