Correction: In my earlier post, I meant to say that SQL "only reads the
rows it needs from B".
Now here's a test where there is no index and PROC SQL and the DATA step
MERGE should be doing more or less the same thing; that is, sequential
do b1=1 to 5000000;
x = b1;
proc datasets library=work;
modify b (sortedby = x); quit;
do a1=1 to 500;
proc sort data=a;
create table d as
from a as a, b as b
merge b a (in=k);
proc compare data=c compare=d;
On my system (obviously much less powerful than rdporto1's), the SQL step
took 3 seconds and the DATA step took 8 seconds.
These results depend very much on the values of X in the small table (A). I
generated the data so that these values would fall across the whole range
of X values found in B. If the X values in A are all at the low end, SQL is
smart enough to stop processing after A is exhausted, whereas the DATA step
keeps reading from B.
Incidental points: Compression was counterproductive in this case, so I
removed that option. I used the same seed value in all of the RANUNI calls,
because in fact within a DATA step, the pseudorandom number functions
maintain only one stream.
On Thu, 26 Feb 2004 06:13:02 -0500, ben.powell@CLA.CO.UK wrote:
>(And you should use quit not run to terminate proc datasets).
>So which is faster?
>On Wed, 25 Feb 2004 20:27:20 -0500, Howard Schreier
>>You have unintentionally stacked the deck.
>>PROC SQL exploits the index and uses a direct access strategy. It only
>>reads the rows it needs from A.
>>The DATA step uses the index to retrieve the rows in order, but it still
>>reads all 5 million of them.
>>On Wed, 25 Feb 2004 18:57:21 -0300, rdporto1 <rdporto1@TERRA.COM.BR>
>>>Ok. That's the code and some log.
>>>Sincerely, I don't know why this is happening...
>>>data b (compress=yes reuse=yes);
>>>do b1=1 to 5000000;
>>>if b1 in (10 20 30) then x=4;
>>>proc datasets library=work;
>>>index create x;
>>>do a1=1 to 500;
>>>proc sort data=a;
>>>create table d as
>>>from a as a, b as b
>>>merge b a (in=k);
>>>proc compare data=c compare=d;
>>>359 proc sql;
>>>360 create table d as
>>>361 select *
>>>362 from a as a, b as b
>>>363 where a.x=b.x;
>>>WARNING: Variable x already exists on file WORK.D.
>>>NOTE: Table WORK.D created, with 503 rows and 10 columns.
>>>NOTE: PROCEDURE SQL used:
>>> real time 0.01 seconds
>>> cpu time 0.01 seconds
>>>365 data c;
>>>366 merge b a (in=k);
>>>367 by x;
>>>368 if k;
>>>NOTE: There were 5000000 observations read from the data set WORK.B.
>>>NOTE: There were 500 observations read from the data set WORK.A.
>>>NOTE: The data set WORK.C has 503 observations and 10 variables.
>>>NOTE: DATA statement used:
>>> real time 9.71 seconds
>>> cpu time 9.65 seconds