|
Correction: In my earlier post, I meant to say that SQL "only reads the
rows it needs from B".
Now here's a test where there is no index and PROC SQL and the DATA step
MERGE should be doing more or less the same thing; that is, sequential
matching:
data b;
do b1=1 to 5000000;
x = b1;
b2=ranuni(1);
b3=ranuni(1);
b4=ranuni(1);
b5=ranuni(1);
b6=ranuni(1);
output;
end;
run;
proc datasets library=work;
modify b (sortedby = x); quit;
run;
data a;
do a1=1 to 500;
x=ceil(5000000*ranuni(1));
a2=ranuni(1);
a3='XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX';
output;
end;
run;
proc sort data=a;
by x;
run;
proc sql;
create table d as
select *
from a as a, b as b
where a.x=b.x;
quit;
data c;
merge b a (in=k);
by x;
if k;
run;
quit;
proc compare data=c compare=d;
run;
On my system (obviously much less powerful than rdporto1's), the SQL step
took 3 seconds and the DATA step took 8 seconds.
These results depend very much on the values of X in the small table (A). I
generated the data so that these values would fall across the whole range
of X values found in B. If the X values in A are all at the low end, SQL is
smart enough to stop processing after A is exhausted, whereas the DATA step
keeps reading from B.
Incidental points: Compression was counterproductive in this case, so I
removed that option. I used the same seed value in all of the RANUNI calls,
because in fact within a DATA step, the pseudorandom number functions
maintain only one stream.
On Thu, 26 Feb 2004 06:13:02 -0500, ben.powell@CLA.CO.UK wrote:
>(And you should use quit not run to terminate proc datasets).
>So which is faster?
>
>On Wed, 25 Feb 2004 20:27:20 -0500, Howard Schreier
><Howard_Schreier@ITA.DOC.GOV> wrote:
>
>>You have unintentionally stacked the deck.
>>
>>PROC SQL exploits the index and uses a direct access strategy. It only
>>reads the rows it needs from A.
>>
>>The DATA step uses the index to retrieve the rows in order, but it still
>>reads all 5 million of them.
>>
>>On Wed, 25 Feb 2004 18:57:21 -0300, rdporto1 <rdporto1@TERRA.COM.BR>
wrote:
>>
>>>Ok. That's the code and some log.
>>>Sincerely, I don't know why this is happening...
>>>
>>>data b (compress=yes reuse=yes);
>>>do b1=1 to 5000000;
>>>if b1 in (10 20 30) then x=4;
>>>else x=5000000-b1;
>>>b2=ranuni(13);
>>>b3=ranuni(14);
>>>b4=ranuni(15);
>>>b5=ranuni(16);
>>>b6=ranuni(17);
>>>output;
>>>end;
>>>run;
>>>
>>>proc datasets library=work;
>>>modify b;
>>>index create x;
>>>run;
>>>
>>>data a;
>>>do a1=1 to 500;
>>>x=500-a1;
>>>a2=ranuni(87);
>>>a3='XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX';
>>>output;
>>>end;
>>>run;
>>>
>>>proc sort data=a;
>>>by x;
>>>run;
>>>
>>>proc sql;
>>>create table d as
>>>select *
>>>from a as a, b as b
>>>where a.x=b.x;
>>>quit;
>>>
>>>data c;
>>>merge b a (in=k);
>>>by x;
>>>if k;
>>>run;
>>>quit;
>>>
>>>proc compare data=c compare=d;
>>>run;
>>>
>>>Some log...
>>>
>>>359 proc sql;
>>>360 create table d as
>>>361 select *
>>>362 from a as a, b as b
>>>363 where a.x=b.x;
>>>WARNING: Variable x already exists on file WORK.D.
>>>NOTE: Table WORK.D created, with 503 rows and 10 columns.
>>>
>>>364 quit;
>>>NOTE: PROCEDURE SQL used:
>>> real time 0.01 seconds
>>> cpu time 0.01 seconds
>>>
>>>
>>>365 data c;
>>>366 merge b a (in=k);
>>>367 by x;
>>>368 if k;
>>>369 run;
>>>
>>>NOTE: There were 5000000 observations read from the data set WORK.B.
>>>NOTE: There were 500 observations read from the data set WORK.A.
>>>NOTE: The data set WORK.C has 503 observations and 10 variables.
>>>NOTE: DATA statement used:
>>> real time 9.71 seconds
>>> cpu time 9.65 seconds
|