Date: Fri, 5 Dec 2008 03:10:04 -0600
Reply-To: torche@i-minds.be
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Francois Torche <torche@I-MINDS.BE>
Subject: Re: Efficiency Question
Content-Type: text/plain;charset=utf-8
The problem when processing very large datasets is the I/O. The more you
can keep in memory, the faster it will be. If you have enough RAM, you
might try to load one (or both?) table(s) in a hash table...
François
----- Original Message -----
From: "Rob" <data_sleuth_2006@yahoo.com>
Sent: Thu, December 4, 2008 19:15
Subject:Efficiency Question
I have several large datasets (multi-million records) that I need to
find matching records across. The following code works, but I am
hoping that someone might be able to help with speed and efficiency.
Thanks in advance...
******************
CODE
******************
data a;
input var1 var2;
cards;
1 1
2 2
3 3
4 4
5 5
6 6
7 7
;
run;
data b;
input var1 var2 var3;
cards;
7 0 0
6 7 3
1 2 1
2 2 8
;
run;
proc sort data=a;
by var1 var2;
run;
proc sort data=b;
by var1 var2 var3;
run;
/*
The goal is to compare datasets A and B and identify where EITHER
case1 OR case2 are met:
case1 - If var1 in dataset A = var1 in dataset B AND var2 in dataset
A = var2 in dataset B
OR
case2 - If var1 in dataset A = var1 in dataset B AND var2 in dataset
A = var3 in dataset B
THEN
- take action -
*/
data c;
merge a (in=in_a)
b (in=in_b rename=(var2=b_var2));
by var1;
format match_var 1.;
match_var = 0;
if in_a & in_b then do;
if var2 = var3 then match_var = 3;
*if
the variables have
different names, then it works;
else if var2 = b_var2 then match_var = 2; *if the
variables have
the same name, then one must be renamed;
end;
if in_a then output;
run;
----- End of original message -----