Date: Sat, 6 Sep 2008 16:45:19 -0400
Reply-To: Muthia Kachirayan <muthia.kachirayan@GMAIL.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Muthia Kachirayan <muthia.kachirayan@GMAIL.COM>
Subject: Re: Table lookup puzzle
In-Reply-To: <2fc7f3340809051434x62dd0e92q2c57428ac531685d@mail.gmail.com>
Content-Type: text/plain; charset=ISO-8859-1
> Jay,
>
> Here is the array approach . Hope this meets your million records. Hash
> approach may not be faster than this array approach. Let me know how this
> meets the SQL method.
>
> data a;
> input id mindate maxdate;
> datalines;
> 1 5 8
> 2 4 5
> 3 3 4
> 4 6 6
> 5 5 8
> run;
> data b;
> input time value;
> datalines;
> 5 100
> 6 120
> 6 140
> run;
> data _null_;
> if 0 then set b nobs = num;
> call symputx('num',num);
> stop;
> run;
> %put &num = ;
> data need ;
> do _n_ = 1 by 1 until(eof);
> array k[&num] _temporary_;
> array v[&num] _temporary_;
> set b end = eof;
> k[_n_] = time;
> v[_n_] = value;
> end;
> do until(eof1);
> set a end = eof1;
> do i = 1 to dim(k);
> if (mindate le k[i] le maxdate) then do;
> sum + v[i];
> freq + 1;
> end;
> end;
> output;
> sum = 0;
> freq = 0;
> end;
> keep id mindate maxdate sum freq;
> run;
> proc print data = need;
> run;
>
> Regards,
>
> Muthia Kachirayan
>
Jay,
As both the datasets have millions of records, the processing time will
depend on the number of observations. If na and nb represent the number of
observations of A and B, then na * nb cycles have to be made to find the
SUM and FREQ. This should run to trillions. Can these cycles be reduced? Why
not we combine VALUEs of the duplicate TIME. For instance in the example you
gave last, the dataset B having Time = 6 can be combined to give a new VALUE
of 260 with the count = 2(for freq). If such duplicates are many, then this
would likely reduce the cycles. To try out this, ket me add some
observations to A and B as given below.
data a;
input id mindate maxdate;
datalines;
1 5 8
2 4 5
3 3 4
4 6 6
5 5 8
9 8 9
run;
data b;
input time value;
datalines;
5 100
6 120
6 140
6 150
7 130
7 140
run;
Proc summary or Proc sql can be used to get combined values and frequencies
from B. A hash table can be used instead and simultaneously minimum and
maximum of TIME can be obtained as:
data _null_;
declare hash h();
h.definekey('time');
h.definedata('time','value','count');
h.definedone();
do until(eof);
set b (rename = (value = val)) end = eof;
if h.find() ne 0 then do; value = 0; count = 0; end;
maxv = max(maxv, time);
minv = min(minv, time);
value ++ val;
count ++ 1;
h.replace();
end;
if eof then do;
h.output(dataset:'bb');
call symputx('maxv', maxv);
call symputx('minv', minv);
end;
stop;
run;
%put &maxv;
%put &minv;
Array K[ ] is used to load the VALUEs and N[ ] is used to load the
frequencies. The use of array N[ ] can be dispensed with to save memory if
required.
data need(keep = id mindate maxdate sum freq);
do until(eof);
set bb end = eof;
array k[&minv:&maxv] _temporary_;
array n[&minv:&maxv] _temporary_;
if missing(k[time]) then do;
k[time] = value;
n[time] = count;
end;
end;
do until(eof1);
set a end = eof1;
do _n_ = lbound(k) to hbound(k);
if (mindate LE _n_ LE maxdate) then do;
sum + k[_n_];
freq + n[_n_];
end;
end;
output;
sum = 0;
freq = 0;
end;
run;
proc print data = need;
run;
My earlier code and the present do not require any prior sorting directly or
indirectly. Give it a shot and choose the one that takes the least time.
Regards,
Muthia Kachirayan