LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (September 2008, week 1)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:   Sat, 6 Sep 2008 16:45:19 -0400
Reply-To:   Muthia Kachirayan <muthia.kachirayan@GMAIL.COM>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   Muthia Kachirayan <muthia.kachirayan@GMAIL.COM>
Subject:   Re: Table lookup puzzle
In-Reply-To:   <2fc7f3340809051434x62dd0e92q2c57428ac531685d@mail.gmail.com>
Content-Type:   text/plain; charset=ISO-8859-1

> Jay, > > Here is the array approach . Hope this meets your million records. Hash > approach may not be faster than this array approach. Let me know how this > meets the SQL method. > > data a; > input id mindate maxdate; > datalines; > 1 5 8 > 2 4 5 > 3 3 4 > 4 6 6 > 5 5 8 > run; > data b; > input time value; > datalines; > 5 100 > 6 120 > 6 140 > run; > data _null_; > if 0 then set b nobs = num; > call symputx('num',num); > stop; > run; > %put &num = ; > data need ; > do _n_ = 1 by 1 until(eof); > array k[&num] _temporary_; > array v[&num] _temporary_; > set b end = eof; > k[_n_] = time; > v[_n_] = value; > end; > do until(eof1); > set a end = eof1; > do i = 1 to dim(k); > if (mindate le k[i] le maxdate) then do; > sum + v[i]; > freq + 1; > end; > end; > output; > sum = 0; > freq = 0; > end; > keep id mindate maxdate sum freq; > run; > proc print data = need; > run; > > Regards, > > Muthia Kachirayan >

Jay,

As both the datasets have millions of records, the processing time will depend on the number of observations. If na and nb represent the number of observations of A and B, then na * nb cycles have to be made to find the SUM and FREQ. This should run to trillions. Can these cycles be reduced? Why not we combine VALUEs of the duplicate TIME. For instance in the example you gave last, the dataset B having Time = 6 can be combined to give a new VALUE of 260 with the count = 2(for freq). If such duplicates are many, then this would likely reduce the cycles. To try out this, ket me add some observations to A and B as given below.

data a; input id mindate maxdate; datalines; 1 5 8 2 4 5 3 3 4 4 6 6 5 5 8 9 8 9 run;

data b; input time value; datalines; 5 100 6 120 6 140 6 150 7 130 7 140 run;

Proc summary or Proc sql can be used to get combined values and frequencies from B. A hash table can be used instead and simultaneously minimum and maximum of TIME can be obtained as:

data _null_; declare hash h(); h.definekey('time'); h.definedata('time','value','count'); h.definedone();

do until(eof); set b (rename = (value = val)) end = eof; if h.find() ne 0 then do; value = 0; count = 0; end; maxv = max(maxv, time); minv = min(minv, time); value ++ val; count ++ 1; h.replace(); end; if eof then do; h.output(dataset:'bb'); call symputx('maxv', maxv); call symputx('minv', minv); end; stop; run;

%put &maxv; %put &minv;

Array K[ ] is used to load the VALUEs and N[ ] is used to load the frequencies. The use of array N[ ] can be dispensed with to save memory if required.

data need(keep = id mindate maxdate sum freq); do until(eof); set bb end = eof; array k[&minv:&maxv] _temporary_; array n[&minv:&maxv] _temporary_; if missing(k[time]) then do; k[time] = value; n[time] = count; end; end; do until(eof1); set a end = eof1; do _n_ = lbound(k) to hbound(k); if (mindate LE _n_ LE maxdate) then do; sum + k[_n_]; freq + n[_n_]; end; end; output; sum = 0; freq = 0; end; run;

proc print data = need; run;

My earlier code and the present do not require any prior sorting directly or indirectly. Give it a shot and choose the one that takes the least time.

Regards,

Muthia Kachirayan


Back to: Top of message | Previous page | Main SAS-L page