Date: Thu, 18 Dec 2003 17:05:35 -0500 Howard Schreier "SAS(r) Discussion" Howard Schreier Re: Time Interval Aggregation to an Event

I'm not getting the performance improvements I was hoping to see. Ordinarily I would put the problem aside and take another look tomorrow, but I'll be away for a few days so I'll post it now and hope that one of the sqlheads can tune it up.

First, I created macrovariables for the critical dimensions:

%let maxtime=20000; %let lookback=180;

Then I scaled up the test data by looping:

data one; input time vol type \$; do time = time to &MAXTIME by 16; output; end; cards; 1 9 b 2 2 b 2 3 s 2 4 s 5 10 b 6 12 s 7 13 s 7 22 b 9 21 b 12 9 s 14 5 b 15 2 b 16 1 b ;

proc sort data=one; by time; run;

As before, I added sequence numbers:

data two; set one; obsnum ++ 1; run;

Here's the tweaked SQL code:

proc sql; create view cross as select 0 as offset, * from two union select 1 as offset, * from two ; create table three(drop=obsnum) as select distinct two.*, coalesce(sum( cross.vol*(cross.type='b') ),0) as bvol, coalesce(sum(-cross.vol*(cross.type='s') ),0) as svol, calculated bvol + calculated svol as net_vol from two left join cross on int(two.time/&LOOKBACK) - offset = int(cross.time/&LOOKBACK) and two.time-cross.time > 0 and two.time-cross.time <= &LOOKBACK group by two.obsnum order by two.obsnum ; quit;

The idea is to divide time into arbitrary segments of the specified length (&LOOKBACK). They are arbitrary in that it does not really matter where the boundaries fall. The subexpressions referencing the INT function do that. Then for any given observation, the past observations to be rolled into the BVOL and SVOL calculations either fall in the same segment or the immediately preceding one. That's what OFFSET is intended to reference.

So the added condition in the ON clause is actually redundant in that it is less restrictive than the inequality conditions. However, it is an equality condition, which the optimizer can exploit. Whether there is a practical benefit depends on the data, but since 180 is a small fraction of 20000, I expected to see significant faster execution.

For reference, here is the naive code, rewritten slightly to make use of the macrovariables:

proc sql; create table slow(drop=obsnum) as select distinct two.*, coalesce(sum( cross.vol*(cross.type='b') ),0) as bvol, coalesce(sum(-cross.vol*(cross.type='s') ),0) as svol, calculated bvol + calculated svol as net_vol from two left join two as cross on two.time-cross.time > 0 and two.time-cross.time <=&LOOKBACK group by two.obsnum order by two.obsnum ; quit;

I ran PROC COMPARE and confirmed that the two solutions yield identical results.

However, the time saving is only about 50 percent, less than I expected.

On Thu, 18 Dec 2003 12:34:46 -0500, Howard Schreier <Howard_Schreier@ITA.DOC.GOV> wrote: