Date: Fri, 9 Apr 2004 08:00:46 -0400
Reply-To: "Richard A. DeVenezia" <radevenz@IX.NETCOM.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Richard A. DeVenezia" <radevenz@IX.NETCOM.COM>
Subject: Re: records between dates
Robert wrote:
>
> For each date, I need to count the number events within the next 14
> day window. Such that the resulting dataset looks as follows:
>
> Obs id date events date+14 number of events
> from date to date+14
>
> 1 1 11/12/01 1 11/26/01 8
> 2 1 11/13/01 2 11/27/01 7
You indicate a repeated id, so presume you need to do this within a by
group.
Here are two SQL ways, the second being faster. The SQL ways do cartesian
joins, which do not scale well. I wouldn't use it for data with more than
10K rows.
*------------------------;
proc sql _method;
create table counts as
select outer.*
, outer.date+14 as endate format = mmddyy10.
, ( select sum(inner.events)
from foo as inner
where inner.id=outer.id
and inner.date between outer.date and outer.date+14
) as nevents
from foo as outer
;
quit;
proc sql _method;
create table counts2 as
select distinct A.*, A.date+14 as endate format=mmddyy10.,
sum(B.events) as nevents
from foo as A, foo as B
where A.id = B.id
and B.date between A.date and A.date+14
group by A.id, A.date
;
quit;
*------------------------;
Consider now a larger problem, 1M rows of data broken into 4,000 groups of
250 observations each. The SQL outer join will run and run and run.
*------------------------;
data foo (index=(focus=(id date)));
do id = 1 to 4000;
do date = today()-249 to today();
* do date = today()-32 to today();
events = int (10*ranuni(0));
* events = 1;
output;
end;
end;
format date mmddyy10.;
run;
*------------------------;
You can speed things up considerably by precomputing the 1M eventSums and
then using SQL to merge the eventSums with the original data.
Q: How do you precompute a value from rolling ranges ?
A: Ring arrays.
*------------------------;
* day to day+14 contains 15 days;
* only works if at most one day per obs, can handle sequences of rows where
event counts of some days are not present;
* ix tracks the current slot for new data;
* ix0 tracks the current slot that is start of ring;
* when size of ring is large enough ix can never 'overrun' ix0 and correct
results ensue.
* when size of ring is too small, incorrect results will occurr and infinite
loops may.;
data endsum (keep=id date0 sum14days);
* not _temporary_, each array element implicitly set to missing at top of
data step;
array d_ring [16] ; * ring dimension should be at least one larger than
maximum number of obs in days range;
array e_ring [16] ;
ix0 = 1;
do ix = 1 by 1 until (last.id);
set foo;
by id;
date0 = d_ring[ix0];
if . < date0 and date - date0 > 14 then do;
* if current date was end of date range, it would be
* larger than range of interest, thus data of date range of
* interest (starting at d_ring[ix0]) is contained in the ring;
sum14days = 0;
* compute value from data in ring array;
do i = ix0 by 1;
if i = ix then leave;
if d_ring[i] = . then leave;
if d_ring[i]-date0 > 14 then leave;
sum14days + e_ring[i];
* roll pointer, set to 0 because by 1 loop control will increment
before executing statements in loop;
if i >= dim(d_ring) then i = 0;
end;
output;
* update start pointer;
ix0 + 1;
if ix0 > dim(d_ring) then ix0 = 1;
end;
* insert data;
d_ring [ix] = date;
e_ring [ix] = events;
* update current pointer, set to 0 because 1 by 1 loop control will
increment before until () test;
if ix >= dim(d_ring) then ix = 0;
end;
* roll off anything left in the ring in reverse so as to not have to
perform extra summing loops;
* this means output will *NOT* be sorted in id,date order;
sum14days = 0;
do i = ix by -1 until (i eq ix0);
date0 = d_ring[i];
sum14days + e_ring[i];
output;
if i = 1 then i = dim(d_ring)+1 ;
end;
format date0 mmddyy10.;
run;
proc sql;
create table counts3 as
select foo.*
, date+14 as endate format=mmddyy10.
, sum14days as eventsSumOverDateRange
from foo, endsum
where foo.id = endsum.id
and foo.date = endsum.date0
;
quit;
*------------------------;
The ring array computation should perform roughly 14 * 1M inner loop
iterations while computing the rolling sums. If the date range was larger,
or the number of rows where larger, the computation could be sped up some
more by maintaining a 'state' sum to which things are added to and
subtracted from (reduces number of innermost looping)
You might also be able to get rolling sums from SAS/QC Proc MACONTROL or
other SAS/STAT Procs.
--
Richard A. DeVenezia
http://www.devenezia.com/downloads/sas/samples/