|
On Fri, 5 Sep 2008 18:18:42 GMT, Jay Weedon <jweedon@EARTHLINK.NET> wrote:
>On Fri, 05 Sep 2008 18:03:34 GMT, Jay Weedon <jweedon@earthlink.net>
>wrote:
>
>>Hi folks,
>>
>>Guess I must be losing my grip of SAS programming, because I can't
>>come up with an efficient solution to this easy-sounding problem.
>>
>>The master dataset A contains two date fields (the dates vary across
>>records). The dataset to be consulted B contains one date field and
>>one result field. The goal is for A to acquire 2 new fields: one
>>representing the sum of all the results in B that occur within the
>>window specified by the dates in A; the other is the number of
>>observations in B from which that mean is computed.
>
>I meant "sum" rather than "mean" at the end there. Mean won't make
>sense if no observations are found that lie in the window.
>
>Oh, and I forgot to mention that it needs to be a fairly efficient
>solution because there are millions of records in both sets.
>
>JW
Even if the records in B span several decades, that's still only a few
thousand distinct dates, so aggregating B to a summary with one count and
one sum per day should shrink the problem by 2+ orders of magnitude. I would
go further and transform the summary to a cumulative form. That way the
processing for each observation in A will require accessing just 2 days'
data from the summary (the closing date of the window and the day preceding
the opening date of the window). I would construct the arrays so that the
dates themselves serve as array indexes.
|