Date: Tue, 20 Feb 2001 08:24:36 -0800
Reply-To: "Lund, Pete" <Peter.Lund@WSIPP.WA.GOV>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Lund, Pete" <Peter.Lund@WSIPP.WA.GOV>
Subject: Re: Wanted: One pass summary solution
Content-Type: text/plain; charset="iso-8859-1"
Nigel-
Here's one possibility using arrays to keep track of the starts and
completes. You can adjust the upper and lower bounds of the arrays as
needed. The output in this example will contain records from the earliest
date to the latest date found in the data. You'll have to make a tweak to
the incrementing logic if there are any missing values.
/* Your test data */
data test;
input startdt : mmddyy. enddt : mmddyy.;
cards;
1/1/01 1/1/01
1/1/01 1/1/01
1/1/01 1/3/01
1/2/01 1/2/01
1/2/01 1/3/01
1/2/01 1/4/01
1/2/01 1/4/01
1/3/01 1/3/01
1/3/01 1/3/01
1/3/01 1/4/01
1/3/01 1/4/01
1/3/01 1/5/01
run;
/* Arbitrary lower and upper bounds */
%let lower = 14500;
%let upper = 15500;
data StartsAndEnds(keep=Date NumStarts NumCompletes);
set test end=done;
format date mmddyy10.;
retain Smallest 99999 Biggest -99999;
array Starts(&Lower:&Upper) _temporary_;
array Completes(&Lower:&Upper) _temporary_;
Starts(StartDt) + 1;
Completes(EndDt) + 1;
Smallest = min(Smallest,StartDt,EndDt);
Biggest = max(Biggest,StartDt,EndDt);
if done then
do i = Smallest to Biggest;
NumStarts = max(Starts(i),0);
NumCompletes = max(Completes(i),0);
Date = i;
output;
end;
run;
Hope this is helpful-
Pete Lund
-------------------------------------------
WA State Institute for Public Policy
110 East Fifth Avenue, Suite 214
PO Box 40999
Olympia, WA 98504-0999
(360) 586-9436 - voice
(360) 586-2793 - fax
(360) 280-4892 - cell
peter.lund@wsipp.wa.gov
----------------------------------------------------
-----Original Message-----
From: Nigel Tufnel [mailto:dousk8@HOTMAIL.COM]
Sent: Tuesday, February 20, 2001 6:04 AM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Wanted: One pass summary solution
I have a data set that I want to summarize by day. I use two primary
columns "startdate" and "completedate". I want the count the number of
starts and completes by day. To do this I summarize the data set twice
(once by startdate and again by completedate) and merge the two.
This data set has millions of records and a few hundred other variables. Is
there a way to get the desired results in one pass?
The data set looks like (among other columns):
startdate completedate
1/1/01 1/1/01
1/1/01 1/1/01
1/1/01 1/3/01
1/2/01 1/2/01
1/2/01 1/3/01
1/2/01 1/4/01
1/2/01 1/4/01
1/3/01 1/3/01
1/3/01 1/3/01
1/3/01 1/4/01
1/3/01 1/4/01
1/3/01 1/5/01
The desired result is:
date starts completes
1/1/01 3 2
1/2/01 4 1
1/3/01 5 4
1/4/01 0 4
1/5/01 0 1
Like I mentioned, the two pass solution is:
proc means data = source nway noprint missing;
class startdate;
var _whatever_;
output out=starts(rename=(_freq_ = starts
startdate = date))
sum=;
run;
proc means data = source nway noprint missing;
class completedate;
var _whatever_;
output out=completes(rename=(_freq_ = completes
completedate = date))
sum=;
run;
data summ;
merge startdate
completedate
;
by date;
run;
A one pass solution here could save a few hours. Any Ideas?
Thanks,
Nigel
_________________________________________________________________
Get your FREE download of MSN Explorer at http://explorer.msn.com