Date: Fri, 29 Oct 2010 17:21:37 +0000
Reply-To: "Keintz, H. Mark" <mkeintz@WHARTON.UPENN.EDU>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Keintz, H. Mark" <mkeintz@WHARTON.UPENN.EDU>
Subject: Re: Collapsing date records with no gaps
In-Reply-To: <201010282251.o9SMHR1T005071@willow.cc.uga.edu>
Content-Type: text/plain; charset="us-ascii"
Art, thanks.
I reproduced your results, but unbeliever that I am, I figured it couldn't actually be the lookahead method that slowed things down, but rather my use of the MIN function that degraded performance. Only partly true.
I replaced the MINs by more resource-effective code and reduced it to 50% slower than yours. For those who might be interested, here's the look-ahead method, with improved efficiency. But it's not as fast as DOW. Hmm.
Regards,
Mark
data want (keep=id startseries stop_date rename=(startseries=start_date));
retain startseries; format startseries mmddyy8.;
set have;
by id;
** Now look ahead, with wraparound to avoid premature end-of-input signal **;
set have (firstobs=2 keep=start_date rename=(start_date=nextstart))
have (obs=1 keep=start_date rename=(start_date=nextstart));
startseries=coalesce(startseries,start_date);
if last.id=1 or nextstart ^= stop_date then do;
output;
startseries=nextstart;
end;
run;
> -----Original Message-----
> From: Arthur Tabachneck [mailto:art297@NETSCAPE.NET]
> Sent: Thursday, October 28, 2010 6:52 PM
> To: SAS-L@LISTSERV.UGA.EDU; Keintz, H. Mark
> Subject: Re: Collapsing date records with no gaps
>
> Mark,
>
> I had compared the two before submitting the post. My comparison test
> was:
>
> data have (drop=i);
> infile cards ;
> input id (start_date stop_date) (mmddyy8.);
> format start_date stop_date mmddyy8.;
> output;
> do i=1 to 10000;
> id=id*2+i;
> output;
> end;
> cards;
> 1 1/1/10 1/30/10
> 1 1/30/10 2/15/10
> 1 2/15/10 3/15/10
> 1 3/16/10 4/15/10
> 2 5/1/10 5/30/10
> 2 7/1/10 7/15/10
> ;
> run ;
>
> proc sort data=have;
> by id start_date;
> run;
>
> data want (keep=id start_date stop_date);
> do until(last.id);
> set have(rename=(start_date=_start stop_date=_stop));
> by id;
> format start_date stop_date mmddyy8.;
> if first.id then do;
> start_date=_start;
> stop_date=_stop;
> end;
> else if stop_date eq _start then do;
> stop_date = _stop;
> if last.id then output;
> end;
> else if stop_date gt _stop then do;
> _stop=stop_date;
> if last.id then output;
> end;
> else do;
> output;
> start_date=_start;
> stop_date=_stop;
> if last.id then output;
> end;
> end;
> run;
>
> data want2 (drop=startseries nextstart);
> retain startseries;
> set have;
> by id;
>
> ** Now look ahead, with wraparound to avoid premature end-of-input
> signal **;
> set have (firstobs=2 keep=start_date rename=(start_date=nextstart))
> have (obs=1 keep=start_date rename=(start_date=nextstart));
>
> if last.id=1 or nextstart ^= stop_date then do;
> start_date=min(start_date,startseries);
> output;
> startseries=.;
> end;
>
> else startseries=min(startseries,start_date);
> run;
>
> My results, given numerous tests in various orders, were around 4
> seconds
> for the first method, and around 9 seconds for the second method. I
> was
> running 9.1.3 m3 on a 4-processor windows 2003 server.
>
> Art
> -------
> On Thu, 28 Oct 2010 22:21:57 +0000, Keintz, H. Mark
> <mkeintz@WHARTON.UPENN.EDU> wrote:
>
> >Art:
> >
> >I agree that for many people, the DOW approach is as easy to
> understand
> as the lookahead method, and I can see why someone would prefer it.
> For
> one thing, it probably will be more robust if the OP's specifications
> are
> expanded.
> >
> >But using "half the IO time" is, in this situation, is very unlikely.
> Yes, in my offering, dataset HAVE is being read twice, but the two read
> streams are essentially in parallel (one stream reads record I, the
> other
> record I+1). So I believe one SET statement will merely read from the
> operating system's I/O cache generated by servicing the other SET
> statement.
> >
> >Regards,
> >Mark
> >
> >
> >> -----Original Message-----
> >> From: Arthur Tabachneck [mailto:art297@NETSCAPE.NET]
> >> Sent: Thursday, October 28, 2010 5:43 PM
> >> To: SAS-L@LISTSERV.UGA.EDU; Keintz, H. Mark
> >> Subject: Re: Collapsing date records with no gaps
> >>
> >> Mark,
> >>
> >> Of course the problem doesn't "require" the use of a DOW loop, but I
> >> think
> >> it is the most efficient of the approaches that were offered.
> >>
> >> I was originally going to suggest the look-ahead method, but then
> never
> >> sent it after seeing Dan's code.
> >>
> >> However, to address Sterling's comment/question, I think we all
> accept
> >> something that gets the job done and, even better, code that someone
> >> else
> >> can understand. I'm sure that much of my own code would be
> considered
> >> fluff by many, but as long as it does that which was intended, does
> it
> >> correctly, and allows me to me a deadline, I'm quite happy with it.
> >>
> >> Why I prefer Dan's code, in this case (over the look ahead method),
> is
> >> that it only requires half the IO time and is no harder to follow
> than
> >> any
> >> of the other methods suggested.
> >>
> >> Art
> >> -------
> >> On Thu, 28 Oct 2010 13:09:42 +0000, Keintz, H. Mark
> >> <mkeintz@WHARTON.UPENN.EDU> wrote:
> >>
> >> >Folks:
> >> >
> >> >I don't think this problem as stated requires use of a DOW loop
> >> approach. It looks to me like a simple look-ahead task, in which a
> >> record
> >> is output if:
> >> > 1. The current record is the last for an ID, or
> >> > 2. the next record does not have start_date = this record's
> >> stop_date.
> >> >
> >> >This plus a way to RETAIN the start_date for the current record
> series
> >> should be enough:
> >> >
> >> >
> >> >data want (drop=startseries nextstart);
> >> > retain startseries;
> >> > set have;
> >> > by id;
> >> >
> >> > ** Now look ahead, with wraparound to avoid premature end-of-
> input
> >> signal **;
> >> > set have (firstobs=2 keep=start_date
> rename=(start_date=nextstart))
> >> > have (obs=1 keep=start_date
> rename=(start_date=nextstart));
> >> >
> >> > if last.id=1 or nextstart ^= stop_date then do;
> >> > start_date=min(start_date,startseries);
> >> > output;
> >> > startseries=.;
> >> > end;
> >> >
> >> > else startseries=min(startseries,start_date);
> >> >run;
> >> >
> >> >
> >> >Regards,
> >> >Mark
> >> >
> >> >
> >> >> -----Original Message-----
> >> >> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf
> Of
> >> >> Arthur Tabachneck
> >> >> Sent: Wednesday, October 27, 2010 11:52 PM
> >> >> To: SAS-L@LISTSERV.UGA.EDU
> >> >> Subject: Re: Collapsing date records with no gaps
> >> >>
> >> >> Shyam,
> >> >>
> >> >> If that is what the OP wants, a minor modification to one line of
> >> Dan's
> >> >> code will produce the same result. That is:
> >> >>
> >> >> data have ;
> >> >> infile cards ;
> >> >> input id (start_date stop_date) (mmddyy8.);
> >> >> format start_date stop_date mmddyy8.;
> >> >> cards;
> >> >> 1 1/1/10 1/30/10
> >> >> 1 1/30/10 2/15/10
> >> >> 1 2/15/10 3/15/10
> >> >> 1 3/16/10 4/15/10
> >> >> 2 5/1/10 5/30/10
> >> >> 2 7/1/10 7/15/10
> >> >> ;
> >> >> run ;
> >> >>
> >> >> data want (keep=id start_date stop_date);
> >> >> do until(last.id);
> >> >> set have(rename=(start_date=_start stop_date=_stop));
> >> >> by id;
> >> >> format start_date stop_date mmddyy8.;
> >> >> if first.id then do;
> >> >> start_date=_start;
> >> >> stop_date=_stop;
> >> >> end;
> >> >> else if stop_date eq _start or stop_date+1 eq _start then do;
> >> >> stop_date = _stop;
> >> >> if last.id then output;
> >> >> end;
> >> >> else do;
> >> >> output;
> >> >> start_date=_start;
> >> >> stop_date=_stop;
> >> >> if last.id then output;
> >> >> end;
> >> >> end;
> >> >> run;
> >> >>
> >> >> HTH,
> >> >> Art
> >> >> --------
> >> >> On Wed, 27 Oct 2010 22:50:01 -0400, shyamprasad samisetti
> >> >> <shyam.cbz@GMAIL.COM> wrote:
> >> >>
> >> >> >How about, if there is a record with a day after the stop date
> >> >> too........?
> >> >> >I am adding an extra record in addition to the record added by
> >> Daniel
> >> >> >Nordlund. The next start date is a day after stop date.
> >> >> >
> >> >> >Data Rec;
> >> >> > Input Id Startdt :mmddyy10. Stopdt :mmddyy10.;
> >> >> > Datalines;
> >> >> >1 1/1/10 1/30/10
> >> >> >1 1/30/10 2/15/10
> >> >> >1 2/15/10 3/15/10
> >> >> >1 3/16/10 4/15/10
> >> >> >2 5/1/10 5/30/10
> >> >> >2 7/1/10 7/15/10
> >> >> >;
> >> >> >run;
> >> >> >
> >> >> > Proc Sort;
> >> >> > By Id;
> >> >> > Format Startdt Stopdt Date9.;
> >> >> >Run;
> >> >> >
> >> >> >Data Rec1;
> >> >> > SEt Rec;
> >> >> > by Id;
> >> >> > Lagdate1 = Lag(Startdt);
> >> >> > Lagdate2 = Lag(stopdt);
> >> >> > Format Startdt Stopdt Newstart newend Lag: date9.;
> >> >> > Retain newstart newend;
> >> >> > if first.Id then
> >> >> >Do;
> >> >> > Lagdate1 = .;
> >> >> > Lagdate2 = .;
> >> >> > Newstart = Startdt;
> >> >> > Newend = Stopdt;
> >> >> >End;
> >> >> >
> >> >> >If lagdate2 = Startdt Then Newend = Stopdt;
> >> >> >ELSE IF LAGDATE2 + 1 = STARTDT THEN NEWEND = STOPDT; /*looks
> for
> >> the
> >> >> record
> >> >> >with startdt > by 1 day for previous stopdt*/
> >> >> >Else do;
> >> >> > Newstart = Startdt;
> >> >> > Newend = Stopdt;
> >> >> >End;
> >> >> >Run;**
> >> >> >Proc Sort Data = Rec1
> >> >> > out = Rec2;
> >> >> >
> >> >> > By id Newstart;
> >> >> >Run;
> >> >> >Data Final(Keep = id newstart newend);
> >> >> > Set Rec2;
> >> >> > by id newstart;
> >> >> > If Last.Newstart;
> >> >> >Run;
> >> >> >
> >> >> >Hope this helps.
> >> >> >-Shyam.
> >> >> >
> >> >> >On Wed, Oct 27, 2010 at 8:14 PM, Nordlund, Dan (DSHS/RDA) <
> >> >> >NordlDJ@dshs.wa.gov> wrote:
> >> >> >
> >> >> >> > -----Original Message-----
> >> >> >> > From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On
> >> Behalf
> >> >> Of
> >> >> >> > David Friedman
> >> >> >> > Sent: Wednesday, October 27, 2010 4:10 PM
> >> >> >> > To: SAS-L@LISTSERV.UGA.EDU
> >> >> >> > Subject: Collapsing date records with no gaps
> >> >> >> >
> >> >> >> > I have a set of records and I want to collapse them by id
> into
> >> one
> >> >> >> > record if
> >> >> >> > the stop date of the preceding record = the start date of
> the
> >> next
> >> >> >> > record.
> >> >> >> > So in the case below only id=1 would be collapsed since the
> >> >> preceding
> >> >> >> > record
> >> >> >> > has a stop date of 1/30/10 and the start date of the next
> >> record
> >> >> is
> >> >> >> > also
> >> >> >> > 1/30/10. id #2 would not collapse since this condition is
> not
> >> met,
> >> >> i.e.
> >> >> >> > 5/30/10 not equal to 7/1/10.
> >> >> >> >
> >> >> >> > Input:
> >> >> >> >
> >> >> >> > id start date stop date
> >> >> >> > -- ---------- ---------
> >> >> >> > 1 1/1/10 1/30/10
> >> >> >> > 1 1/30/10 2/15/10
> >> >> >> > 2 5/1/10 5/30/10
> >> >> >> > 2 7/1/10 7/15/10
> >> >> >> >
> >> >> >> > Desired output:
> >> >> >> >
> >> >> >> > id start date stop date
> >> >> >> > -- ---------- ---------
> >> >> >> > 1 1/1/10 2/15/10
> >> >> >> > 2 5/1/10 5/30/10
> >> >> >> > 2 7/1/10 7/15/10
> >> >> >>
> >> >> >> David,
> >> >> >>
> >> >> >> I added an additional record for id=1 in case you had more
> than 2
> >> >> records
> >> >> >> that needed to be collapsed. Something like the following
> should
> >> >> work.
> >> >> >>
> >> >> >> data have;
> >> >> >> input id start_date :mmddyy10. stop_date :mmddyy10.;
> >> >> >> cards;
> >> >> >> 1 1/1/10 1/30/10
> >> >> >> 1 1/30/10 2/15/10
> >> >> >> 1 2/15/10 3/15/10
> >> >> >> 2 5/1/10 5/30/10
> >> >> >> 2 7/1/10 7/15/10
> >> >> >> ;
> >> >> >> run;
> >> >> >>
> >> >> >> data want;
> >> >> >> do until(last.id);
> >> >> >> set have(rename=(start_date=_start stop_date=_stop));
> >> >> >> by id;
> >> >> >>
> >> >> >> if first.id then do;
> >> >> >> start_date=_start;
> >> >> >> stop_date=_stop;
> >> >> >> end;
> >> >> >> else if _start=stop_date then do;
> >> >> >> stop_date = _stop;
> >> >> >> if last.id then output;
> >> >> >> end;
> >> >> >> else do;
> >> >> >> output;
> >> >> >> start_date=_start;
> >> >> >> stop_date=_stop;
> >> >> >> if last.id then output;
> >> >> >> end;
> >> >> >> end;
> >> >> >> run;
> >> >> >>
> >> >> >> Hope this is helpful,
> >> >> >>
> >> >> >> Dan
> >> >> >>
> >> >> >> Daniel J. Nordlund
> >> >> >> Washington State Department of Social and Health Services
> >> >> >> Planning, Performance, and Accountability
> >> >> >> Research and Data Analysis Division
> >> >> >> Olympia, WA 98504-5204
> >> >> >>
|