Date: Fri, 29 Oct 2010 15:22:26 -0700
Reply-To: Sterling Paramore <gnilrets@GMAIL.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Sterling Paramore <gnilrets@GMAIL.COM>
Subject: Collapsing date and coverage records with no gaps
Content-Type: text/plain; charset=ISO-8859-1
Nice work, Dan! I ran my enrollment data through your method and it was in
perfect agreement with mine, but an order of magnitude faster! Too bad the
OP hasn't responded, but I'm learning a lot. Anyway, I think I came up with
my bloated solution because I knew it would have to be modified and expanded
to handle a bit more complicated case. I have enrollment records with three
different coverage types (M - Medical, D - Dental, X - Prescription). I
want to collapse overlapping spans and indicate all the coverage types that
were available at that time. For example:
data have;
length cov $1;
input id $ cov $ start_date :mmddyy10. stop_date :mmddyy10.;
format start_date stop_date yymmdd10.;
cards;
1 M 1/1/10 6/30/10
1 M 7/1/10 12/31/10
1 X 4/1/10 6/30/10
1 X 7/1/10 12/31/10
2 M 1/1/10 8/30/10
2 X 7/1/10 12/30/10
3 M 1/1/10 12/30/10
3 D 2/1/10 6/30/10
3 X 4/1/10 12/30/10
;
run;
WANT:
id start_date
stop_date multicov
1 2010-01-01
2010-03-31 M
1 2010-04-01
2010-12-31 MX
2 2010-01-01
2010-06-30 M
2 2010-07-01
2010-08-30 MX
2 2010-08-31
2010-12-30 X
3 2010-01-01
2010-01-31 M
3 2010-02-01
2010-03-31 DM
3 2010-04-01
2010-06-30 DMX
3 2010-07-01
2010-12-30 MX
After seeing your elegant solution to the previous problem, I wonder if my
solution to this one is just as bloated:
%macro Continuous_Spans_MultiCov(
in_libdsn = , /* INPUT dataset to process */
out_libdsn = , /* OUTPUT dataset with continuous spans */
group_vars = , /* INPUT space-delimited list of variables that identify
a group for which to process continuous spans */
cov_var = , /* INPUT name of coverage var - must NOT be part of
group_vars*/
multicov_var = Span_MultiCov_Cd, /* OUTPUT name of multicoverage code -
multicov cd always in the order DMVX */
start_var = , /* INPUT the name of the variable in &in_libdsn that
starts a span */
end_var = , /* INPUT the name of the variable in &in_libdsn that ends a
span */
worklib = WORK
) / store;
*** Create a format to convert a binary code into a text string ;
* that represents multiple coverage types;
data &worklib.._multiCov_fmt;
length FMTNAME $20.;
length Start 8;
length End 8;
length Type $1.;
length Label $4.;
do i = 0 to 15;
FMTNAME = "multiCov";
Type = "N";
Label = "____";
Start = i;
End = i;
if band(i,1) ^= 0 then
substr(Label,4,1) = "X";
if band(i,2) ^= 0 then
substr(Label,3,1) = "V";
if band(i,4) ^= 0 then
substr(Label,2,1) = "M";
if band(i,8) ^= 0 then
substr(Label,1,1) = "D";
Label = compress(Label,"_");
output;
end;
run;
proc format CNTLIN = &worklib.._multiCov_fmt library = WORK;
run;
** The complementary format converts a single coverage code into a
binary number;
proc format;
invalue $binCov
"X" = 1
"V" = 2
"M" = 4
"D" = 8
;
run;
data &out_libdsn (keep = &group_vars &start_var &end_var &multicov_var);
set &in_libdsn;
by &group_vars;
if _N_ = 1 then do;
length _order _cov $1;
length _date _start _end 8;
format _date _start _end DATE9.;
* Hash table to store each date per group;
* _order = S for start, E for end;
declare hash h_date(ordered:'a');
h_date.definekey('_date','_order','_cov');
h_date.definedone();
* Hash table to store individual spans to be grouped;
declare hash h_pair(ordered:'a');
h_pair.definekey('_cov','_start','_end');
h_pair.definedone();
* Iterator objects needed for traversing hash tables;
declare hiter hi_date('h_date');
declare hiter hi_pair('h_pair');
end;
if first.%scan(&group_vars,-1,%str( )) then do;
h_date.clear();
h_pair.clear();
end;
* Clear hash tables at the beginning of a new group;
h_date.ref(key:&start_var,key:"S",key:&cov_var,
data:&start_var,data:"S",data:&cov_var);
h_date.ref(key:&end_var,key:"E",key:&cov_var,
data:&end_var,data:"E",data:&cov_var);
h_pair.ref(key:&cov_var,key:&start_var,key:&end_var,
data:&cov_var,data:&start_var,data:&end_var);
length Span_Start Span_End endplus1 8;
format Span_Start Span_End endplus1 DATE9.;
length &multicov_var $4;
length start_multicov_bin end_multicov_bin endp1_multicov_bin 8;
format start_multicov_bin end_multicov_bin endp1_multicov_bin
MultiCov.;
if last.%scan(&group_vars,-1,%str( )) then do;
** Read the first start date;
* Note: if this is a zero-day span, this will actually be an E,
but treat it like an S;
rc_date = hi_date.first();
Span_Start = _date;
* Compute multicov code for start dates;
rc_pair = hi_pair.first();
start_multicov_bin = 0;
do while (rc_pair = 0);
if (Span_Start >= _start and Span_Start <= _end) then
start_multicov_bin =
bor(start_multicov_bin,input(_cov,$binCov.));
rc_pair = hi_pair.next();
end;
* Move to next date;
if _order = "S" then
rc_date = hi_date.next();
do while (rc_date = 0);
** Start dates;
if _order = "S" then do;
* See if this start date introduces a new coverage code;
* If so, then terminate previous multicov code and start
a new one;
* If not, then ignore start date;
if band(start_multicov_bin,input(_cov,$binCov.)) = 0
then do;
* Terminate previous multicov code on the previous
day;
Span_End = intnx('days',_date,-1);
&multicov_var = put(start_multicov_bin,MultiCov.);
&start_var = Span_Start;
&end_var = Span_End;
output &out_libdsn;
* Start new span and calculate multicov code on this
new start date;
Span_Start = _date;
rc_pair = hi_pair.first();
start_multicov_bin = 0;
do while (rc_pair = 0);
if (Span_Start >= _start and Span_Start <= _end)
then
start_multicov_bin =
bor(start_multicov_bin,input(_cov,$binCov.));
rc_pair = hi_pair.next();
end;
end;
* Move to next date point;
rc_date = hi_date.next();
end;
** End dates;
else if _order = "E" then do;
Span_End = _date;
endplus1 = intnx('days',_date,1);
* Determine multicov code for this end date and the day
after;
rc_pair = hi_pair.first();
end_multicov_bin = 0;
endp1_multicov_bin = 0;
do while (rc_pair = 0);
if (Span_End >= _start and Span_End <= _end) then
end_multicov_bin =
bor(end_multicov_bin,input(_cov,$binCov.));
if (endplus1 >= _start and endplus1 <= _end) then
endp1_multicov_bin =
bor(endp1_multicov_bin,input(_cov,$binCov.));
rc_pair = hi_pair.next();
end;
* If coverage code is unchanged, then just move to next
date;
if start_multicov_bin = end_multicov_bin
and start_multicov_bin = endp1_multicov_bin then do;
rc_date = hi_date.next();
end;
* If coverage changes on the end date;
* This can happen when new coverage starts the day old
coverage terms;
* Occurance of this may result in single-day coverage
spans;
if start_multicov_bin ^= end_multicov_bin then do;
* Terminate current multicov code;
Span_End = intnx('days',_date,-1);
&multicov_var = put(start_multicov_bin,MultiCov.);
&start_var = Span_Start;
&end_var = Span_End;
output &out_libdsn;
* Start new multicov span;
Span_Start = _date;
start_multicov_bin = end_multicov_bin;
* Move to next date on or after new Span Start if
multicov does not immediately change;
if end_multicov_bin = endp1_multicov_bin then do;
rc_date = hi_date.next();
do while (rc_date = 0 and _date < Span_Start);
rc_date = hi_date.next();
end;
end;
end;
* If coverage changes after the end date then output
this span and move to next date;
if end_multicov_bin ^= endp1_multicov_bin then do;
* Terminate current multicov code;
Span_End = _date;
&multicov_var = put(start_multicov_bin,MultiCov.);
&start_var = Span_Start;
&end_var = Span_End;
output &out_libdsn;
* Start new multicov span, if applicable;
if endp1_multicov_bin ^= 0 then do;
Span_Start = intnx('days',_date,1);
start_multicov_bin = endp1_multicov_bin;
* Move to next date on or after new Start Date;
rc_date = hi_date.next();
do while (rc_date = 0 and _date < Span_Start);
rc_date = hi_date.next();
end;
end;
* If there is no new multicov span, find next start
date and calc new multicov;
else do;
* Move to next start date;
do until (_order = "S" or rc_date ^= 0);
rc_date = hi_date.next();
end;
Span_Start = _date;
rc_pair = hi_pair.first();
start_multicov_bin = 0;
do while (rc_date = 0 and rc_pair = 0);
if (Span_Start >= _start and Span_Start <=
_end) then
start_multicov_bin =
bor(start_multicov_bin,input(_cov,$binCov.));
rc_pair = hi_pair.next();
end;
end;
end;
end; * End _order = E;
end; * End loop over enrollment spans;
* Remove pointers;
hi_date.last();
rc_d = hi_date.next();
hi_pair.last();
rc_e = hi_pair.next();
end; * End last-dot;
run;
%mend Continuous_Spans_MultiCov;
On Fri, Oct 29, 2010 at 1:33 PM, Nordlund, Dan (DSHS/RDA) <
NordlDJ@dshs.wa.gov> wrote:
> > -----Original Message-----
> > From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
> > Søren Lassen
> > Sent: Friday, October 29, 2010 1:31 AM
> > To: SAS-L@LISTSERV.UGA.EDU
> > Subject: Re: Collapsing date records with no gaps
> >
> > David,
> > A large number of solutions have been proposed. Most of them
> > are variations on a DOW loop. I did not study the other solutions
> > in detail, but it seems to me that all the DOW solutions proposed
> > will output nothing if there is only one record for an ID.
> >
> > My suggested solution is this:
> >
> > proc sort data=have;
> > by id start_date stop_date;
> > /* Get rid of zero/negative length intervals */
> > where stop_date ge start_date or stop_date is null;
> > run;
> >
> > data want (keep=id start_date stop_date);
> > set have(rename=(start_date=_start stop_date=_stop));
> > by id;
> > format start_date stop_date mmddyy8.;
> > retain start_date stop_date;
> > select;
> > when(first.id) do;
> > start_date=_start;
> > stop_date=_stop;
> > end;
> > when(missing(stop_date)); /* never stops, "infinite" stop date */
> > when(stop_date ge _stop); /* nested intervals */
> > when (stop_date ge _start-1) do; /* adjacent or overlappping */
> > if missing(_stop) then
> > stop_date=.;
> > else
> > stop_date = max(_stop,stop_date);
> > end;
> > otherwise do;
> > output;
> > start_date=_start;
> > stop_date=_stop;
> > end;
> > end;
> > if last.id then output;
> > drop _:;
> > run;
> >
> > As you can see, I left out the DOW loop and put in a retain statement
> > instead. This is mostly a matter of personal taste, likewise the
> > use of SELECT instead of nested IF..THEN..ELSE statements.
> >
> > I added an extra twist: if stop_date is missing, it probably means
> > that the record is still valid (supposedly into eternity).
> >
> > Regards,
> > Søren
> >
>
> Sören,
>
> You are correct that my original solution did not handle the case of a
> single record for an id. I offer a corrected version which also handles
> overlapping and nested spans in addition to the case where a span begins the
> following day. It could easily be modified to handle the case where a stop
> _date was missing, but at first blush I would probably handle it differently
> than you if there were a following record for the id with a non-missing
> stop_date. But that brings me back to a comment that I made in response to
> Sterling. And that is, what to do with spans that overlap in various ways
> (or have missing values) depends on local business rules. It is unfortunate
> that the OP hasn't responded with clarification or told us how the problem
> was solved (if I missed that response, I apologize). But anyway, here is my
> modification.
>
> data have;
> input id start_date :mmddyy10. stop_date :mmddyy10.;
> format start_date stop_date yymmdd10.;
> cards;
> 1 1/1/10 1/30/10
> 1 1/30/10 2/15/10
> 1 2/13/10 3/15/10
> 1 3/16/10 3/30/10
> 2 5/1/10 5/30/10
> 2 7/1/10 7/15/10
> 3 7/15/10 7/30/10
> 4 5/1/10 7/30/10
> 4 7/1/10 7/15/10
> ;
> run;
> proc print;
> run;
>
> data want;
> do until(last.id);
> set have(rename=(start_date=_start stop_date=_stop));
> by id;
>
> if first.id then do;
> start_date=_start;
> stop_date=_stop;
> end;
> else if _start<=stop_date+1 then do;
> if _stop > stop_date then stop_date = _stop;
> end;
> else do;
> output;
> start_date=_start;
> stop_date=_stop;
> end;
> if last.id then output;
> end; /*end do until*/
>
> format start_date stop_date yymmdd10.;
> drop _: ;
> run;
>
> proc print;
> run;
>
> Hope this is helpful,
>
> Dan
>
> Daniel J. Nordlund
> Washington State Department of Social and Health Services
> Planning, Performance, and Accountability
> Research and Data Analysis Division
> Olympia, WA 98504-5204
>