Date: Tue, 2 Nov 2010 17:12:57 -0700
Reply-To: Sterling Paramore <gnilrets@GMAIL.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Sterling Paramore <gnilrets@GMAIL.COM>
Subject: Re: Collapsing date and coverage records with no gaps
In-Reply-To: <201011022356.oA2IdVrF005393@willow.cc.uga.edu>
Content-Type: text/plain; charset=ISO-8859-1
Art,
This should work for your data (untested):
%Continuous_Spans_MultiCov(
in_libdsn = have, /* INPUT dataset to process */
out_libdsn = want, /* OUTPUT dataset with continuous spans */
group_vars = id, /* INPUT space-delimited list of variables that
identify a group for which to process continuous spans */
cov_var = cov, /* INPUT name of coverage var - must NOT be part of
group_vars*/
multicov_var = Span_MultiCov_Cd, /* OUTPUT name of multicoverage code -
multicov cd always in the order DMVX */
start_var = start_date, /* INPUT the name of the variable in &in_libdsn
that starts a span */
end_var = stop_date, /* INPUT the name of the variable in &in_libdsn
that ends a span */
worklib = WORK
);
Yesterday I compared the speed of my method with Richard's interpretation of
the Rhoads method, and they were almost identical (mine won by about 5%, but
I only ran the test once). Although, I think some of the ideas Rhoads
presented could probably help improve my code quite a bit.
-Sterling
On Tue, Nov 2, 2010 at 4:56 PM, Arthur Tabachneck <art297@netscape.net>wrote:
> Dan, Sterling, and anyone else who might be interested:
>
> Dan: I have to disagree (although, of course, I'm disagreeing with myself
> as well!). Your method, with some minor revisons, actually runs 4 times
> faster than the Richard's/Mike's proposed method.
>
> The problem with the original method was simply that it required too much
> unnecessary processing time. Of course, I don't guarantee that my
> revisions correctly match your original code's outcomes, and my revisions
> most likely could be optimized even further than I was able to take them.
>
> Given 4 years worth of data, on about 500,000 patients, Richard's/Mikes
> solution took almost 13 minutes, while my revision of your code took just
> over 3 minutes, including a necessary sort.
>
> My revised code was:
>
> data have (keep=id cov start_date stop_date);
> length cov $1;
> input id cov $ start_date :mmddyy10. stop_date :mmddyy10.;
> format start_date stop_date yymmdd10.;
> output;
> original_id=id;
> do i=4 to 1500000 by 3;
> id=original_id+i;
> output;
> end;
> cards;
> 1 M 1/1/10 6/30/10
> 1 M 7/1/10 12/31/10
> 1 X 4/1/10 6/30/10
> 1 X 7/1/10 12/31/10
> 2 M 1/1/10 8/30/10
> 2 X 7/1/10 12/30/10
> 3 M 1/1/10 12/30/10
> 3 D 2/1/10 6/30/10
> 3 X 4/1/10 12/30/10
> 1 M 1/1/09 6/30/09
> 1 M 7/1/09 12/31/09
> 1 X 4/1/09 6/30/09
> 1 X 7/1/09 12/31/09
> 2 M 1/1/09 8/30/09
> 2 X 7/1/09 12/30/09
> 3 M 1/1/09 12/30/09
> 3 D 2/1/09 6/30/09
> 3 X 4/1/09 12/30/09
> 1 M 1/1/08 6/30/08
> 1 M 7/1/08 12/31/08
> 1 X 4/1/08 6/30/08
> 1 X 7/1/08 12/31/08
> 2 M 1/1/08 8/30/08
> 2 X 7/1/08 12/30/08
> 3 M 1/1/08 12/30/08
> 3 D 2/1/08 6/30/08
> 3 X 4/1/08 12/30/08
> 1 M 1/1/07 6/30/07
> 1 M 7/1/07 12/31/07
> 1 X 4/1/07 6/30/07
> 1 X 7/1/07 12/31/07
> 2 M 1/1/07 8/30/07
> 2 X 7/1/07 12/30/07
> 3 M 1/1/07 12/30/07
> 3 D 2/1/07 6/30/07
> 3 X 4/1/07 12/30/07
> ;
>
> proc sort data=have;
> by id;
> run;
>
> **----get all dates and values needed-----**;
> proc sql noprint;
> select max(obs) into :obs
> from (select count(*) as obs
> from have
> group by id)
> ;
>
> select distinct cov, min(start_date) , max(stop_date)
> into :cat separated by '', :begin, :end
> from have
> ;
> quit;
>
> data want(keep=id start_date stop_date multicov);
> set have;
> by id;
> **----temporary array is automatically retained----**;
> array cov_a[&begin:&end] $%sysfunc(length(&cat.)) _temporary_ ;
> array dates[%eval(&obs.*4)];
> array st_dates[%eval(&obs.)];
> array sp_dates[%eval(&obs.)];
> array covs[%eval(&obs.)] $1;
> retain dates st_dates sp_dates covs;
>
> if first.id then do;
> call missing(of dates(*),of st_dates(*),of sp_dates(*));
> counter=0;
> end;
>
> counter+1;
> st_dates(counter)=start_date;
> sp_dates(counter)=stop_date;
> covs(counter)=cov;
>
> if last.id then do;
>
> do _k_=1 to counter;
> dates((_k_-1)*4+1)=st_dates(_k_);
> dates((_k_-1)*4+2)=sp_dates(_k_);
> if st_dates(_k_) gt min(of st_dates(*)) then
> dates((_k_-1)*4+3)=st_dates(_k_)-1;
> if sp_dates(_k_) lt max(of sp_dates(*)) then
> dates((_k_-1)*4+4)=sp_dates(_k_)+1;
> end;
>
> call sortn (of dates[*]);
> do _k_=nmiss(of dates(*))+1 to &obs.*4-1;
> if dates(_k_) eq dates(_k_+1) then call missing(dates(_k_));
> end;
> call sortn (of dates[*]);
>
> do _i_= 1 to &obs.*4;
> if not missing(dates(_i_)) then do;
> cov_a[dates(_i_)]='';
> do _k_=1 to %sysfunc(length(&cat.));
> cov_a[dates(_i_)]=cov_a[dates(_i_)]||' ';
> end;
> do _k_=1 to counter;
> if st_dates(_k_) le dates(_i_) le sp_dates(_k_) then do;
> substr(cov_a[dates(_i_)],index("&cat.",covs[_k_]),1) =
> covs[_k_];
> end;
> end;
> end;
> end;
>
> do _n_ = nmiss(of dates(*))+1 to &obs.*4-1;
> start_date = dates(_n_);
> do _k_ = _n_+1 to &obs.*4;
> if cov_a[dates[_n_]] NE cov_a[dates[_k_]] then do;
> stop_date = dates[_k_-1];
> multicov = compress(cov_a[dates[_k_-1]]);
> if multicov NE '' then output;
> _n_=_k_-1;
> _k_=&obs.*4;
> end;
> else if _k_ eq &obs.*4 then do;
> stop_date = dates[_k_];
> multicov = compress(cov_a[dates[_k_]]);
> if multicov NE '' then output;
> _n_=_k_-1;
> end;
> end;
> end;
> end;
> format start_date stop_date date: mmddyy10.;
> run;
>
> Thanks to an offline note from Howard, some of the proc sql's required
> time was shaved.
>
> Sterling: I'd still like you to post your actual macro call statement, as
> I'd like to see how your original solution compares with both the Dan/mine
> and Richard/Mike solutions.
>
> Art
> -------
> On Sun, 31 Oct 2010 16:51:12 -0700, Daniel Nordlund
> <djnordlund@FRONTIER.COM> wrote:
>
> >> -----Original Message-----
> >> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
> >> Daniel Nordlund
> >> Sent: Saturday, October 30, 2010 1:08 PM
> >> To: SAS-L@LISTSERV.UGA.EDU
> >> Subject: Re: Collapsing date and coverage records with no gaps
> >>
> >> > -----Original Message-----
> >> > From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
> >> > Sterling Paramore
> >> > Sent: Saturday, October 30, 2010 9:26 AM
> >> > To: SAS-L@LISTSERV.UGA.EDU
> >> > Subject: Re: Collapsing date and coverage records with no gaps
> >> >
> >> > I thought about converting coverage spans into distinct days of
> coverage
> >> > like these solutions, but my enrollment data is already pretty
> large. I
> >> > just estimated that if I did that, I'd end up having to process 2.2
> >> > billion
> >> > records, rather than the 1.5 million that I have (500,000 members X 3
> >> > coverage types X 4 years). I look forward to trying your solution
> when
> >> I
> >> > get back to work Monday.
> >> >
> >>
> >> Sterling,
> >>
> >> I have not benchmarked my approach against any other options, so I am
> not
> >> going to make any claims for whether it is better in some sense than the
> >> others that have been suggested. But let me correct an apparent
> >> misunderstanding. No new records need to be created. If you have 1.5
> >> million records, you will only need to read your 1.5 million records
> once.
> >> The array used for holding four years of eligibility will only take 365
> >> days * 4 years * 3 characters (coverage type), or about 4380 bytes of
> >> memory. You could handle 20 years of eligibility in less than 22,000
> >> bytes. Art's suggestion of using SQL to get the earliest and latest
> dates
> >> in your file would allow you to tailor the size of the coverage array to
> >> fit a particular span of dates. However, I am not sure that making an
> >> extra pass through a large file is worth the time. All you will be
> saving
> >> is a small amount of memory and a little array processing time. So I
> >> would just make the array longer than necessary.
> >>
> >> Whatever approach you choose, best of luck in wrestling your eligibility
> >> data to the ground.
> >>
> >
> >Well, given the tentative results reported by Art and some benchmarking
> that I have done, it doesn't look to me like the approach that I suggested
> scales as well as I had hoped. At this point I would be inclined to go
> with the Rhoads method as interpreted by Richard Allen unless someone
> comes up with a faster, clearer approach.
> >
> >Dan
> >
> >Daniel Nordlund
> >Bothell, WA USA
>
|