LISTSERV at the University of Georgia
Date:         Tue, 2 Nov 2010 17:12:57 -0700
Reply-To:     Sterling Paramore <gnilrets@GMAIL.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Sterling Paramore <gnilrets@GMAIL.COM>
Subject:      Re: Collapsing date and coverage records with no gaps
In-Reply-To:  <201011022356.oA2IdVrF005393@willow.cc.uga.edu>
Content-Type: text/plain; charset=ISO-8859-1

Art,

This should work for your data (untested):

%Continuous_Spans_MultiCov(
   in_libdsn    = have,              /* INPUT dataset to process */
   out_libdsn   = want,              /* OUTPUT dataset with continuous spans */
   group_vars   = id,                /* INPUT space-delimited list of variables that
                                        identify a group for which to process
                                        continuous spans */
   cov_var      = cov,               /* INPUT name of coverage var - must NOT be
                                        part of group_vars */
   multicov_var = Span_MultiCov_Cd,  /* OUTPUT name of multicoverage code -
                                        multicov cd always in the order DMVX */
   start_var    = start_date,        /* INPUT the name of the variable in
                                        &in_libdsn that starts a span */
   end_var      = stop_date,         /* INPUT the name of the variable in
                                        &in_libdsn that ends a span */
   worklib      = WORK
);
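The macro source itself isn't in this thread, but the span collapse it performs can be sketched outside SAS. The sketch below (Python, with the hypothetical name `collapse_spans`) uses the breakpoint idea from this thread: every span start and every day-after-stop is a candidate cut point, each resulting interval gets the coverage letters in force (in DMVX order, per the macro comment), and adjacent intervals with identical codes are merged.

```python
from datetime import date, timedelta

def collapse_spans(records, cov_order="DMVX"):
    """Collapse per-coverage date spans for one group (e.g. one id) into
    continuous multi-coverage spans.

    records: list of (cov, start, stop) with inclusive dates.
    Returns a list of (multicov, start, stop) spans.
    """
    # Candidate breakpoints: every start, every day-after-stop.
    cuts = set()
    for _, s, e in records:
        cuts.add(s)
        cuts.add(e + timedelta(days=1))
    cuts = sorted(cuts)

    def active(day):
        # Coverage letters in effect on `day`, in canonical DMVX order.
        covs = {c for c, s, e in records if s <= day <= e}
        return "".join(c for c in cov_order if c in covs)

    spans = []
    for a, b in zip(cuts, cuts[1:]):
        code = active(a)        # constant over the whole interval [a, b)
        if not code:
            continue            # gap in coverage: emit nothing
        # Merge with the previous span when contiguous and unchanged.
        if spans and spans[-1][0] == code and spans[-1][2] + timedelta(days=1) == a:
            spans[-1] = (code, spans[-1][1], b - timedelta(days=1))
        else:
            spans.append((code, a, b - timedelta(days=1)))
    return spans
```

On the thread's id 2 rows (M 1/1/10-8/30/10, X 7/1/10-12/30/10) this produces M alone through June, MX for the July-August overlap, then X alone through December 30.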

Yesterday I compared the speed of my method with Richard's interpretation of the Rhoads method, and they were almost identical (mine won by about 5%, but I only ran the test once). That said, I think some of the ideas Rhoads presented could probably help improve my code quite a bit.

-Sterling

On Tue, Nov 2, 2010 at 4:56 PM, Arthur Tabachneck <art297@netscape.net> wrote:

> Dan, Sterling, and anyone else who might be interested:
>
> Dan: I have to disagree (although, of course, I'm disagreeing with myself
> as well!). Your method, with some minor revisions, actually runs 4 times
> faster than Richard's/Mike's proposed method.
>
> The problem with the original method was simply that it required too much
> unnecessary processing time. Of course, I don't guarantee that my
> revisions correctly match your original code's outcomes, and my revisions
> most likely could be optimized even further than I was able to take them.
>
> Given 4 years' worth of data on about 500,000 patients, Richard's/Mike's
> solution took almost 13 minutes, while my revision of your code took just
> over 3 minutes, including a necessary sort.
>
> My revised code was:
>
> data have (keep=id cov start_date stop_date);
>   length cov $1;
>   input id cov $ start_date :mmddyy10. stop_date :mmddyy10.;
>   format start_date stop_date yymmdd10.;
>   output;
>   original_id=id;
>   do i=4 to 1500000 by 3;
>     id=original_id+i;
>     output;
>   end;
>   cards;
> 1 M 1/1/10 6/30/10
> 1 M 7/1/10 12/31/10
> 1 X 4/1/10 6/30/10
> 1 X 7/1/10 12/31/10
> 2 M 1/1/10 8/30/10
> 2 X 7/1/10 12/30/10
> 3 M 1/1/10 12/30/10
> 3 D 2/1/10 6/30/10
> 3 X 4/1/10 12/30/10
> 1 M 1/1/09 6/30/09
> 1 M 7/1/09 12/31/09
> 1 X 4/1/09 6/30/09
> 1 X 7/1/09 12/31/09
> 2 M 1/1/09 8/30/09
> 2 X 7/1/09 12/30/09
> 3 M 1/1/09 12/30/09
> 3 D 2/1/09 6/30/09
> 3 X 4/1/09 12/30/09
> 1 M 1/1/08 6/30/08
> 1 M 7/1/08 12/31/08
> 1 X 4/1/08 6/30/08
> 1 X 7/1/08 12/31/08
> 2 M 1/1/08 8/30/08
> 2 X 7/1/08 12/30/08
> 3 M 1/1/08 12/30/08
> 3 D 2/1/08 6/30/08
> 3 X 4/1/08 12/30/08
> 1 M 1/1/07 6/30/07
> 1 M 7/1/07 12/31/07
> 1 X 4/1/07 6/30/07
> 1 X 7/1/07 12/31/07
> 2 M 1/1/07 8/30/07
> 2 X 7/1/07 12/30/07
> 3 M 1/1/07 12/30/07
> 3 D 2/1/07 6/30/07
> 3 X 4/1/07 12/30/07
> ;
>
> proc sort data=have;
>   by id;
> run;
>
> **----get all dates and values needed-----**;
> proc sql noprint;
>   select max(obs) into :obs
>     from (select count(*) as obs
>             from have
>             group by id)
>   ;
>
>   select distinct cov, min(start_date), max(stop_date)
>     into :cat separated by '', :begin, :end
>     from have
>   ;
> quit;
>
> data want(keep=id start_date stop_date multicov);
>   set have;
>   by id;
>   **----temporary array is automatically retained----**;
>   array cov_a[&begin:&end] $%sysfunc(length(&cat.)) _temporary_;
>   array dates[%eval(&obs.*4)];
>   array st_dates[%eval(&obs.)];
>   array sp_dates[%eval(&obs.)];
>   array covs[%eval(&obs.)] $1;
>   retain dates st_dates sp_dates covs;
>
>   if first.id then do;
>     call missing(of dates(*), of st_dates(*), of sp_dates(*));
>     counter=0;
>   end;
>
>   counter+1;
>   st_dates(counter)=start_date;
>   sp_dates(counter)=stop_date;
>   covs(counter)=cov;
>
>   if last.id then do;
>
>     do _k_=1 to counter;
>       dates((_k_-1)*4+1)=st_dates(_k_);
>       dates((_k_-1)*4+2)=sp_dates(_k_);
>       if st_dates(_k_) gt min(of st_dates(*)) then
>         dates((_k_-1)*4+3)=st_dates(_k_)-1;
>       if sp_dates(_k_) lt max(of sp_dates(*)) then
>         dates((_k_-1)*4+4)=sp_dates(_k_)+1;
>     end;
>
>     call sortn(of dates[*]);
>     do _k_=nmiss(of dates(*))+1 to &obs.*4-1;
>       if dates(_k_) eq dates(_k_+1) then call missing(dates(_k_));
>     end;
>     call sortn(of dates[*]);
>
>     do _i_=1 to &obs.*4;
>       if not missing(dates(_i_)) then do;
>         cov_a[dates(_i_)]='';
>         do _k_=1 to %sysfunc(length(&cat.));
>           cov_a[dates(_i_)]=cov_a[dates(_i_)]||' ';
>         end;
>         do _k_=1 to counter;
>           if st_dates(_k_) le dates(_i_) le sp_dates(_k_) then do;
>             substr(cov_a[dates(_i_)],index("&cat.",covs[_k_]),1)=covs[_k_];
>           end;
>         end;
>       end;
>     end;
>
>     do _n_=nmiss(of dates(*))+1 to &obs.*4-1;
>       start_date=dates(_n_);
>       do _k_=_n_+1 to &obs.*4;
>         if cov_a[dates[_n_]] NE cov_a[dates[_k_]] then do;
>           stop_date=dates[_k_-1];
>           multicov=compress(cov_a[dates[_k_-1]]);
>           if multicov NE '' then output;
>           _n_=_k_-1;
>           _k_=&obs.*4;
>         end;
>         else if _k_ eq &obs.*4 then do;
>           stop_date=dates[_k_];
>           multicov=compress(cov_a[dates[_k_]]);
>           if multicov NE '' then output;
>           _n_=_k_-1;
>         end;
>       end;
>     end;
>   end;
>   format start_date stop_date date: mmddyy10.;
> run;
>
> Thanks to an offline note from Howard, some of the proc sql's required
> time was shaved.
>
> Sterling: I'd still like you to post your actual macro call statement, as
> I'd like to see how your original solution compares with both the
> Dan/mine and Richard/Mike solutions.
>
> Art
> -------
> On Sun, 31 Oct 2010 16:51:12 -0700, Daniel Nordlund
> <djnordlund@FRONTIER.COM> wrote:
>
> >> -----Original Message-----
> >> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
> >> Daniel Nordlund
> >> Sent: Saturday, October 30, 2010 1:08 PM
> >> To: SAS-L@LISTSERV.UGA.EDU
> >> Subject: Re: Collapsing date and coverage records with no gaps
> >>
> >> > -----Original Message-----
> >> > From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
> >> > Sterling Paramore
> >> > Sent: Saturday, October 30, 2010 9:26 AM
> >> > To: SAS-L@LISTSERV.UGA.EDU
> >> > Subject: Re: Collapsing date and coverage records with no gaps
> >> >
> >> > I thought about converting coverage spans into distinct days of
> >> > coverage like these solutions, but my enrollment data is already
> >> > pretty large. I just estimated that if I did that, I'd end up having
> >> > to process 2.2 billion records, rather than the 1.5 million that I
> >> > have (500,000 members X 3 coverage types X 4 years). I look forward
> >> > to trying your solution when I get back to work Monday.
> >> >
> >>
> >> Sterling,
> >>
> >> I have not benchmarked my approach against any other options, so I am
> >> not going to make any claims for whether it is better in some sense
> >> than the others that have been suggested. But let me correct an
> >> apparent misunderstanding. No new records need to be created. If you
> >> have 1.5 million records, you will only need to read your 1.5 million
> >> records once. The array used for holding four years of eligibility
> >> will only take 365 days * 4 years * 3 characters (coverage type), or
> >> about 4,380 bytes of memory. You could handle 20 years of eligibility
> >> in less than 22,000 bytes. Art's suggestion of using SQL to get the
> >> earliest and latest dates in your file would allow you to tailor the
> >> size of the coverage array to fit a particular span of dates.
> >> However, I am not sure that making an extra pass through a large file
> >> is worth the time. All you will be saving is a small amount of memory
> >> and a little array processing time. So I would just make the array
> >> longer than necessary.
> >>
> >> Whatever approach you choose, best of luck in wrestling your
> >> eligibility data to the ground.
> >>
> >
> > Well, given the tentative results reported by Art and some benchmarking
> > that I have done, it doesn't look to me like the approach that I
> > suggested scales as well as I had hoped. At this point I would be
> > inclined to go with the Rhoads method as interpreted by Richard Allen
> > unless someone comes up with a faster, clearer approach.
> >
> > Dan
> >
> > Daniel Nordlund
> > Bothell, WA USA
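The day-per-slot approach Dan describes above (an array indexed by calendar day, each slot holding the coverage letters in force that day, then runs of identical codes collapsed into spans) can be sketched outside SAS as well. The sketch below is Python, not Dan's actual DATA step; `EPOCH`, `NDAYS`, and the function name are assumptions chosen to match his sizing estimate of roughly four years of days.

```python
from datetime import date, timedelta

# Assumed bounds: earliest date in the file, and enough slots for ~4 years.
EPOCH = date(2007, 1, 1)
NDAYS = 4 * 366

def day_array_spans(records, cov_order="DMVX"):
    """One array slot per calendar day holding that day's coverage letters;
    runs of identical codes are then collapsed into (code, start, stop) spans.

    records: list of (cov, start, stop) for one group, inclusive dates.
    """
    slots = [""] * NDAYS
    for cov, start, stop in records:
        for d in range((start - EPOCH).days, (stop - EPOCH).days + 1):
            if cov not in slots[d]:
                # Keep the letters of each day's code in canonical order.
                slots[d] = "".join(c for c in cov_order if c in slots[d] + cov)

    spans, run_start, run_code = [], 0, ""
    for d in range(NDAYS + 1):              # one extra pass flushes the last run
        code = slots[d] if d < NDAYS else ""
        if code != run_code:
            if run_code:                     # a run just ended: emit its span
                spans.append((run_code,
                              EPOCH + timedelta(days=run_start),
                              EPOCH + timedelta(days=d - 1)))
            run_start, run_code = d, code
    return spans
```

The memory profile matches Dan's arithmetic: about 1,500 slots of a few characters each, on the order of his 365 * 4 * 3 = 4,380-byte estimate, regardless of how many input records are read.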

