|
On Tue, 10 Jul 2007 22:15:30 +1000, David Johnson <d@DKVJ.BIZ> wrote:
>With a set of dated records from another industry, we found the only way to
>reliably adjust the dates and model the period (given we had a lot of
>missing information) was to build an array. Each record dealt with one
>individual and the array elements held the dates. Multiple arrays are
>needed for the start and end dates, an identifier and a flag to record when
>a record had been adjusted. By stepping across the array, it was possible
>then to impute the missing or incorrect values.
>
>Once that was done, the data was restored to a record-wise structure by
>deconstructing the array. I haven't got the code, it belongs to the
client,
>but can give you some indicators.
>
>
>Firstly you need to know the maximum size of the array, so run some SQL to
>get the Max( Count( *) ) for any individual. That value stored in a
>macro symbol is used to define the maximum dimension of your array.
>
>Then in a data step you By-process the data, writing each record to the
next
>array positions, and when you get the end of the by group, you write the
>record out.
>
>With the new table, apply your rules across the array elements and set the
>"fix" flag when you have adjusted a date so you don't chase the same thing
>again. You should be able to walk across the array and verify your date
>sequences are correct before stepping onto the next individual. If you
have
>issues with the final data, you can also write the record to a "PROBLEMS"
>table to review manually.
>
>Kind regards
>
>David
>
>-----Original Message-----
>From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU]On Behalf Of Ole
>Rogeberg
>Sent: Tuesday, 10 July 2007 7:47 PM
>To: SAS-L@LISTSERV.UGA.EDU
>Subject: "Aligning" (comparing and altering) information across
>different lines
>
>
>Does anyone know of a good way to identify overlaps between time periods
>recorded in different lines of data, and manipulating these dates according
>to a set of rules?
>
>I have a large dataset containing work records (more than one million in
>total) for a group of individuals, and one individual may have several
lines
>of data (different work records). Each work record has a start date and an
>end-date, but these are not always set accurately. As a result, they may
>overlap, in which case I want to apply a set of rules to change the
recorded
>dates. For instance, a date in the middle of a year (e.g., 1 june) is more
>credible than beginning or end of year (1 january og 31 december).
>
>The problems, as I see them, are for instance that
>-information in a later line may require an earlier line for the same
>individual to be recorded. This makes it difficult to use a
>single-run-through using an "if first.idnr" construction
>-The period in record 2 for an individual may overlap with record 3 even if
>record 1 does not overlap with either of them. This means that it is
>insufficient to compare the first period with all the later periods.
>-The period in a record may overlap with more than one later record, and
the
>overlap need not be in the same place.
>
>
>The immediate ideas I had for solving the problem were rather messy, so I
>would be interested in hearing if anyone else has come across a similar
>question and have found simple ways of dealing with it (specialized SAS
>functions or procedures, clever loops etc.).
>
>Sincerely,
>
>Ole Rogeberg
Hei Ole, Hi david
I was thinking the same thing, see if this is a useful illustration:
data workers;
length fnr $11;
input fnr $ start :date9. end :date9. blablabla $;
cards;
20037621021 01JAN2006 31MAR2006 blablabl
20037621021 02APR2006 13AUG2006 blablabl
20037621021 01AUG2006 31DEC2006 blablabl
20038021021 01AUG2006 31JAN2007 blablabl
20038021021 01JAN2007 31DEC2007 blablabl
;
run;
data temp;
i=0;
array startdate [10];
array enddate [10];
array job [10];
array notes [10] $100;
do until (last.fnr); *get dates on one line;
i+1;
set workers (keep=fnr start end);
by fnr;
if first.fnr then jobnr=0;
jobnr+1; *adds a jobnumber;
job[i]=jobnr;
startdate[i]=start;
enddate[i]=end;
end;
do ii=1 to 10;
if startdate[ii]=. then leave;
/* rule 1*/
if 1<ii and startdate[ii] <= enddate[ii-1] then do; *
Sample rule: If startdate <= previous job's enddate, change it..;
notes[ii] = 'Startdate changed from '||put(startdate
[ii], date9.)|| ' to '||put(enddate[ii-1]+1, date9.);
startdate[ii] = enddate[ii-1]+1;
end;
/* rule 2*/
/* blablabla...*/
end;
do i=1 to 10; *bring it back;
if startdate[i]=. then leave;
start = startdate[i];
end = enddate[i];
note = notes[i];
jobnr=job[i];
output;
end;
run;
proc print data=temp;
var fnr jobnr start end note;
format start end date9.;
run;
*merge back dates on fnr og jobnr.
|