Date: Fri, 27 Sep 2002 12:01:43 -0400
Reply-To: Sigurd Hermansen <HERMANS1@WESTAT.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Sigurd Hermansen <HERMANS1@WESTAT.COM>
Subject: Re: FW: Merging with most closely dated record
Content-Type: text/plain; charset="iso-8859-1"
I admire the example of how far you can take CASE ... WHEN ... clauses in a
SELECT statement. I wonder whether the SAS format (or, for SQL sans SAS, a
look-up table) or the formula would generalize best.
In the meantime, take a look at this attempt to take care of the case of
more that one personnel record dated within a calendar month interval:
proc sql;
select distinct * from
( select t1.ID,t1.Dt format=mmddyy10.,t2.recDt format=mmddyy10.
from Surveys as t1 left join Persons as t2
on t1.ID=t2.PID group by t1.ID
having put(intck('MONTH','01jan1960'd,recDt)-intck
('month','01jan1960'd,Dt),dif.)=
max(put(intck('MONTH','01jan1960'd,recDt)-intck
('month','01jan1960'd,Dt),dif.))
and not put(intck('MONTH','01jan1960'd,recDt)-intck
('month','01jan1960'd,Dt),dif.)='0'
)
group by ID,Dt having (Dt-recDt)=min(Dt-recDt)
;
quit;
This solution might still fail if two personnel records within the same
calendar month have different information that an extension of this program
might bring into the the yield of the inner query.
These solutions bring to the foreground the purpose of the 'RELVAR TYPE' of
a query (see Date and Darwin's recent book on database theory). We can
either control the dimensions of the yield of a query by controlling the
properties of data or by programming methods of collapsing multiple rows
into single rows. The latter takes much more skill and knowledge, I believe.
... better to control properties of data so simple solutions produce correct
results. Good database design and continuing constraints on updates,
deletions, and insertions controls essential properties of data.
Sig
-----Original Message-----
From: Howard_Schreier@ITA.DOC.GOV [mailto:Howard_Schreier@ITA.DOC.GOV]
Sent: Friday, September 27, 2002 11:27 AM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: FW: Merging with most closely dated record
I think defining a format to implement the decision rule is a good
technique. But in this case it's also possible to do it through a formula.
First, let's expand the test data to include things like mismatches,
multiple matches, and matches outside David's six-month window.
data svy;
input ssn $ srvy_dt datetime22.3 ;
format srvy_dt datetime.;
cards;
123123123 18JUN2001:00:00:00.000
123123123 18JUL2001:00:00:00.000
234234234 19JUN2001:00:00:00.000
345345345 20SEP2001:00:00:00.000
453453453 20SEP2001:00:00:00.000
567567567 29JUN2001:00:00:00.000
678678678 15OCT2001:00:00:00.000
678678678 16OCT2001:00:00:00.000
789789789 21NOV2001:00:00:00.000
890890890 11DEC2001:00:00:00.000
;
data prec;
input ssn $ fdate monyy5. uic $ ;
cards;
123123123 JUN01 68221
123123123 AUG01 68222
234234234 MAY01 74221
234234234 JUN01 74222
345345345 AUG01 85221
345345345 SEP01 85222
453453453 OCT01 95221
567567567 FEB01 00221
678678678 DEC01 11221
789789789 SEP01 22331
789789789 OCT01 22332
789789789 OCT01 22333
901901901 APR01 44441
;
Now here's a solution in one SQL statement;
proc sql;
select svy.*,
intck('month',datepart(srvy_dt),fdate) as apart,
(-4<calculated apart<3)*
( (mod(sign(calculated apart)+2,3)+1)*10
-abs(calculated apart)
) as howgood,
case when calculated howgood >0 then fdate end
as fdate format monyy7.,
case when calculated howgood >0 then uic end as uic
from svy left join prec
on svy.ssn=prec.ssn
group by svy.ssn
having howgood = max(howgood);
quit;
Result:
ssn srvy_dt apart howgood fdate uic
12312312 18JUN01:00:00:00 0 30 JUN2001 68221
23423423 19JUN01:00:00:00 0 30 JUN2001 74222
34534534 20SEP01:00:00:00 0 30 SEP2001 85222
45345345 20SEP01:00:00:00 1 9 OCT2001 95221
56756756 29JUN01:00:00:00 -4 0 .
67867867 15OCT01:00:00:00 2 8 DEC2001 11221
67867867 16OCT01:00:00:00 2 8 DEC2001 11221
78978978 21NOV01:00:00:00 -1 19 OCT2001 22333
78978978 21NOV01:00:00:00 -1 19 OCT2001 22332
89089089 11DEC01:00:00:00 . . .
The working variables (APART, HOWGOOD) can be shed via a DROP= dataset
option.
As Sig pointed out, if an individual has two personnel records for a given
month, additional rows will be propagated. Otherwise, there should be one
row for each observation in SVY.
On Thu, 26 Sep 2002 22:58:13 -0400, Sigurd Hermansen <hermans1@WESTAT.COM>
wrote:
>yes, database programming does seem complicated. Results depend on
>properties of data as well as the program.
>
>You may have to evaluate the properties of your data. A SQL solution to the
>problem as you stated it fails when the granularity of year/month places
>any two survey responses or personnel record updates less than a month
>apart in the same month. In that case both may match:
>
>data Surveys;
> ID=2; Dt=today(); output;
> ID=2; Dt=today()-31; output;
> ID=2; Dt=today()-61; output;
> ID=1; Dt=today()-31; output;
> ID=3; Dt=today()-31; output;
>run;
>data Persons;
> PID=1; recDt=today(); output;
> PID=1; recDt=today()-31; output;
> PID=2; recDt=today()-31; output;
> PID=3; recDt=today(); output;
> PID=3; recDt=today()-15; output;
>run;
>proc format;
> value dif
> 0='6'
> -1='5'
> -2='4'
> -3='3'
> 1='2'
> 2='1'
> otherwise '0'
> ;
>run;
>proc sql;
> select t1.ID,t1.Dt format=mmddyy10.,t2.recDt format=mmddyy10.
> from Surveys as t1 left join Persons as t2
> on t1.ID=t2.PID group by t1.ID
> having put(intck('MONTH','01jan1960'd,recDt)-intck
> ('month','01jan1960'd,Dt),dif.)=
> max(put(intck('MONTH','01jan1960'd,recDt)-intck
> ('month','01jan1960'd,Dt),dif.))
> and not put(intck('MONTH','01jan1960'd,recDt)-intck
> ('month','01jan1960'd,Dt),dif.)='0'
> ;
>quit;
>
>Sig
>
>On Thu, 26 Sep 2002 16:31:32 -0500, Alderton, David L. Ph.D.
><P13@PERSNET.NAVY.MIL> wrote:
>
>>Howard and Sig have pointed out the limitations of my description on the
>>constraints. Let me try and fill them in. Survey administration is
>>triggered by a personnel event, such as a promotion, transfer, or
>>separation. Many of the items refer to the context for making a decision.
>>Since a person is often being moved to a new command/location, it is
>>important that the survey be tied to the losing command (that providing
the
>>context) and not the gaining command. In almost all cases, the
appropriate
>>personnel record match will be the one from the same month, the next best
>>match would be from the month before, or the month before that (current, 1
>>back, 2 back, 3 back). In following that, the next best match would be
for
>>1 month ahead, followed by a less desirable 2 months ahead (1 up, 2 up).
>>That actually gives a 6 month time span with these priorities (current, 1
>>back, 2 back, 3 back, 1 up, 2 up) to find a match which is more than
>>sufficient... if no match occurs within this time frame than the SSN has
an
>>input error (intentional or accidental).
>>
>>Does this help frame the problem better. Unfortunately, I cannot come up
>>with a straightforward representation for this. Perhaps doing this twice,
>>once for (current, 1 back, 2 back, 3 back), then subsetting the records
for
>>those without a match, doing it again for (1 up, 2 up), then a final
subset
>>for those with no match (since I still want to keep the survey response
>>data).
>>
>>Gads, this is complicated which is why I'm seeking your guidance!
>>
>>Thanks, David.
>>
>>> -----Original Message-----
>>> From: Alderton, David L. Ph.D.
>>> Sent: Wednesday, September 25, 2002 4:07 PM
>>> To: 'SAS-L'
>>> Subject: Merging with most closely dated record
>>>
>>> I have a problem that I'm not sure how to approach. I have a
>continuously
>>> updated database of web survey responses. Some Sailors have taken the
>>> survey 2 or 3 times at different points in time. Each survey record is
>>> uniquely identified by SSN and the survey date (Srvy_DT is of the form
>>> '07APR2001:00:00:00.000'DT).
>>>
>>> I also have monthly snapshots from official personnel records from which
>I
>>> extract a Sailor's paygrade, current assignment location, marital
status,
>>> etc. Some of these variables change over time. The personnel records
>are
>>> "time stamped" simply as MMMYY (APR02). So what I want to do is merge a
>>> Sailor's personnel record that is closest in time, either in the same
>>> month or the month before the date the survey was administered; or if
>>> there is only one matching personnel record I want to use that.
>>>
>>> If the data are something like the following, what I want is for the 1st
>>> svy record to be married with the 1st prec record, the 2nd svy record to
>>> also be merged with the 1st prec record, the 3rd svy record to be merged
>>> with the 4th prec record, and the 4th svy record to be merged with the
>6th
>>> prec record, and 5th svy record to be merged with the only corresponding
>>> SSN in prec (7th record).
>>>
>>> Can someone help me frame an approach to this?
>>>
>>> Thanks, David.
>>> David L. Alderton, Ph.D.
>>> NPRST/PERS-1
>>>
>>> data svy;
>>> length ssn $ 9;
>>> input ssn $ srvy_dt datetime22.3 ;
>>> svy_my=datepart(srvy_dt);
>>> cards;
>>> 123123123 18JUN2001:00:00:00.000
>>> 123123123 18JUL2001:00:00:00.000
>>> 234234234 19JUN2001:00:00:00.000
>>> 345345345 20SEP2001:00:00:00.000
>>> 453453453 20SEP2001:00:00:00.000
>>> ;
>>> run;
>>> data prec;
>>> length ssn $ 9 uic $ 5 ;
>>> input ssn $ fdate monyy5. uic $ ;
>>> cards;
>>> 123123123 JUN01 68221
>>> 123123123 AUG01 68222
>>> 234234234 MAY01 74221
>>> 234234234 JUN01 74222
>>> 345345345 AUG01 85221
>>> 345345345 SEP01 85222
>>> 453453453 OCT01 95221
>>> ;
>>> run;
>>> ----------Desired outcome-------
>>> SSN srvy_dt fdate UIC
>>> 123123123 18JUN2001:00:00:00.000 JUN01 68221
>>> 123123123 18JUL2001:00:00:00.000 JUN01 68221
>>> 234234234 19JUN2001:00:00:00.000 JUN01 74222
>>> 345345345 20SEP2001:00:00:00.000 SEP01 85222
>>> 453453453 20SEP2001:00:00:00.000 OCT01 95221
|