Date: Wed, 17 Jul 2002 18:57:26 -0400
Reply-To: B H R <rubarb@DESPAMMED.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: B H R <rubarb@DESPAMMED.COM>
Subject: Re: merging individuals by family
Thanks, Jim and Ian, for your attempts to help me with this problem.
I'm sorry I wasn't completely clear enough; if I may follow up:
If I take the first part of Ian's method for creating LOOK from XFAM:
proc sql ;
create table look as select a.* , b.*
from a , b , xfam where a.household = xfam.household
and b.family = xfam.family ;
I can limit my search to the possible good connections -- but the rest of
the code:
and (( spedis (a.firstname , b.first) < 25 and spedis (a.lastname ,
b.last ) < 25 ) or ( spedis (a.firstname , b.first) < 25
and abs( a.dob - b.dob ) < 10 ))
doesn't actually select the mothers I was looking for --it's their
relations who have matches ...and that's what I want to base my mother
match on : HHs where more than one member points to the same family id.
If I save the good match measure to a variable (testfit) like:
data look; set look;
testfit=sum (soundex(afirst)=soundex(bfirst),
soundex(alast)=soundex(blast) ,
year(adob)=year(bdob),
month(adob)=month(bdob),
day(adob)=day(bdob),
0<abs( adob - bdob ) < 10 )
);
followed by:
proc means noprint data=look;var testsum;
output out=out1 n=testfit;
by household;
where testsum ge 4;
for HHs where 2 or more individuals have totsum ge 4 I can accept a lower
value of totsum, say 2, for concluding that the mother matches.
HOWEVER, this method depends on pre-finding (creating xfam) matches for
everyone in one of the datasets.
I was hoping to develop SAS code that lets me do the whole thing on the
fly. For each mother, I would collect the first names, last names and
birthdates of the HH members; then as I read the other file, whenever I
came across an individual with the mother's first or last name I would
similarly collect all of their family members' attributes and then
compare the two groups.
I think this would look somewhat like what Jim proposed in his
"intelligent method" except that 1)the Cartesian join would be conditional
and 2) I think I would need to create arrays of individuals as the data are read in for each dataset. For each of these issues I would appreciate help in coding (if it's possible AND if it's a reasonable approach when working
with these fairly large datasets).
(Also, alternatively, could I use all of the firstnames and lastnames in the household as blocking factors in comparing to the other dataset rather than just the mother? )
Thanks, again, for any help anyone can provide.
-Barbara
On Sun, 14 Jul 2002 14:54:47 -0400, Ian Whitlock <WHITLOI1@WESTAT.COM> wrote:
>Barbara,
>
>I would begin by trying to get a link between household and family. In
>general SQL will prove the best environment for a solution to your problem,
>since one can define matching without having equality.
>
>For the first pass you might assume somebody will be completely consistent
>in both groups. In that case consider
>
>DATA A ; input
>HOUSEHOLD PERSON LASTNAME $ @29 FIRSTNAME $char12. DOB mmddyy8. SEX
>:$1. ;
>cards ;
>1 123 SMITH ANNA MISSING F
>1 125 SMITH MICHAEL 12/12/00 M
>1 145 GREEN HOMAS 7/17/96 M
>15 523 HARRIS HELEN 12/9/72 F
>15 212 SMITHERS MICHAEL 12/12/00 M
>15 302 HARRIS SUSAN 4/29/00 F
>15 307 HARRIS TIM 1/2/01 M
>;
>
>DATA B ; input
>FAMILY PERSON $ LAST $ @31 FIRST $char12. DOB mmddyy8. SEX :$1. ;
>cards ;
>A44 AB2 SMITH ANNA MAE 12/3/85 F
>A44 EF6 SMITH STEPHANIE 3/15/95 F
>A44 CD12 SMITH MICHAEL 12/12/00 M
>A44 GH6 GREEN THOMAS 7/17/96 M
>B65 JK1 SMITHERS HELENE 1/1/72 F
>B65 LM5 SMITHERS MICHAEL 12/12/00 M
>B65 NO7 SMITHERS SUSAN 4/28/00 F
>B65 PQZ SMITHERS TIM 1/2/01 M
>B65 RSS SMITHERS TINA 8/11/01 M
>;
>
>proc sql ;
> create table xfam as
> select distinct household , family
> from a , b
> where a.last=b.last
> and a.first=b.first
> and a.dob = b.dob
> ;
>quit ;
>
>Now you might proceed with:
>
>proc sql ;
> create table look as
> select a.* , b.*
> from a , b , xfam
> where a.household = xfam.household
> and b.family = xfam.family
> and (( spedis (a.firstname , b.first) < 25
> and spedis (a.lastname , b.last ) < 25 )
> or ( spedis (a.firstname , b.first) < 25
> and abs( a.dob - b.dob ) < 10 ))
> ;
>quit ;
>
>Some mother matching should be caught by this code or some variation of it.
>Hopefully the number of hard core holdouts will be small enough for
>eyeballing. Add HOUSEHOLD to B and print information for the problem
>groups.
>
>It will take a lot more work, but at least you have seen some of the tools
>commonly used in attacking this sort of problem.
>
>You might wonder why produce XFAM first. XFAM is used as a blocking factor
>to reduce the number of records where the more complex matching clauses have
>to be employed. In other words, it helps to weed out the trash.
>
>IanWhitlock@westat.com
>
>-----Original Message-----
>From: B H R [mailto:rubarb@DESPAMMED.COM]
>Sent: Wednesday, July 10, 2002 12:34 PM
>To: SAS-L@LISTSERV.UGA.EDU
>Subject: merging individuals by family
>
>
>I am working with two somewhat dirty datasets that are similarly structured:
>
>DATA Set A example:
>HOUSEHOLD PERSON# LASTNAME FIRSTNAME DOB SEX
>1 123 SMITH ANNA MISSING F
>1 125 SMITH MICHAEL 12/12/00 M
>1 145 GREEN HOMAS 7/17/96 M
>
>15 523 HARRIS HELEN 12/9/72 F
>15 212 SMITHERS MICHAEL 12/12/00 M
>15 302 HARRIS SUSAN 4/29/00 F
>15 307 HARRIS TIM 1/2/01 M
>
>DATA SET B example:
>FAMILY PERSON LAST FIRST DOB SEX
>A44 AB2 SMITH ANNA MAE 12/3/85 F
>A44 EF6 SMITH STEPHANIE 3/15/95 F
>A44 CD12 SMITH MICHAEL 12/12/00 M
>A44 GH6 GREEN THOMAS 7/17/96 M
>
>B65 JK1 SMITHERS HELENE 1/1/72 F
>B65 LM5 SMITHERS MICHAEL 12/12/00 M
>B65 NO7 SMITHERS SUSAN 4/28/00 F
>B65 PQZ SMITHERS TIM 1/2/01 M
>B65 RSS SMITHERS TINA 8/11/01 M
>
>I am trying to find a group of mothers from the larger dataset A
>(n=388,000) who I believe are in the smaller dataset B (n=160,000).
>There is only one mother per household. For those I have been unable
>to identify by using only their own names and birth dates I would like
>to use information about the other family/household members to find
>them. So, in my example data above, I would like to link Anna Smith
>in Dataset A, (person # 123) and Anna Mae Smith (personid AB2) in the
>Dataset B, because not only is her own data fairly consistent, but
>there are also two other people, Michael and Thomas, in both datasets
>who have good matches; similarly for Helen Harris and Helene Smithers.
>
>Can anyone suggest a good method using SAS to compare persons within
>households?
>
>When I do this manually, I pick a subset from the other dataset that
>seems to contain likely suspects, e.g. all persons in cases where
>someone meets the criterion of LAST IN ('SMITH', 'GREEN'); or
>SOUNDEX(FIRST)=SOUNDEX('HELEN'); my main problem in thinking about
>writing SAS code is how to cycle through all the members within a
>family/household group in each dataset (unless someone has another
>thought on how to approach the problem).
>
>My more general question is how to compare everyone in the
>two datasets by family. Is there anything different about the larger
>merge (other than that the hh/family ids are available when one
>individual in a family was matchable by individual person info alone)?
>
>Thanks for any help anyone can provide!
>
>- Barbara
|