LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous (more recent) messageNext (less recent) messagePrevious (more recent) in topicNext (less recent) in topicPrevious (more recent) by same authorNext (less recent) by same authorPrevious page (July 2002, week 3)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Wed, 17 Jul 2002 18:57:26 -0400
Reply-To:     B H R <rubarb@DESPAMMED.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         B H R <rubarb@DESPAMMED.COM>
Subject:      Re: merging individuals by family

Thanks, Jim and Ian, for your attempts to help me with this problem. I'm sorry I wasn't completely clear enough; if I may follow up:

If I take the first part of Ian's method for creating LOOK from XFAM: proc sql ; create table look as select a.* , b.* from a , b , xfam where a.household = xfam.household and b.family = xfam.family ; I can limit my search to the possible good connections -- but the rest of the code: and (( spedis (a.firstname , b.first) < 25 and spedis (a.lastname , b.last ) < 25 ) or ( spedis (a.firstname , b.first) < 25 and abs( a.dob - b.dob ) < 10 )) doesn't actually select the mothers I was looking for --it's their relations who have matches ...and that's what I want to base my mother match on : HHs where more than one member points to the same family id.

If I save the good match measure to a variable (testfit) like: data look; set look; testfit=sum (soundex(afirst)=soundex(bfirst), soundex(alast)=soundex(blast) , year(adob)=year(bdob), month(adob)=month(bdob), day(adob)=day(bdob), 0<abs( adob - bdob ) < 10 ) );

followed by: proc means noprint data=look;var testsum; output out=out1 n=testfit; by household; where testsum ge 4;

for HHs where 2 or more individuals have totsum ge 4 I can accept a lower value of totsum, say 2, for concluding that the mother matches.

HOWEVER, this method depends on pre-finding (creating xfam) matches for everyone in one of the datasets. I was hoping to develop SAS code that lets me do the whole thing on the fly. For each mother, I would collect the first names, last names and birthdates of the HH members; then as I read the other file, whenever I came across an individual with the mother's first or last name I would similarly collect all of their family members' attributes and then compare the two groups.

I think this would look somewhat like what Jim proposed in his "intelligent method" except that 1)the Cartesian join would be conditional and 2) I think I would need to create arrays of individuals as the data are read in for each dataset. For each of these issues I would appreciate help in coding (if it's possible AND if it's a reasonable approach when working with these fairly large datasets).

(Also, alternatively, could I use all of the firstnames and lastnames in the household as blocking factors in comparing to the other dataset rather than just the mother? )

Thanks, again, for any help anyone can provide. -Barbara

On Sun, 14 Jul 2002 14:54:47 -0400, Ian Whitlock <WHITLOI1@WESTAT.COM> wrote:

>Barbara, > >I would begin by trying to get a link between household and family. In >general SQL will prove the best environment for a solution to your problem, >since one can define matching without having equality. > >For the first pass you might assume somebody will be completely consistent >in both groups. In that case consider > >DATA A ; input >HOUSEHOLD PERSON LASTNAME $ @29 FIRSTNAME $char12. DOB mmddyy8. SEX >:$1. ; >cards ; >1 123 SMITH ANNA MISSING F >1 125 SMITH MICHAEL 12/12/00 M >1 145 GREEN HOMAS 7/17/96 M >15 523 HARRIS HELEN 12/9/72 F >15 212 SMITHERS MICHAEL 12/12/00 M >15 302 HARRIS SUSAN 4/29/00 F >15 307 HARRIS TIM 1/2/01 M >; > >DATA B ; input >FAMILY PERSON $ LAST $ @31 FIRST $char12. DOB mmddyy8. SEX :$1. ; >cards ; >A44 AB2 SMITH ANNA MAE 12/3/85 F >A44 EF6 SMITH STEPHANIE 3/15/95 F >A44 CD12 SMITH MICHAEL 12/12/00 M >A44 GH6 GREEN THOMAS 7/17/96 M >B65 JK1 SMITHERS HELENE 1/1/72 F >B65 LM5 SMITHERS MICHAEL 12/12/00 M >B65 NO7 SMITHERS SUSAN 4/28/00 F >B65 PQZ SMITHERS TIM 1/2/01 M >B65 RSS SMITHERS TINA 8/11/01 M >; > >proc sql ; > create table xfam as > select distinct household , family > from a , b > where a.last=b.last > and a.first=b.first > and a.dob = b.dob > ; >quit ; > >Now you might proceed with: > >proc sql ; > create table look as > select a.* , b.* > from a , b , xfam > where a.household = xfam.household > and b.family = xfam.family > and (( spedis (a.firstname , b.first) < 25 > and spedis (a.lastname , b.last ) < 25 ) > or ( spedis (a.firstname , b.first) < 25 > and abs( a.dob - b.dob ) < 10 )) > ; >quit ; > >Some mother matching should be caught by this code or some variation of it. >Hopefully the number of hard core holdouts will be small enough for >eyeballing. Add HOUSEHOLD to B and print information for the problem >groups. > >It will take a lot more work, but at least you have seen some of the tools >commonly used in attacking this sort of problem. > >You might wonder why produce XFAM first. XFAM is used as a blocking factor >to reduce the number of records where the more complex matching clauses have >to be employed. In other words, it helps to weed out the trash. > >IanWhitlock@westat.com > >-----Original Message----- >From: B H R [mailto:rubarb@DESPAMMED.COM] >Sent: Wednesday, July 10, 2002 12:34 PM >To: SAS-L@LISTSERV.UGA.EDU >Subject: merging individuals by family > > >I am working with two somewhat dirty datasets that are similarly structured: > >DATA Set A example: >HOUSEHOLD PERSON# LASTNAME FIRSTNAME DOB SEX >1 123 SMITH ANNA MISSING F >1 125 SMITH MICHAEL 12/12/00 M >1 145 GREEN HOMAS 7/17/96 M > >15 523 HARRIS HELEN 12/9/72 F >15 212 SMITHERS MICHAEL 12/12/00 M >15 302 HARRIS SUSAN 4/29/00 F >15 307 HARRIS TIM 1/2/01 M > >DATA SET B example: >FAMILY PERSON LAST FIRST DOB SEX >A44 AB2 SMITH ANNA MAE 12/3/85 F >A44 EF6 SMITH STEPHANIE 3/15/95 F >A44 CD12 SMITH MICHAEL 12/12/00 M >A44 GH6 GREEN THOMAS 7/17/96 M > >B65 JK1 SMITHERS HELENE 1/1/72 F >B65 LM5 SMITHERS MICHAEL 12/12/00 M >B65 NO7 SMITHERS SUSAN 4/28/00 F >B65 PQZ SMITHERS TIM 1/2/01 M >B65 RSS SMITHERS TINA 8/11/01 M > >I am trying to find a group of mothers from the larger dataset A >(n=388,000) who I believe are in the smaller dataset B (n=160,000). >There is only one mother per household. For those I have been unable >to identify by using only their own names and birth dates I would like >to use information about the other family/household members to find >them. So, in my example data above, I would like to link Anna Smith >in Dataset A, (person # 123) and Anna Mae Smith (personid AB2) in the >Dataset B, because not only is her own data fairly consistent, but >there are also two other people, Michael and Thomas, in both datasets >who have good matches; similarly for Helen Harris and Helene Smithers. > >Can anyone suggest a good method using SAS to compare persons within >households? > >When I do this manually, I pick a subset from the other dataset that >seems to contain likely suspects, e.g. all persons in cases where >someone meets the criterion of LAST IN ('SMITH', 'GREEN'); or >SOUNDEX(FIRST)=SOUNDEX('HELEN'); my main problem in thinking about >writing SAS code is how to cycle through all the members within a >family/household group in each dataset (unless someone has another >thought on how to approach the problem). > >My more general question is how to compare everyone in the >two datasets by family. Is there anything different about the larger >merge (other than that the hh/family ids are available when one >individual in a family was matchable by individual person info alone)? > >Thanks for any help anyone can provide! > >- Barbara


Back to: Top of message | Previous page | Main SAS-L page