LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (April 2002, week 1)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:   Mon, 1 Apr 2002 14:14:35 -0500
Reply-To:   "Fehd, Ronald J." <rjf2@CDC.GOV>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   "Fehd, Ronald J." <rjf2@CDC.GOV>
Subject:   Re: Soft Matches (SSN typos)
Comments:   cc: Charles <cwentzel@PHC4.ORG>
Content-Type:   text/plain

> From: Charles [mailto:cwentzel@PHC4.ORG] > So, I'm performing matches within a dataset. Let's say I'm > doing a proc sql with two tables a and b.

I think you might consider output the non-matches to separate tables --consider the exclusive or operator, in SAS that's xor(A,B) DATA BOTH XOR; merge A(in=haveA) B(in=haveB); by ID; if and(HaveA,HaveB) then output BOTH; %* ;-) ; else if xor(HaveA,HaveB) then output XOR; who knows what it is in SQL? --

and examine them, to see if you can't get a grip on what the most common transpositions are. XXX-XX-XXXX remember SS-3, and SS-2, the first five digits of SS, are regional codes.

Given a group of SSn from a state you'll find most of your variation in SS-4 which is why it is sometimes used as a password/authentication string.

Given those factoids I'd look for transpositions in columns 1&2, 2&3, not 3&4, 4&5, not 5&6 6&7, 7&8, 8&9,

Ron Fehd the macro maven CDC Atlanta GA USA RJF2@cdc.gov OpSys: Win_Pro Ver: 8.2

Repetition obfuscates! Repetition reduction enhances elegance! Repetition reduction furthers finesse!

points for poofreading the and(HaveA,HaveB) phrase

> If the ssn's are equal, I can just say (in the where or join > statement): > > a.ssn=b.ssn > > If I want to allow the ssns to differ by one digit, I can say: > > ( > (substr(a.ssn,1,1)=substr(b.ssn,1,1))+ > (substr(a.ssn,2,1)=substr(b.ssn,2,1))+ > (substr(a.ssn,3,1)=substr(b.ssn,3,1))+ > (substr(a.ssn,4,1)=substr(b.ssn,4,1))+ > (substr(a.ssn,5,1)=substr(b.ssn,5,1))+ > (substr(a.ssn,6,1)=substr(b.ssn,6,1))+ > (substr(a.ssn,7,1)=substr(b.ssn,7,1))+ > (substr(a.ssn,8,1)=substr(b.ssn,8,1))+ > (substr(a.ssn,9,1)=substr(b.ssn,9,1)) > )=8 > > But what if I want to look for ssns that have two characters > transposed? I > started with: > > ( > (a.ssn = > substr(b.ssn,2,1)||substr(b.ssn,1,1)||substr > (b.ssn,3)) or > (a.ssn > =substr(b.ssn,1,1)||substr(b.ssn,3,1)||substr(b.ssn,2,1)||substr > (b.ssn,4)) or > (a.ssn > =substr(b.ssn,1,2)||substr(b.ssn,4,1)||substr(b.ssn,3,1)||substr > (b.ssn,5)) or > (a.ssn > =substr(b.ssn,1,3)||substr(b.ssn,5,1)||substr(b.ssn,4,1)||substr > (b.ssn,6)) or > (a.ssn > =substr(b.ssn,1,4)||substr(b.ssn,6,1)||substr(b.ssn,5,1)||substr > (b.ssn,7)) or > (a.ssn > =substr(b.ssn,1,5)||substr(b.ssn,7,1)||substr(b.ssn,6,1)||substr > (b.ssn,8)) or > (a.ssn > =substr(b.ssn,1,6)||substr(b.ssn,8,1)||substr(b.ssn,7,1)||substr > (b.ssn,9)) or > (a.ssn =substr(b.ssn,1,7)||substr(b.ssn,9,1)||substr(b.ssn,8,1) > ) > > but this obviously only addresses transposed neighboring > digits. I could > continue with all (9c2)=36 possible combinations... but I > feel like I'm > missing something. Is there a simpler way? > > Plus, I'm quite curious to see how y'all respond to the idea of soft > matches in general... Any good references out there? >


Back to: Top of message | Previous page | Main SAS-L page