Date: Mon, 1 Apr 2002 14:14:35 -0500 "Fehd, Ronald J." "SAS(r) Discussion" "Fehd, Ronald J." Re: Soft Matches (SSN typos) cc: Charles text/plain

> From: Charles [mailto:cwentzel@PHC4.ORG] > So, I'm performing matches within a dataset. Let's say I'm > doing a proc sql with two tables a and b.

I think you might consider output the non-matches to separate tables --consider the exclusive or operator, in SAS that's xor(A,B) DATA BOTH XOR; merge A(in=haveA) B(in=haveB); by ID; if and(HaveA,HaveB) then output BOTH; %* ;-) ; else if xor(HaveA,HaveB) then output XOR; who knows what it is in SQL? --

and examine them, to see if you can't get a grip on what the most common transpositions are. XXX-XX-XXXX remember SS-3, and SS-2, the first five digits of SS, are regional codes.

Given a group of SSn from a state you'll find most of your variation in SS-4 which is why it is sometimes used as a password/authentication string.

Given those factoids I'd look for transpositions in columns 1&2, 2&3, not 3&4, 4&5, not 5&6 6&7, 7&8, 8&9,

Ron Fehd the macro maven CDC Atlanta GA USA RJF2@cdc.gov OpSys: Win_Pro Ver: 8.2

Repetition obfuscates! Repetition reduction enhances elegance! Repetition reduction furthers finesse!

points for poofreading the and(HaveA,HaveB) phrase

> If the ssn's are equal, I can just say (in the where or join > statement): > > a.ssn=b.ssn > > If I want to allow the ssns to differ by one digit, I can say: > > ( > (substr(a.ssn,1,1)=substr(b.ssn,1,1))+ > (substr(a.ssn,2,1)=substr(b.ssn,2,1))+ > (substr(a.ssn,3,1)=substr(b.ssn,3,1))+ > (substr(a.ssn,4,1)=substr(b.ssn,4,1))+ > (substr(a.ssn,5,1)=substr(b.ssn,5,1))+ > (substr(a.ssn,6,1)=substr(b.ssn,6,1))+ > (substr(a.ssn,7,1)=substr(b.ssn,7,1))+ > (substr(a.ssn,8,1)=substr(b.ssn,8,1))+ > (substr(a.ssn,9,1)=substr(b.ssn,9,1)) > )=8 > > But what if I want to look for ssns that have two characters > transposed? I > started with: > > ( > (a.ssn = > substr(b.ssn,2,1)||substr(b.ssn,1,1)||substr > (b.ssn,3)) or > (a.ssn > =substr(b.ssn,1,1)||substr(b.ssn,3,1)||substr(b.ssn,2,1)||substr > (b.ssn,4)) or > (a.ssn > =substr(b.ssn,1,2)||substr(b.ssn,4,1)||substr(b.ssn,3,1)||substr > (b.ssn,5)) or > (a.ssn > =substr(b.ssn,1,3)||substr(b.ssn,5,1)||substr(b.ssn,4,1)||substr > (b.ssn,6)) or > (a.ssn > =substr(b.ssn,1,4)||substr(b.ssn,6,1)||substr(b.ssn,5,1)||substr > (b.ssn,7)) or > (a.ssn > =substr(b.ssn,1,5)||substr(b.ssn,7,1)||substr(b.ssn,6,1)||substr > (b.ssn,8)) or > (a.ssn > =substr(b.ssn,1,6)||substr(b.ssn,8,1)||substr(b.ssn,7,1)||substr > (b.ssn,9)) or > (a.ssn =substr(b.ssn,1,7)||substr(b.ssn,9,1)||substr(b.ssn,8,1) > ) > > but this obviously only addresses transposed neighboring > digits. I could > continue with all (9c2)=36 possible combinations... but I > feel like I'm > missing something. Is there a simpler way? > > Plus, I'm quite curious to see how y'all respond to the idea of soft > matches in general... Any good references out there? >

Back to: Top of message | Previous page | Main SAS-L page