|
> From: Charles [mailto:cwentzel@PHC4.ORG]
> So, I'm performing matches within a dataset. Let's say I'm
> doing a proc sql with two tables a and b.
I think you might consider output the non-matches to separate tables
--consider the exclusive or operator,
in SAS that's xor(A,B)
DATA BOTH XOR;
merge A(in=haveA)
B(in=haveB);
by ID;
if and(HaveA,HaveB) then output BOTH; %* ;-) ;
else if xor(HaveA,HaveB) then output XOR;
who knows what it is in SQL? --
and examine them, to see if you can't get a grip on what the most common
transpositions are.
XXX-XX-XXXX
remember SS-3, and SS-2, the first five digits of SS,
are regional codes.
Given a group of SSn from a state
you'll find most of your variation in SS-4
which is why it is sometimes used as a password/authentication string.
Given those factoids
I'd look for transpositions in columns 1&2, 2&3,
not 3&4,
4&5,
not 5&6
6&7, 7&8, 8&9,
Ron Fehd the macro maven CDC Atlanta GA USA RJF2@cdc.gov
OpSys: Win_Pro Ver: 8.2
Repetition obfuscates!
Repetition reduction enhances elegance!
Repetition reduction furthers finesse!
points for poofreading the and(HaveA,HaveB) phrase
> If the ssn's are equal, I can just say (in the where or join
> statement):
>
> a.ssn=b.ssn
>
> If I want to allow the ssns to differ by one digit, I can say:
>
> (
> (substr(a.ssn,1,1)=substr(b.ssn,1,1))+
> (substr(a.ssn,2,1)=substr(b.ssn,2,1))+
> (substr(a.ssn,3,1)=substr(b.ssn,3,1))+
> (substr(a.ssn,4,1)=substr(b.ssn,4,1))+
> (substr(a.ssn,5,1)=substr(b.ssn,5,1))+
> (substr(a.ssn,6,1)=substr(b.ssn,6,1))+
> (substr(a.ssn,7,1)=substr(b.ssn,7,1))+
> (substr(a.ssn,8,1)=substr(b.ssn,8,1))+
> (substr(a.ssn,9,1)=substr(b.ssn,9,1))
> )=8
>
> But what if I want to look for ssns that have two characters
> transposed? I
> started with:
>
> (
> (a.ssn =
> substr(b.ssn,2,1)||substr(b.ssn,1,1)||substr
> (b.ssn,3)) or
> (a.ssn
> =substr(b.ssn,1,1)||substr(b.ssn,3,1)||substr(b.ssn,2,1)||substr
> (b.ssn,4)) or
> (a.ssn
> =substr(b.ssn,1,2)||substr(b.ssn,4,1)||substr(b.ssn,3,1)||substr
> (b.ssn,5)) or
> (a.ssn
> =substr(b.ssn,1,3)||substr(b.ssn,5,1)||substr(b.ssn,4,1)||substr
> (b.ssn,6)) or
> (a.ssn
> =substr(b.ssn,1,4)||substr(b.ssn,6,1)||substr(b.ssn,5,1)||substr
> (b.ssn,7)) or
> (a.ssn
> =substr(b.ssn,1,5)||substr(b.ssn,7,1)||substr(b.ssn,6,1)||substr
> (b.ssn,8)) or
> (a.ssn
> =substr(b.ssn,1,6)||substr(b.ssn,8,1)||substr(b.ssn,7,1)||substr
> (b.ssn,9)) or
> (a.ssn =substr(b.ssn,1,7)||substr(b.ssn,9,1)||substr(b.ssn,8,1)
> )
>
> but this obviously only addresses transposed neighboring
> digits. I could
> continue with all (9c2)=36 possible combinations... but I
> feel like I'm
> missing something. Is there a simpler way?
>
> Plus, I'm quite curious to see how y'all respond to the idea of soft
> matches in general... Any good references out there?
>
|