|
Calvin,
I did a little web search on "soundex" and came up with a number of hits
on the keyword "soundex" which you can do with your own browser.
In any case, here is one bit of info which you requested about how
soundex works:
Soundex English word-sounding Algorithm
M. K. Odell and R. C. Russell patented the Soundex phonetic comparison
system in 1918 and 1922. Soundex coding takes an English word and
produces a four digit representation of the word designed to match the
phonetic pronunciation of the word. It is normally used for ``fuzzy''
searches where a close match may be desired. For example, to come up
with alternative possibilities for a misspelled word some spelling
checker programs generate a Soundex code for the misspelled word and
then suggest other words with the same Soundex value. Additionally
Soundex codes are often used on surnames which are difficult to spell.
The creation of a Soundex code is a pretty simple operation. The first
step is to remove all non-English letters or symbols. In the case of
accented vowels, simply remove the accents. Any hyphens, spaces, etc...
also. In addition, remove all H's and W's unless they are the initial
letter in the word. Next, take the first letter in the word and make it
the first letter of the Soundex code. For each remaining letter in the
word, translate it to a number with the table below and concatenate the
numbers, preserving order, on to the Soundex value.
------------------------------------------------------------------------
--------
A, E, I, O, U, Y = 0
B, F, P, V = 1
C, G, J, K, Q, S, X, Z = 2
D, T = 3
L = 4
M, N = 5
R = 6
------------------------------------------------------------------------
--------
Now, combine any double numbers into a single instance of that number.
Further, if the first number in the Soundex value is the same as the
code number for the initial letter, delete the first number. Now, remove
all zeros from the Soundex string. Finally, return the first four
characters of the end product as the Soundex encoding. If there are less
than four characters to be returned, concatenate enough zeros to make
the length four.
------------------------------------------------------------------------
--------
Here are some of my thoughts about your request:
As I stated in my reply to you last night, I have found the sound like
operator in SQL to be OK at times, and at other times not to be what I
needed. Much depends on the data you are working with and how well it
is organized (separate fields are best). However, when multiple fields
are contained in 1 string such as
"Charles Patridge 172 Monce Rd Burlington CT 06013" and you are not able
to separate the contents into separate fields such as
"Charles" "Patridge" "172 Monce Rd" "Burlington" "CT" "06013"
then the sound like operator does not perform very well, understandably
so. When dealing with many different companies/data files, I typically
revert to using some form or variation of my "fuzzy" match routines so
that I can control what, where and how to fuzzy match fields as each
field (First Name, Last Name, Address, State, Zip, Birthdate, etc) has
its own properties, rules and/or attributes which should not be
intermixed with each other for matching purposes.
Finally, I have started to compile a list of references on record
linkages (TIP00219) which can often involve the "soundex" function.
This tip can be found on my web site http://www.sconsig.com under Tips
and Techniques. I will add more references as I obtain them.
Good Luck and I hope my input as helped you in some small way.
Regards,
Charles Patridge
Email: Charles_S_Patridge@prodigy.net
You also might try another web page of mine for more related references,
links, materials and software, including "dataflux"
http://www.sconsig.com/e-scrub.htm
when using such software as "dataflux" be prepared to do ALOT of testing
for desired results, tuning the parameters, learning the rules, building
your own rules for increased desired hits, etc.
And MAKE SURE YOU TEST THOROUGHLY to see the quirks.
In article
<B34F74865B47D2119E9D0000F8B844710BDE1204@DFSPO02.dofasco.ca>,
bill_droogendyk@DOFASCO.CA wrote:
> Calvin:
>
> from the demo that I saw, I think that DataFlux from www.sas.com would
do
> the trick - take a look
>
> W.(Bill) Droogendyk
> Quality Systems
> Dofasco Inc. Hamilton ON Canada
> Telephone: 905 548 7200 x3359
> Fax: 905 548 4007
>
> Still using MS toys? XLR82SAS!
>
> > -----Original Message-----
> > From: Calvin Huang-CEO of UITRA [SMTP:wong_ahpo@HKICABLE.COM]
> > Sent: Tuesday, 09 January, 2001 14:23
> > To: SAS-L@LISTSERV.UGA.EDU
> > Subject: Fuzzy matching!
> >
> > Hi all,
> > I am conducting a datamining project in a bank now, and sometime
I
> > have
> > to do some matching to update the database.
> > A problem comes up when I do matching ...........
> > One of the most difficult task is :
> > sometime the address/name declared by the customers is
mistyped/mispelled,
> > and this leads to a great problem when we do some matching...
> > e.g. in my database mary's address is :
> >
> > Flat A 35/F Block C Tei Ko Shing HongKong
> >
> > which is contained in one field called CUSADD.
> >
> > Now I have a list of estate in Hongkong and was told to identify the
which
> > estate particular customer is living in.
> > But 'Tei Ko Shing' is mispelled in mary's address where 'TaiKoo
Shing' is
> > the right one which is contained in the list of estate.
> >
> > If the mary's estate was not mispelled then, I can use the index to
> > complete
> > the task easily...
> > Or if the address is well formatted with the estate as a separate
field.
> > that I can use the sound like operator to do the joining in PROC
SQL.
> >
> > But now...., it is very difficult to do so.....
> > Can you share your experience with me if you have ever faced/tackled
> > similar
> > problem?
> >
> > Actually I just know the usage of the sound like operator in SQL,
but dont
> > know much about the algorithm behind ( Soundex Algorithm ), can you
share
> > you knowledge on this algorithm with me or refer me some reference.
> >
> > Thanks a lot for reading my problem patiently.
> >
> > Calvin.
>
--
Charles Patridge - PDPC, Ltd.
172 Monce Road - Burlington, CT 06013 USA
Phone: 860-673-9278 or 860-675-9026
Email: Charles_S_Patridge@prodigy.net - Web: www.sconsig.com
Sent via Deja.com
http://www.deja.com/
|