LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (January 2001, week 2)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:   Wed, 10 Jan 2001 15:56:21 GMT
Reply-To:   Charles_S_Patridge@PRODIGY.NET
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   Charles Patridge <Charles_S_Patridge@PRODIGY.NET>
Organization:   Deja.com
Subject:   Re: Fuzzy matching!

Calvin,

I did a little web search on "soundex" and came up with a number of hits on the keyword "soundex" which you can do with your own browser.

In any case, here is one bit of info which you requested about how soundex works:

Soundex English word-sounding Algorithm M. K. Odell and R. C. Russell patented the Soundex phonetic comparison system in 1918 and 1922. Soundex coding takes an English word and produces a four digit representation of the word designed to match the phonetic pronunciation of the word. It is normally used for ``fuzzy'' searches where a close match may be desired. For example, to come up with alternative possibilities for a misspelled word some spelling checker programs generate a Soundex code for the misspelled word and then suggest other words with the same Soundex value. Additionally Soundex codes are often used on surnames which are difficult to spell.

The creation of a Soundex code is a pretty simple operation. The first step is to remove all non-English letters or symbols. In the case of accented vowels, simply remove the accents. Any hyphens, spaces, etc... also. In addition, remove all H's and W's unless they are the initial letter in the word. Next, take the first letter in the word and make it the first letter of the Soundex code. For each remaining letter in the word, translate it to a number with the table below and concatenate the numbers, preserving order, on to the Soundex value.

------------------------------------------------------------------------ --------

A, E, I, O, U, Y = 0 B, F, P, V = 1 C, G, J, K, Q, S, X, Z = 2 D, T = 3 L = 4 M, N = 5 R = 6

------------------------------------------------------------------------ --------

Now, combine any double numbers into a single instance of that number. Further, if the first number in the Soundex value is the same as the code number for the initial letter, delete the first number. Now, remove all zeros from the Soundex string. Finally, return the first four characters of the end product as the Soundex encoding. If there are less than four characters to be returned, concatenate enough zeros to make the length four.

------------------------------------------------------------------------ --------

Here are some of my thoughts about your request:

As I stated in my reply to you last night, I have found the sound like operator in SQL to be OK at times, and at other times not to be what I needed. Much depends on the data you are working with and how well it is organized (separate fields are best). However, when multiple fields are contained in 1 string such as "Charles Patridge 172 Monce Rd Burlington CT 06013" and you are not able to separate the contents into separate fields such as "Charles" "Patridge" "172 Monce Rd" "Burlington" "CT" "06013"

then the sound like operator does not perform very well, understandably so. When dealing with many different companies/data files, I typically revert to using some form or variation of my "fuzzy" match routines so that I can control what, where and how to fuzzy match fields as each field (First Name, Last Name, Address, State, Zip, Birthdate, etc) has its own properties, rules and/or attributes which should not be intermixed with each other for matching purposes.

Finally, I have started to compile a list of references on record linkages (TIP00219) which can often involve the "soundex" function. This tip can be found on my web site http://www.sconsig.com under Tips and Techniques. I will add more references as I obtain them.

Good Luck and I hope my input as helped you in some small way.

Regards, Charles Patridge Email: Charles_S_Patridge@prodigy.net

You also might try another web page of mine for more related references, links, materials and software, including "dataflux"

http://www.sconsig.com/e-scrub.htm

when using such software as "dataflux" be prepared to do ALOT of testing for desired results, tuning the parameters, learning the rules, building your own rules for increased desired hits, etc.

And MAKE SURE YOU TEST THOROUGHLY to see the quirks.

In article <B34F74865B47D2119E9D0000F8B844710BDE1204@DFSPO02.dofasco.ca>, bill_droogendyk@DOFASCO.CA wrote: > Calvin: > > from the demo that I saw, I think that DataFlux from www.sas.com would do > the trick - take a look > > W.(Bill) Droogendyk > Quality Systems > Dofasco Inc. Hamilton ON Canada > Telephone: 905 548 7200 x3359 > Fax: 905 548 4007 > > Still using MS toys? XLR82SAS! > > > -----Original Message----- > > From: Calvin Huang-CEO of UITRA [SMTP:wong_ahpo@HKICABLE.COM] > > Sent: Tuesday, 09 January, 2001 14:23 > > To: SAS-L@LISTSERV.UGA.EDU > > Subject: Fuzzy matching! > > > > Hi all, > > I am conducting a datamining project in a bank now, and sometime I > > have > > to do some matching to update the database. > > A problem comes up when I do matching ........... > > One of the most difficult task is : > > sometime the address/name declared by the customers is mistyped/mispelled, > > and this leads to a great problem when we do some matching... > > e.g. in my database mary's address is : > > > > Flat A 35/F Block C Tei Ko Shing HongKong > > > > which is contained in one field called CUSADD. > > > > Now I have a list of estate in Hongkong and was told to identify the which > > estate particular customer is living in. > > But 'Tei Ko Shing' is mispelled in mary's address where 'TaiKoo Shing' is > > the right one which is contained in the list of estate. > > > > If the mary's estate was not mispelled then, I can use the index to > > complete > > the task easily... > > Or if the address is well formatted with the estate as a separate field. > > that I can use the sound like operator to do the joining in PROC SQL. > > > > But now...., it is very difficult to do so..... > > Can you share your experience with me if you have ever faced/tackled > > similar > > problem? > > > > Actually I just know the usage of the sound like operator in SQL, but dont > > know much about the algorithm behind ( Soundex Algorithm ), can you share > > you knowledge on this algorithm with me or refer me some reference. > > > > Thanks a lot for reading my problem patiently. > > > > Calvin. >

-- Charles Patridge - PDPC, Ltd. 172 Monce Road - Burlington, CT 06013 USA Phone: 860-673-9278 or 860-675-9026 Email: Charles_S_Patridge@prodigy.net - Web: www.sconsig.com

Sent via Deja.com http://www.deja.com/


Back to: Top of message | Previous page | Main SAS-L page