Date: Wed, 10 Jan 2001 15:56:21 GMT Charles_S_Patridge@PRODIGY.NET "SAS(r) Discussion" Charles Patridge Deja.com Re: Fuzzy matching!

Calvin,

I did a little web search on "soundex" and came up with a number of hits on the keyword "soundex" which you can do with your own browser.

In any case, here is one bit of info which you requested about how soundex works:

Soundex English word-sounding Algorithm M. K. Odell and R. C. Russell patented the Soundex phonetic comparison system in 1918 and 1922. Soundex coding takes an English word and produces a four digit representation of the word designed to match the phonetic pronunciation of the word. It is normally used for ``fuzzy'' searches where a close match may be desired. For example, to come up with alternative possibilities for a misspelled word some spelling checker programs generate a Soundex code for the misspelled word and then suggest other words with the same Soundex value. Additionally Soundex codes are often used on surnames which are difficult to spell.

The creation of a Soundex code is a pretty simple operation. The first step is to remove all non-English letters or symbols. In the case of accented vowels, simply remove the accents. Any hyphens, spaces, etc... also. In addition, remove all H's and W's unless they are the initial letter in the word. Next, take the first letter in the word and make it the first letter of the Soundex code. For each remaining letter in the word, translate it to a number with the table below and concatenate the numbers, preserving order, on to the Soundex value.

------------------------------------------------------------------------ --------

A, E, I, O, U, Y = 0 B, F, P, V = 1 C, G, J, K, Q, S, X, Z = 2 D, T = 3 L = 4 M, N = 5 R = 6

------------------------------------------------------------------------ --------

Now, combine any double numbers into a single instance of that number. Further, if the first number in the Soundex value is the same as the code number for the initial letter, delete the first number. Now, remove all zeros from the Soundex string. Finally, return the first four characters of the end product as the Soundex encoding. If there are less than four characters to be returned, concatenate enough zeros to make the length four.

------------------------------------------------------------------------ --------

As I stated in my reply to you last night, I have found the sound like operator in SQL to be OK at times, and at other times not to be what I needed. Much depends on the data you are working with and how well it is organized (separate fields are best). However, when multiple fields are contained in 1 string such as "Charles Patridge 172 Monce Rd Burlington CT 06013" and you are not able to separate the contents into separate fields such as "Charles" "Patridge" "172 Monce Rd" "Burlington" "CT" "06013"

then the sound like operator does not perform very well, understandably so. When dealing with many different companies/data files, I typically revert to using some form or variation of my "fuzzy" match routines so that I can control what, where and how to fuzzy match fields as each field (First Name, Last Name, Address, State, Zip, Birthdate, etc) has its own properties, rules and/or attributes which should not be intermixed with each other for matching purposes.

Finally, I have started to compile a list of references on record linkages (TIP00219) which can often involve the "soundex" function. This tip can be found on my web site http://www.sconsig.com under Tips and Techniques. I will add more references as I obtain them.

Good Luck and I hope my input as helped you in some small way.

Regards, Charles Patridge Email: Charles_S_Patridge@prodigy.net

You also might try another web page of mine for more related references, links, materials and software, including "dataflux"

http://www.sconsig.com/e-scrub.htm

when using such software as "dataflux" be prepared to do ALOT of testing for desired results, tuning the parameters, learning the rules, building your own rules for increased desired hits, etc.

And MAKE SURE YOU TEST THOROUGHLY to see the quirks.