Date: Wed, 29 Apr 2009 15:13:17 -0700
Reply-To: Dale McLerran <stringplayer_2@YAHOO.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Dale McLerran <stringplayer_2@YAHOO.COM>
Subject: Re: fuzzy match problem
Content-Type: text/plain; charset=utf-8
Note that the SPEDIS function is asymmetric: in general,
SPEDIS(var1,var2) does not equal SPEDIS(var2,var1). The call
SPEDIS(var1,var2) returns a normalized cost for converting
var2 to var1. If there is no a priori reason to believe that
var1 is the "correct" string, then it may be advisable to compute
the cost in both directions and average the two resulting scores:
Cost = mean( (1 - (length(compress(var1)) *
                   spedis(compress(var1),compress(var2)) / 2400)),
             (1 - (length(compress(var2)) *
                   spedis(compress(var2),compress(var1)) / 2400)));
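As a rough sketch of how that expression might sit in a DATA step
(the dataset names PAIRS and SCORED and the variables S1, S2,
SCORE12, SCORE21, SCORE and MATCH are made up for illustration, and
the 0.95 cut-off is only a placeholder, not a recommendation):

```sas
/* Two pairs taken from the example data quoted below. */
data pairs;
   infile datalines dlm='|' truncover;
   length var1 var2 $40;
   input var1 var2;
   datalines;
10K WIZARD TECHNOLOGY LLC|10-K WIZARD TECHNOLOGY LLC
13D RESEARCH INC|1E LIMITED
;
run;

data scored;
   set pairs;
   /* Remove blanks once so both directions see the same strings. */
   s1 = compress(var1);
   s2 = compress(var2);
   /* Score in each direction, then average the two scores. */
   score12 = 1 - length(s1) * spedis(s1, s2) / 2400;
   score21 = 1 - length(s2) * spedis(s2, s1) / 2400;
   score   = mean(score12, score21);
   /* Illustrative cut-off only; tune it against your own data. */
   match = (score >= 0.95);
run;

proc print data=scored;
run;
```

Keeping SCORE12 and SCORE21 as separate variables makes it easy to
see how far apart the two directions are for a given pair.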
Dale
---------------------------------------
Dale McLerran
Fred Hutchinson Cancer Research Center
mailto: dmclerra@NO_SPAMfhcrc.org
Ph: (206) 667-2926
Fax: (206) 667-5977
---------------------------------------
--- On Wed, 4/29/09, Sigurd Hermansen <HERMANS1@WESTAT.COM> wrote:
> From: Sigurd Hermansen <HERMANS1@WESTAT.COM>
> Subject: Re: fuzzy match problem
> To: SAS-L@LISTSERV.UGA.EDU
> Date: Wednesday, April 29, 2009, 1:39 PM
> Fuzzy matching and artificial
> intelligence won't necessarily return the required results.
> All methods currently in use have some likelihood of
> returning correct results and some likelihood of returning
> an incorrect result. You likely know that a good linkage
> method has a fairly high likelihood of the former and a
> relatively low likelihood of the latter.
>
> SAS provides several functions for fuzzy matching. I find a
> modified version of SPEDIS() a good way to generate a match
> "score" for the comparison of values of two variables.
>
> SPEDIS() computes a total cost of rearranging characters in
> one string to match characters in another string (simple
> rearrangements have a small cost, and complex rearrangements
> have a high cost). This expression computes a match score
> (the closer the match, the higher the score) for the string
> values of the variables:
>
>     (1 - (length(compress(var1)) *
>           spedis(compress(var1),compress(var2)) /
>           2400))
>
> The 2400 weight in the expression requires a very close
> match for a score of 0.95 or higher. Those applying the
> expression to pairs of strings select a cut-off that
> balances the costs of false matches against costs of not
> finding correct matches. I wouldn't expect too much from a
> first attempt at separating correct from false matches.
> S
>
> -----Original Message-----
> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU]
> On Behalf Of Terry He
> Sent: Wednesday, April 29, 2009 4:12 PM
> To: SAS-L@LISTSERV.UGA.EDU
> Subject: fuzzy match problem
>
>
> I have two variables. I am trying to match one variable to
> another. For example, one list has "10-K WIZARD
> TECHNOLOGY LLC" and the other has "10K WIZARD
> TECHNOLOGY LLC". The VLOOKUP function in Excel will not
> necessarily return the required result in this case. How
> could I do it in SAS? Here is some example data:
> Var1                         Var2
> 101 CALIFORNIA VENTURE       @STAKE, INC
> 10K WIZARD TECHNOLOGY LLC    10-K WIZARD TECHNOLOGY LLC
> 13D RESEARCH INC             1E LIMITED
> 2008 MIECF                   29WEST INC.
> 2C COMERCIO E IMPORTACAO DE  3 TIER TECHNOLOGY INC.
> 2K ADVISORS LLC              33-6 CONSULTANCY LTD
> 3 B CLIM                     360 CONSULTING INC.
> 3 REASONS LTD.               360 RELOCATIONS LIMITED
> 3DADVISORS LLC               3SCOM Y.K.
> 3V CAPITAL LIMITED           3T SYSTEMS, INC
> 4 TABELIAO DE PROTESTO DE    4CAST LIMITED
> 401K COMPANY                 5B TECHNOLOGIES CORP
> A G EDWARDS INC              6FIGUREJOBS.COM LLC
> A V ARKANSAS                 7 CITY LEARNING LIMITED
> AAA LAUNDROMAT               9-20 RECRUITMENT LTD.
> AAA RESEARCHONE FINANCIAL    A. EPSTEIN & SONS INTERNATIONAL, INC.
> ABATEX INDUSTRIA E COMERCIO  A. PAPPAJOHN COMPANY
> ABG SUNDAL COLLIER INC       A.S.A.INTERNATIONAL HOLDINGS LIMITED
> ABN AMRO HOLDING NV          A1 EXPRESS DELIVERY SERVICE INC.
>