LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (April 2009, week 5)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Wed, 29 Apr 2009 15:13:17 -0700
Reply-To:     Dale McLerran <stringplayer_2@YAHOO.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Dale McLerran <stringplayer_2@YAHOO.COM>
Subject:      Re: fuzzy match problem
Content-Type: text/plain; charset=utf-8

It should be noted that the SPEDIS function is asymmetric which means that SPEDIS(var1,var2)=SPEDIS(var2,var1) is NOT TRUE for all values of var1, var2. Since the SPEDIS function returns a normalized cost for converting from var2 to var1 and if there is no a priori reason to believe that var1 is the "correct" string, then it may be advisable to compute the costs of going both directions and average the two costs.

Cost = mean( (1 - (length(compress(var1)) * spedis(compress(var1),compress(var2)) / 2400)), (1 - (length(compress(var2)) * spedis(compress(var2),compress(var1)) / 2400)));

Dale

--------------------------------------- Dale McLerran Fred Hutchinson Cancer Research Center mailto: dmclerra@NO_SPAMfhcrc.org Ph: (206) 667-2926 Fax: (206) 667-5977 ---------------------------------------

--- On Wed, 4/29/09, Sigurd Hermansen <HERMANS1@WESTAT.COM> wrote:

> From: Sigurd Hermansen <HERMANS1@WESTAT.COM> > Subject: Re: fuzzy match problem > To: SAS-L@LISTSERV.UGA.EDU > Date: Wednesday, April 29, 2009, 1:39 PM > Fuzzy matching and artificial > intelligence won't necessarily return the required results. > All methods currently in use have some likelihood of > returning correct results and some likelihood of returning > an incorrect result. You likely know that good linkage > method has a fairly high likelihood of the former and a > relatively low likelihood of the latter. > > SAS provides several functions for fuzzy matching. I find a > modified version of SPEDIS() a good way to generate a match > "score" for the comparison of values of two variables. > > SPEDIS() computes a total cost of rearranging characters in > one string to match characters in another string. Simple > rearrangements have a small cost, and complex rearrangements > have a high cost). This expression computes a match score > (the closer the match, the higher the score) for the string > values of the variables: > >         >   (1 - (length(compress(var1)) * >         >  >    spedis(compress(var1),compress(var2)) / >         >     2400)) > > The 2400 weight in the expression requires a very close > match for a score of 0.95 or higher. Those applying the > expression to pairs of strings select a cut-off that > balances the costs of false matches against costs of not > finding correct matches. I wouldn't expect too much from a > first attempt at separating correct from false matches. > S > > -----Original Message----- > From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] > On Behalf Of Terry He > Sent: Wednesday, April 29, 2009 4:12 PM > To: SAS-L@LISTSERV.UGA.EDU > Subject: fuzzy match problem > > > I have two variables. I am trying to match one variable to > another. For example, one list has “10-K WIZARD > TECHNOLOGY LLC†and the other has “10K WIZARD > TECHNOLOGY LLCâ€. The vlookup function in excel will not > necessarily return the required result in this case. how > could I do it in SAS? Here is some example data: > Var1  Var2 > 101 CALIFORNIA VENTURE  @STAKE, INC > 10K WIZARD TECHNOLOGY LLC  10-K WIZARD > TECHNOLOGY LLC > 13D RESEARCH INC  1E LIMITED > 2008 MIECF  29WEST INC. > 2C COMERCIO E IMPORTACAO DE  3 TIER TECHNOLOGY > INC. > 2K ADVISORS LLC  33-6 CONSULTANCY LTD > 3 B CLIM  360 CONSULTING INC. > 3 REASONS LTD.  360 RELOCATIONS LIMITED > 3DADVISORS LLC  3SCOM Y.K. > 3V CAPITAL LIMITED  3T SYSTEMS, INC > 4 TABELIAO DE PROTESTO DE  4CAST LIMITED > 401K COMPANY  5B TECHNOLOGIES CORP > A G EDWARDS INC  6FIGUREJOBS.COM LLC > A V ARKANSAS  7 CITY LEARNING LIMITED > AAA LAUNDROMAT  9-20 RECRUITMENT LTD. > AAA RESEARCHONE FINANCIAL  A. EPSTEIN & > SONS > INTERNATIONAL, INC. > ABATEX INDUSTRIA E COMERCIO  A. PAPPAJOHN > COMPANY > ABG SUNDAL COLLIER INC  A.S.A.INTERNATIONAL > HOLDINGS LIMITED > ABN AMRO HOLDING NV  A1 EXPRESS DELIVERY > SERVICE INC. >


Back to: Top of message | Previous page | Main SAS-L page