LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (March 2002, week 2)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Wed, 13 Mar 2002 10:54:40 -0500
Reply-To:     Kevin Delaney <khd8@CDC.GOV>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Kevin Delaney <khd8@CDC.GOV>
Subject:      Re: Many many to many merge/joins

Does anyone know of a site that has the SEUGI proceedings online??

Kevin

On Wed, 13 Mar 2002 05:50:24 +0000, Paul Dorfman <paul_dorfman@HOTMAIL.COM> wrote:

>Nigel, > >PMD might know a thing or two about coding stuff in a Data step, but what >you need is a *strategy* of detecting the highest matching score with the >least amount of computer work. It is not a trivial task at all. However, you >are at luck, as one of our fellow SAS-lers, Sigurd Wilson Hermansen, is the >world-leading expert in the field of fuzzy and/or probabilisitc record >linkage, a specialist whose expertise goes far beyond evident. Having worked >with SWH on a couple of projects of this sort, PMD can attest with a full >measure of confidence that SWH knows his fuzz. I am sure that, despite being >quite busy running things in the SAS Mecca, he will chime in. > >At any rate, were I you, I would dig out SWH's paper on probabilistic >linkage from the SEUGI Proceedings in Florence and start delving head first. > >Kind regards, >===================== >Paul Michael Dorfman >Jacksonville, FL >===================== > >----Original Message Follows---- >From: Nigel Tufnel <dousk8@HOTMAIL.COM> >Reply-To: Nigel Tufnel <dousk8@HOTMAIL.COM> >To: SAS-L@LISTSERV.UGA.EDU >Subject: Many many to many merge/joins >Date: Tue, 12 Mar 2002 20:24:36 -0500 > >I am working on a project to compare name and address on 2 files. I believe >that I need to perform a many to many merge/join because I have no match >"key." I need to compare all records on the right table to all records on >the left table to calculate a match score and then keep only those that >score above a certain threshold. > >I developed some code and tested it on small files. Small samples of my >files work well, but as you have guessed, the runtime explodes when the file >sizes grow. For example, the full size files are both approximately 10 MM >records each. A many to many merge/join requires 100 trillion comparisons! >Yikes. > >I've figured that I can reduce the number of comparisons if I believe one >piece of address information is accurate. That is, I'm assuming that the >zip code on both files is sufficiently accurate. So, if there are roughly >40k zips in the US, I'd have about 250 records per zip (just quick back of >the envelope type calculations). A many to many join on two files each with >250 records requires 62,500 comparisons, much more reasonable. The problem >is that I need to repeat this process 40,000 times for a total of 2.5 >billion comparisons. I figure this approach may take a couple of days to >run, which of course is 40,000 times better than the first approach >requiring about 219 years! > >Anyway, I could write a macro to split each file into 40k individual files >and then perform individual many to many merges/joins and then assemble the >output data back together. I know, however, there are likely more clever >solutions lurking out there. For example, I just finished reading a SAS tip >on http://www.sconsig.com/ where Paul Dorfman (who else) demonstrates a many >to many merge data step algorithm (very cool). > > >Anyway, any help is very appriciated here. > >Thanks, >Nigel > >_________________________________________________________________ >Send and receive Hotmail on your mobile device: http://mobile.msn.com > > >_________________________________________________________________ >Chat with friends online, try MSN Messenger: http://messenger.msn.com


Back to: Top of message | Previous page | Main SAS-L page