Date: Wed, 13 Mar 2002 10:54:40 -0500
Reply-To: Kevin Delaney <khd8@CDC.GOV>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Kevin Delaney <khd8@CDC.GOV>
Subject: Re: Many many to many merge/joins
Does anyone know of a site that has the SEUGI proceedings online??
On Wed, 13 Mar 2002 05:50:24 +0000, Paul Dorfman <paul_dorfman@HOTMAIL.COM> wrote:
>PMD might know a thing or two about coding stuff in a Data step, but what
>you need is a *strategy* of detecting the highest matching score with the
>least amount of computer work. It is not a trivial task at all. However, you
>are in luck, as one of our fellow SAS-lers, Sigurd Wilson Hermansen, is the
>world-leading expert in the field of fuzzy and/or probabilistic record
>linkage, a specialist whose expertise goes far beyond the evident. Having
>worked with SWH on a couple of projects of this sort, PMD can attest with a
>full measure of confidence that SWH knows his fuzz. I am sure that, despite
>being quite busy running things in the SAS Mecca, he will chime in.
>At any rate, were I you, I would dig out SWH's paper on probabilistic
>linkage from the SEUGI Proceedings in Florence and start delving in head
>first.
>Paul Michael Dorfman
>----Original Message Follows----
>From: Nigel Tufnel <dousk8@HOTMAIL.COM>
>Reply-To: Nigel Tufnel <dousk8@HOTMAIL.COM>
>Subject: Many many to many merge/joins
>Date: Tue, 12 Mar 2002 20:24:36 -0500
>I am working on a project to compare name and address on 2 files. I believe
>that I need to perform a many-to-many merge/join because I have no match
>"key." I need to compare all records on the right table to all records on
>the left table to calculate a match score and then keep only those that
>score above a certain threshold.
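>To make the brute-force idea concrete, here is roughly what I have in mind
>in PROC SQL (the dataset names LEFTF/RIGHTF, the ID and NAME variables, and
>the SPEDIS cutoff of 20 are just placeholders for illustration):
>
>  proc sql;
>    create table matches as
>    select a.id as left_id
>         , b.id as right_id
>         , spedis(upcase(a.name), upcase(b.name)) as score
>    from leftf a, rightf b        /* full cartesian product: n*m pairs     */
>    where calculated score <= 20; /* SPEDIS: smaller = closer, 0 = exact,  */
>  quit;                           /* so "above threshold" means <= cutoff  */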
>I developed some code and tested it on small files. Small samples of my
>files work well, but as you have guessed, the runtime explodes when the
>sizes grow. For example, the full-size files are approximately 10 MM
>records each. A many-to-many merge/join would require 100 trillion comparisons!
>I've figured that I can reduce the number of comparisons if I believe one
>piece of address information is accurate. That is, I'm assuming that the
>zip code on both files is sufficiently accurate. So, if there are roughly
>40k zips in the US, I'd have about 250 records per zip (just a quick
>back-of-the-envelope calculation). A many-to-many join on two files of 250
>records each requires 62,500 comparisons, which is much more reasonable. The problem
>is that I need to repeat this process 40,000 times for a total of 2.5
>billion comparisons. I figure this approach may take a couple of days to
>run, which of course is 40,000 times better than the first approach
>requiring about 219 years!
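>In PROC SQL, I think the zip-blocking idea boils down to adding an equijoin
>on zip, so the fuzzy comparison only runs within each zip block (same
>placeholder names and cutoff as above):
>
>  proc sql;
>    create table matches as
>    select a.id as left_id
>         , b.id as right_id
>         , spedis(upcase(a.name), upcase(b.name)) as score
>    from leftf a, rightf b
>    where a.zip = b.zip           /* compare only within one zip at a time */
>      and calculated score <= 20;
>  quit;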
>Anyway, I could write a macro to split each file into 40k individual files,
>perform the individual many-to-many merges/joins, and then assemble the
>output data back together. I know, however, that there are likely more clever
>solutions lurking out there. For example, I just finished reading a SAS paper
>on http://www.sconsig.com/ where Paul Dorfman (who else) demonstrates a
>many-to-many merge DATA step algorithm (very cool).
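>Along those lines, here is my rough DATA step sketch of the same blocking
>idea without splitting any files: record the observation range of each zip
>in the right file, then POINT= into just that range for every left record.
>This is only my loose take on the technique, and again all dataset and
>variable names and the cutoff are placeholders:
>
>  proc sort data=leftf;  by zip; run;
>  proc sort data=rightf; by zip; run;
>
>  /* Pass 1: first and last observation number of each zip block in RIGHTF */
>  data ranges(keep=zip lo hi);
>    set rightf;
>    by zip;
>    retain lo;
>    if first.zip then lo = _n_;
>    if last.zip then do;
>      hi = _n_;
>      output;
>    end;
>  run;
>
>  /* Pass 2: for each left record, point only into its own zip block */
>  data matches;
>    merge leftf(in=a) ranges(in=b);
>    by zip;
>    if a and b;                      /* zips present in both files only */
>    do p = lo to hi;
>      set rightf(rename=(id=r_id name=r_name zip=r_zip)) point=p;
>      score = spedis(upcase(name), upcase(r_name));
>      if score <= 20 then output;    /* SPEDIS: smaller = closer match  */
>    end;
>    drop lo hi r_zip;
>  run;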
>Anyway, any help is very much appreciated here.