Date: Thu, 14 Mar 2002 08:03:42 -0800
Reply-To: Sigurd Wilson Hermansen <hermans1@WESTAT.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Sigurd Wilson Hermansen <hermans1@WESTAT.COM>
Organization: http://groups.google.com/
Subject: Re: Many many to many merge/joins
Content-Type: text/plain; charset=ISO-8859-1
I couldn't possibly live up to Paul's advance billing, but I will
suggest a couple of possibilities for you to consider. First, and
probably something that you know already, National Change of Address
(NCOA) contractors provide updates for US postal addresses, and, for a
relatively small charge per address, standardize all addresses
returned. As much as I enjoy the challenge of fuzzy linkage of data
tables, for infrequent address standardization tasks I'd rather spend
a few thousand to have the NCOA and CASS parts of the task done by
specialists than spend tens of thousands either buying or developing
programs to do the same thing. If you plan to make CASS an ongoing
business venture, that's a different situation. As a second
suggestion, I'd advise you to take a close look at Paul's detailed
descriptions of indexing methods. I started out using SAS formats for
searching until Paul demonstrated beyond a doubt that his indexing
methods worked better in every way. That forced me to learn something
about indexing. From there we collaborated on a sequence of procedures
that first screen large volumes of data for potential matches and then
assign scores to pairs of potential matches. We linked to multiple,
overlapping indexes during a single scan of each file. You'll find
details in the SeUGI paper. Further, in the presentation at SeUGI but
not in the published paper, I distinguished unary matching operators
and functions from binary and n-ary matching operators and functions.
You can create indexes for unary functions in advance.
As for the idea of splitting the files, we found that splitting only
one file works better. The split has to match up to memory avaiable to
hold multiple indexes. Since each extra split requires another pass
through one of the data files, the fewer splits the better. The SAS
FIRSTOBS and OBS options work well for this purpose. A basic platform
with 200MB of free memory supports several hash indexes of modest
bandwidth created for a file of 200K records. With a GB of free
memory, you could likely support indexing of 1MM records at a time.
After screening files for possible matches, a scoring procedure
invokes a SQL join of the results of screening on matching index
values. In this last stage the binary 'similarity score' functions
such as the SAS SPEDIS() function help distinguish coincidental
matches from likely true matches.
One program can include all of these procedures. It takes time to
churn through the files, but the screening procedure reads each record
only one time. This means that n(O)=n1+n2, or, in other words, the
number of comparisons of records amounts to a additive composition of
files with record counts n1 and n2, not a multiplicative one.
Processing times for a Cartesian product increase geometrically with
the sizes of the files, so a process for which processing times
increase additively offers a great advantage. Splitting one file into
m equal partitions increases processing time by a factor m(n1m+n2),
still much better for large files than n1Xn2.
Incidentally, the screening procedure does not require any form of
sorting or ordering of the files. It takes advantage of SAS's ability
to romp thru sequential data records at a blazing pace. Oracle and
other RDBMS's lack this capability.
Sig
dousk8@HOTMAIL.COM (Nigel Tufnel) wrote in message news:<F100REfjkeI2QQUR5oS0001400b@hotmail.com>...
> I am working on a project to compare name and address on 2 files. I believe
> that I need to perform a many to many merge/join because I have no match
> "key." I need to compare all records on the right table to all records on
> the left table to calculate a match score and then keep only those that
> score above a certain threshold.
>
> I developed some code and tested it on small files. Small samples of my
> files work well, but as you have guessed, the runtime explodes when the file
> sizes grow. For example, the full size files are both approximately 10 MM
> records each. A many to many merge/join requires 100 trillion comparisons!
> Yikes.
>
> I've figured that I can reduce the number of comparisons if I believe one
> piece of address information is accurate. That is, I'm assuming that the
> zip code on both files is sufficiently accurate. So, if there are roughly
> 40k zips in the US, I'd have about 250 records per zip (just quick back of
> the envelope type calculations). A many to many join on two files each with
> 250 records requires 62,500 comparisons, much more reasonable. The problem
> is that I need to repeat this process 40,000 times for a total of 2.5
> billion comparisons. I figure this approach may take a couple of days to
> run, which of course is 40,000 times better than the first approach
> requiring about 219 years!
>
> Anyway, I could write a macro to split each file into 40k individual files
> and then perform individual many to many merges/joins and then assemble the
> output data back together. I know, however, there are likely more clever
> solutions lurking out there. For example, I just finished reading a SAS tip
> on http://www.sconsig.com/ where Paul Dorfman (who else) demonstrates a many
> to many merge data step algorithm (very cool).
>
>
> Anyway, any help is very appriciated here.
>
> Thanks,
> Nigel
>
> _________________________________________________________________
> Send and receive Hotmail on your mobile device: http://mobile.msn.com
|