LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (March 2002, week 2)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Thu, 14 Mar 2002 08:03:42 -0800
Reply-To:     Sigurd Wilson Hermansen <hermans1@WESTAT.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Sigurd Wilson Hermansen <hermans1@WESTAT.COM>
Organization: http://groups.google.com/
Subject:      Re: Many many to many merge/joins
Content-Type: text/plain; charset=ISO-8859-1

I couldn't possibly live up to Paul's advance billing, but I will suggest a couple of possibilities for you to consider. First, and probably something that you know already, National Change of Address (NCOA) contractors provide updates for US postal addresses, and, for a relatively small charge per address, standardize all addresses returned. As much as I enjoy the challenge of fuzzy linkage of data tables, for infrequent address standardization tasks I'd rather spend a few thousand to have the NCOA and CASS parts of the task done by specialists than spend tens of thousands either buying or developing programs to do the same thing. If you plan to make CASS an ongoing business venture, that's a different situation. As a second suggestion, I'd advise you to take a close look at Paul's detailed descriptions of indexing methods. I started out using SAS formats for searching until Paul demonstrated beyond a doubt that his indexing methods worked better in every way. That forced me to learn something about indexing. From there we collaborated on a sequence of procedures that first screen large volumes of data for potential matches and then assign scores to pairs of potential matches. We linked to multiple, overlapping indexes during a single scan of each file. You'll find details in the SeUGI paper. Further, in the presentation at SeUGI but not in the published paper, I distinguished unary matching operators and functions from binary and n-ary matching operators and functions. You can create indexes for unary functions in advance.

As for the idea of splitting the files, we found that splitting only one file works better. The split has to match up to memory avaiable to hold multiple indexes. Since each extra split requires another pass through one of the data files, the fewer splits the better. The SAS FIRSTOBS and OBS options work well for this purpose. A basic platform with 200MB of free memory supports several hash indexes of modest bandwidth created for a file of 200K records. With a GB of free memory, you could likely support indexing of 1MM records at a time.

After screening files for possible matches, a scoring procedure invokes a SQL join of the results of screening on matching index values. In this last stage the binary 'similarity score' functions such as the SAS SPEDIS() function help distinguish coincidental matches from likely true matches.

One program can include all of these procedures. It takes time to churn through the files, but the screening procedure reads each record only one time. This means that n(O)=n1+n2, or, in other words, the number of comparisons of records amounts to a additive composition of files with record counts n1 and n2, not a multiplicative one. Processing times for a Cartesian product increase geometrically with the sizes of the files, so a process for which processing times increase additively offers a great advantage. Splitting one file into m equal partitions increases processing time by a factor m(n1m+n2), still much better for large files than n1Xn2.

Incidentally, the screening procedure does not require any form of sorting or ordering of the files. It takes advantage of SAS's ability to romp thru sequential data records at a blazing pace. Oracle and other RDBMS's lack this capability.

Sig

dousk8@HOTMAIL.COM (Nigel Tufnel) wrote in message news:<F100REfjkeI2QQUR5oS0001400b@hotmail.com>... > I am working on a project to compare name and address on 2 files. I believe > that I need to perform a many to many merge/join because I have no match > "key." I need to compare all records on the right table to all records on > the left table to calculate a match score and then keep only those that > score above a certain threshold. > > I developed some code and tested it on small files. Small samples of my > files work well, but as you have guessed, the runtime explodes when the file > sizes grow. For example, the full size files are both approximately 10 MM > records each. A many to many merge/join requires 100 trillion comparisons! > Yikes. > > I've figured that I can reduce the number of comparisons if I believe one > piece of address information is accurate. That is, I'm assuming that the > zip code on both files is sufficiently accurate. So, if there are roughly > 40k zips in the US, I'd have about 250 records per zip (just quick back of > the envelope type calculations). A many to many join on two files each with > 250 records requires 62,500 comparisons, much more reasonable. The problem > is that I need to repeat this process 40,000 times for a total of 2.5 > billion comparisons. I figure this approach may take a couple of days to > run, which of course is 40,000 times better than the first approach > requiring about 219 years! > > Anyway, I could write a macro to split each file into 40k individual files > and then perform individual many to many merges/joins and then assemble the > output data back together. I know, however, there are likely more clever > solutions lurking out there. For example, I just finished reading a SAS tip > on http://www.sconsig.com/ where Paul Dorfman (who else) demonstrates a many > to many merge data step algorithm (very cool). > > > Anyway, any help is very appriciated here. > > Thanks, > Nigel > > _________________________________________________________________ > Send and receive Hotmail on your mobile device: http://mobile.msn.com


Back to: Top of message | Previous page | Main SAS-L page