LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (August 2001, week 1)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:   Thu, 2 Aug 2001 18:51:45 -0400
Reply-To:   "Dorfman, Paul" <Paul.Dorfman@BCBSFL.COM>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   "Dorfman, Paul" <Paul.Dorfman@BCBSFL.COM>
Subject:   Re: Match-Merge a Small to Large Dataset
Comments:   To: "Kitzmann, Daniel J." <kitzmann.daniel@MAYO.EDU>
Content-Type:   text/plain; charset=iso-8859-1

Daniel,

With so few lookup keys and relatively short (1E7) key range, there is a number of possibilities, depending on your system resources.

1. If you have plenty of memory, use the key-indexed search. To account for 1E7 possible numeric keys, you need a numeric array sized as [0:9999999], each cell taking up 8 bytes of memory. On my current Real Computer, the test step below maxs out at 82 Mb:

1 DATA _NULL_ ; 2 ARRAY A(10000000) _TEMPORARY_ ; 3 RUN ; NOTE: The DATA statement used 0.23 CPU seconds and 83989K.

If your system people are OK with you using a region of about 100 Mb region (just to be on the sure side, have enough memory left for buffers, etc.), then go with the key-indexing: It is both simplest and by far the fastest seraching method. Actually, at my current site, I have been using up to 1 Gb for a single SAS batch - so far, unpunished.

2. Of course, on *your* system, your system folks can say 'Fat chance'. Then you may recall that binary information (present-absent in this case) does not have to be stored in 8 bytes, i.e. 64 bits. It can be safely plugged in just 1 bit. Potentially, it saves you 64 times the memory compared to the key-indexing. In S/390 reality, it happens to be 56, for it is how many bits the mantissa of a numeric SAS variable has on the mainframe, and that is all that can be used (leave the exponent alone!). To bitmap 1E7 consecutive integers, we need 1E7 bits, i.e. 1.2 Mb, that is, nothing. You can find ready-to-go bitmapping code precisely matching your task both in SUGI 26 and in sas-l archives.

Bitmap will work slightly slower than the key-indexed table, for it requires more computations. However, it has advantages, too. If your key extends 1 digits, it would be no big deal, and a 12 Mb bitmap will swallow it with no problem; and you can handle even 9 digits if there is 120 Mb to use (pretty much trifle nowadays). You can store the bitmap in a file, and then simply load it in memory as necessary when subsetting the next file.

3. Then, of course, you can hash. With only 10,000 lookup keys, there is no need to resort to more memory-efficient, but also more complex, coalecsed list collision resolution policy designed to handle relatively tight tables. Instead, you can set the load factor as low as 0.1 (that is, make the table 90% sparse) and practically guarantee that you will have no collisions at all to resolve! In such a situation, the simplest collision resolution policy, linear probing, will work just fine - and be fastest, too, since it is less computationally intensive than both link listing and double hashing. Once again, in the paper, you will find code applied to your situation (I actually used exactly this sort of task to demonstrate/compare different direct-addresing lookup methods).

4. I see nothing wrong with compiling the 10,000 lookup keys in a format via CNTLIN= and using the latter to search for each key coming from the "large" file. With so few keys, the format compile time and its memory consumption are insignificant (they become extremely significant when the number of keys starts exceeding ~100,000.) And even though formats search slower than a very sparse hash table, it might be much less noticeable if the task is I/O bound - for example, if the "large" file has a long "tail".

5. Finally, you can simply use SQL equijoin and specify a large BUFFERSIZE=. I bet that already as low as with BUFFERSIZE=100000, the SQL optimizer will decide to memory-hash the 10,000 keys for you and effectively use the method you have in mind - automatically. Specify the system option MSGLEVEL=I and _METHOD option on the SQL. If in the log you see SQXJHSH among "the SQL execution methods chosen", that is it. Again, the SQL's internal hash will not work as fast as hand-coded hashes (especially key-indexing), but it will run faster than a format, and then - heck, you have got virtually nothing to code!

Kind regards, ====================== Paul M. Dorfman Jacksonville, Fl ======================

-----Original Message----- From: Kitzmann, Daniel J. [mailto:kitzmann.daniel@MAYO.EDU] Sent: Thursday, August 02, 2001 5:52 PM To: SAS-L@LISTSERV.UGA.EDU Subject: Match-Merge a Small to Large Dataset

Dear SAS-Lers:

Just soliciting advice on the preferred approach to the following problem:

I want to match-merge a SMALLd dataset (~10,000 obs) containing a unique KEYv variable with a series of LARGEd datasets (each ~2M obs), which contain multiple records with repeats on the same KEYvs. The desired MERGEd dataset will contain all records, including repeats on KEYv, from LARGEd that match on KEYv in SMALLd.

I've been reading the SAS-L archive literature on Key-Searching and Hashing, chiefly those posts of Paul Dorfman, as well as the relevant published SUGI papers. I am pleased to report that I think I am, for an autodidact programmer anyway, gradually albeit laboredly acquiring a verstehen for that is going on and why. Before proceeding too far, however, I just want to confirm that I am looking down the apt alley here.

Some additional facts: I'm working in the OS/390 batch environment. KEY is indeed an integer, and even though the number of KEYv obs in SMALLd is merely around 10,000, KEYv's theoretical range spans seven digits. Am I correct in therefore supposing that the Coalescing List Hashing is the method to employ? Thank you kindly in advance.

Cordially, Dan kitzmann.daniel@mayo.edu

Blue Cross Blue Shield of Florida, Inc., and its subsidiary and affiliate companies are not responsible for errors or omissions in this e-mail message. Any personal comments made in this e-mail do not reflect the views of Blue Cross Blue Shield of Florida, Inc.


Back to: Top of message | Previous page | Main SAS-L page