With so few lookup keys and a relatively short key range (1E7), there are a
number of possibilities, depending on your system resources.
1. If you have plenty of memory, use the key-indexed search. To account for
1E7 possible numeric keys, you need a numeric array sized as [0:9999999],
each cell taking up 8 bytes of memory. On my current Real Computer, the
test step below maxes out at about 82 Mb:
DATA _NULL_ ;
   ARRAY A (10000000) _TEMPORARY_ ;
RUN ;
NOTE: The DATA statement used 0.23 CPU seconds and 83989K.
If your system people are OK with you using a region of about 100 Mb (just
to be on the safe side, to have enough memory left for buffers, etc.), then
go with the key-indexing: It is both the simplest and by far the fastest
searching method; a bare-bones sketch follows. Actually, at my current site,
I have been using up to 1 Gb for a single SAS batch - so far, unpunished.
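Here is what such a step might look like (untested; it assumes the names
SMALLd, LARGEd, KEYv from your post, and that KEYv is an integer between 0
and 9999999):

DATA MATCH ;
   ARRAY K (0:9999999) _TEMPORARY_ ;      /* 1 cell per possible key value */
   IF _N_ = 1 THEN DO UNTIL (EOF) ;       /* flag each lookup key          */
      SET SMALLD (KEEP=KEYV) END=EOF ;
      K(KEYV) = 1 ;
   END ;
   SET LARGED ;                           /* direct-address the array      */
   IF K(KEYV) = 1 ;                       /* keep matches only             */
RUN ;

The entire search is a single array reference per record, which is why
nothing can beat it for speed.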
2. Of course, on *your* system, your system folks can say 'Fat chance'. Then
you may recall that binary information (present/absent, in this case) does
not have to be stored in 8 bytes, i.e. 64 bits: It can be safely plugged into
just 1 bit. Potentially, that saves you 64 times the memory compared to the
key-indexing. In S/390 reality, the factor happens to be 56, for that is how
many bits the mantissa of a numeric SAS variable has on the mainframe, and
that is all that can be used (leave the exponent alone!). To bitmap 1E7
consecutive integers, we need 1E7 bits, i.e. about 1.2 Mb - that is, nothing.
You can find ready-to-go bitmapping code precisely matching your task both in
SUGI 26 and in the SAS-L archives.
The bitmap will work slightly slower than the key-indexed table, for it
requires more computations. However, it has advantages, too. If your key grew
by 1 digit, to 8, it would be no big deal: a 12 Mb bitmap would swallow it
with no problem; and you can handle even 9 digits if there is 120 Mb to use
(pretty much a trifle nowadays). You can also store the bitmap in a file, and
then simply load it into memory as necessary when subsetting the next file.
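If you want the flavor of it before digging the papers up, here is a rough
arithmetic-bitmap sketch (not the SUGI 26 code itself, just the gist; it
assumes 56 usable mantissa bits per cell, as on S/390 - on an IEEE platform,
53 would be the safe figure - and the same names as above):

DATA MATCH ;
   ARRAY BM (0:178571) _TEMPORARY_ ;          /* CEIL(1E7/56) cells ~ 1.4 Mb */
   IF _N_ = 1 THEN DO ;
      DO J = 0 TO 178571 ;                    /* zero out the map            */
         BM(J) = 0 ;
      END ;
      DO UNTIL (EOF) ;                        /* set 1 bit per lookup key    */
         SET SMALLD (KEEP=KEYV) END=EOF ;
         CELL = INT (KEYV / 56) ;
         BIT  = 2 ** MOD (KEYV, 56) ;
         IF NOT MOD (INT (BM(CELL) / BIT), 2) THEN BM(CELL) + BIT ;
      END ;
   END ;
   SET LARGED ;                               /* test the key's bit          */
   CELL = INT (KEYV / 56) ;
   BIT  = 2 ** MOD (KEYV, 56) ;
   IF MOD (INT (BM(CELL) / BIT), 2) ;         /* keep matches only           */
   DROP J CELL BIT ;
RUN ;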
3. Then, of course, you can hash. With only 10,000 lookup keys, there is no
need to resort to the more memory-efficient, but also more complex,
coalesced-list collision resolution policy designed to handle relatively
tight tables. Instead, you can set the load factor as low as 0.1 (that is,
make the table 90% sparse) and practically guarantee that you will have no
collisions at all to resolve! In such a situation, the simplest collision
resolution policy, linear probing, will work just fine - and be fastest, too,
since it is less computationally intensive than both link listing and double
hashing. Once again, in the paper, you will find code applied to your
situation (I actually used exactly this sort of task to demonstrate/compare
different direct-addressing lookup methods).
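In the meantime, a bare-bones linear-probing sketch (not the paper's verbatim
code; the table size 100003 is a prime giving a load factor of about 0.1, and
the names are assumed as above):

DATA MATCH ;
   ARRAY H (0:100002) _TEMPORARY_ ;           /* ~ 10 * 10,000 keys          */
   IF _N_ = 1 THEN DO UNTIL (EOF) ;           /* load the small table        */
      SET SMALLD (KEEP=KEYV) END=EOF ;
      A = MOD (KEYV, 100003) ;
      DO WHILE (H(A) NE . AND H(A) NE KEYV) ; /* step ahead on a collision   */
         A = MOD (A + 1, 100003) ;
      END ;
      H(A) = KEYV ;
   END ;
   SET LARGED ;                               /* search for each large key   */
   A = MOD (KEYV, 100003) ;
   DO WHILE (H(A) NE . AND H(A) NE KEYV) ;
      A = MOD (A + 1, 100003) ;
   END ;
   IF H(A) = KEYV ;                           /* keep matches only           */
   DROP A ;
RUN ;

With the table this sparse, the DO WHILE loops will almost never iterate even
once.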
4. I see nothing wrong with compiling the 10,000 lookup keys into a format
via CNTLIN= and using the latter to search for each key coming from the
"large" file. With so few keys, the format compile time and its memory
consumption are insignificant (they become extremely significant when the
number of keys starts exceeding ~100,000). And even though a format searches
more slowly than a very sparse hash table, the difference might be much less
noticeable if the task is I/O-bound - for example, if the "large" file has a
long "tail".
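A skeletal version (again with the names assumed as above, and assuming the
keys in SMALLd are unique, or PROC FORMAT will complain about overlapping
ranges):

DATA CNTLIN ;
   RETAIN FMTNAME 'LKUP' TYPE 'N' ;
   SET SMALLD (KEEP=KEYV RENAME=(KEYV=START)) END=EOF ;
   LABEL = 'Y' ;
   OUTPUT ;
   IF EOF THEN DO ;                /* catch-all range for non-matches */
      HLO   = 'O' ;
      LABEL = 'N' ;
      OUTPUT ;
   END ;
RUN ;
PROC FORMAT CNTLIN=CNTLIN ;
RUN ;
DATA MATCH ;
   SET LARGED ;
   IF PUT (KEYV, LKUP.) = 'Y' ;    /* keep only keys found in SMALLd */
RUN ;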
5. Finally, you can simply use an SQL equijoin and specify a large
BUFFERSIZE=. I bet that with BUFFERSIZE= as low as 100000, the SQL optimizer
will already decide to memory-hash the 10,000 keys for you and effectively
use the method you have in mind - automatically. Specify the system option
MSGLEVEL=I and the _METHOD option on PROC SQL. If in the log you see SQXJHSH
among "the SQL execution methods chosen", that is it. Again, SQL's internal
hash will not work as fast as the hand-coded hashes (especially
key-indexing), but it will run faster than a format, and then - heck, you
have got virtually nothing to code.
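For instance (names assumed as before):

OPTIONS MSGLEVEL = I ;
PROC SQL _METHOD BUFFERSIZE = 100000 ;
   CREATE TABLE MATCH AS
   SELECT L.*
   FROM   LARGED L, SMALLD S
   WHERE  L.KEYV = S.KEYV ;
QUIT ;

If SQXJHSH shows up in the log, the optimizer has hashed the small table in
memory and merely read the large one past it.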
Paul M. Dorfman
From: Kitzmann, Daniel J. [mailto:kitzmann.daniel@MAYO.EDU]
Sent: Thursday, August 02, 2001 5:52 PM
Subject: Match-Merge a Small to Large Dataset
Just soliciting advice on the preferred approach to the following problem:
I want to match-merge a SMALLd dataset (~10,000 obs) containing a unique
KEYv variable with a series of LARGEd datasets (each ~2M obs), which contain
multiple records with repeats on the same KEYvs. The desired MERGEd dataset
will contain all records, including repeats on KEYv, from LARGEd that match
on KEYv in SMALLd.
I've been reading the SAS-L archive literature on Key-Searching and Hashing,
chiefly those posts of Paul Dorfman, as well as the relevant published SUGI
papers. I am pleased to report that I think I am, for an autodidact
programmer anyway, gradually albeit laboriously acquiring a verstehen of what
is going on and why. Before proceeding too far, however, I just want to
confirm that I am looking down the apt alley here.
Some additional facts: I'm working in the OS/390 batch environment. KEYv is
indeed an integer, and even though the number of KEYv obs in SMALLd is
merely around 10,000, KEYv's theoretical range spans seven digits. Am I
correct in therefore supposing that coalesced-list hashing is the method to
employ? Thank you kindly in advance.