Date: Tue, 15 Dec 1998 21:15:40 -0500
Reply-To: pdorfma@FL6612MAILEX4.UCS.ATT.COM
Sender: "SAS(r) Discussion" <SAS-L@UGA.CC.UGA.EDU>
From: pdorfma@FL6612MAILEX4.UCS.ATT.COM
Subject: XMAS SASTip: Quick Table Lookup by Hashing
Content-Type: text/plain; charset="iso-8859-1"
Dear SAS-Lers,
It is about time to exchange SAS gifts. Here is mine to you.
Consider the following (toooo very common) problem:
One file, SMALL, contains a variable SKEY. Another file, LARGE, contains a
variable LKEY and maybe other variables, for instance, SMTHELSE. Within the
limits of SAS, what is the most efficient way to match SMALL and LARGE by
SKEY and LKEY?
For certainty, assume that the number of records in LARGE is &N_LARGE =
10,000,000 and that the number of records in SMALL, &N_SMALL, may vary from
1,000 to 2,000,000. Let us also limit the case to KEYs being non-negative
integers. (In actuality, it is not a limitation, since any key, numeric or
character, can be converted to a natural number. We will discuss it at the
end). The situation can be simulated, for instance, like that:
%LET NSMALL = 1E5;
%LET NLARGE = 1E7;
%LET NRANGE = 1E9;
DATA SMALL (KEEP=SKEY) LARGE (KEEP=LKEY SMTHELSE);
DO I=1 TO &NSMALL;
SKEY = CEIL(RANUNI(1)*&NRANGE);
OUTPUT SMALL;
END;
DO I=1 TO &NLARGE;
LKEY = CEIL(RANUNI(2)*&NRANGE);
SMTHELSE = 'SMTHELSE';
OUTPUT LARGE;
END;
RUN;
Hence, KEY can assume all values from 0 to &NRANGE. If we reject methods not
based on storing a lookup object entirely in memory, it will leave us with
either using a format loaded with the distinct keys from SMALL, or a binary
search on a sorted array loaded with the keys from SMALL. Binary search is
faster in the load phase and easier on memory, but it is about two times
slower in the lookup phase. Is there a method that would search faster than
format search, load on par with binary search, and be good enough on memory
to accommodate a lookup table with 2 million plus entries?
The answer is YES, and it is based on extending the following idea. Imagine
that KEY contains no more than 5 digits, that is &NRANGE is 1E5. In this
case, we can allocate a temporary array with 100000 entries and index it
with the KEY values from SMALL. Then, every time a KEY value LARGE is read,
we will immediately know whether the search was successful or not by simply
examining the bucket whose number is the KEY value:
DATA MATCH;
ARRAY HKEY (0:100000) _TEMPORARY_;
DO UNTIL (SEND);
SET SMALL END=SEND;
HKEY (SKEY) = 1;
END;
DO UNTIL (LEND);
SET LARGE END=LEND;
IF HKEY (LKEY) THEN OUTPUT;
END;
RUN;
No other method can beat the simplicity and speed of this, "key-indexed",
search in its domain of applicability: it makes no more than 1 comparison
per search, success or failure. Too bad it is limited to short keys only!
Indeed, a 9-digit key, for example, would require creating a 1-billion array
which is practically impossible; besides, most of its space would be wasted.
Fortunately, there is a way around this problem. Let us allocate a temporary
array HKEY(0:&HSIZE) with &HSIZE equal to some prime number greater than
&NSMALL, and call the ratio &NSMALL/&HSIZE "load factor". If we then divide
each SKEY from SMALL by &HSIZE, the remainder, i.e. H = MOD(SKEY,&HSIZE),
will only range from 0 to &HSIZE-1. Now, we can try to use the value of H to
index array HKEY, in other words, to HASH keys to only as many as &HSIZE
locations. It is VERY probable that some keys will yield the same remainder
and, therefore, hash to the same HKEY(H) address, and it is not too good.
But as &HSIZE is a prime number, it is VERY improbable that many keys will
hash to one particular location, and it is very good! We will have to figure
out, though, how to resolve these collisions. One way would be to store a
colliding key in some empty node of the array HKEY and create a link
pointing to the node whence the key would have come if the node had been
empty. Thus, we will maintain &HSIZE linked lists, one for each possible
hash code H, with all lookup keys residing entirely inside the array HKEY;
another array, LINK(0:&HSIZE) will be needed to store the links. Then, if
the file LARGE supplies an item LKEY to search, the latter could be carried
out as follows:
1. H = MOD(LKEY,&HSIZE).
2. If HKEY (H) is empty, the search failed.
3. Else do a serial search on the list whose head begins at H.
The AVERAGE list length is equal to load factor, that is, less than 1!
Therefore, it could expected intuitively that making the table sparser will
result in a faster search, with linear increase in the memory usage. Now all
we have to do is express the above in the SAS language:
*** Pick some load factor less than 1 ;
%LET LOAD = .5;
*** Find first prime number greater than lookup;
*** dataset size divided by load factor;
DATA _NULL_;
P = CEIL(SIZE / &LOAD);
DO UNTIL (J = U + 1);
P + 1;
U = CEIL(SQRT(P));
DO J=2 TO U;
IF MOD(P,J) = 0 THEN LEAVE;
END;
END;
CALL SYMPUT('HSIZE',LEFT(PUT(P,BEST.)));
STOP;
SET SMALL NOBS=SIZE;
RUN;
*** Matching by coalesced list hashing;
DATA MATCH (KEEP=KEY SMTHELSE);
ARRAY HKEY (0 : &HSIZE) _TEMPORARY_; *** Hash table;
ARRAY LINK (0 : &HSIZE) _TEMPORARY_; *** Link table;
R = &HSIZE; *** Counter to help find empty nodes;
*** Load hash table;
DO UNTIL (LEND);
SET SMALL END=LEND;
H = MOD(SKEY, &HSIZE) + 1; *** Hash the key;
IF HKEY (H) > . THEN DO; *** If node not empty;
SETLINK:
IF SKEY = HKEY(H) THEN CONTINUE; *** duplicate key!;
IF LINK(H) NE 0 THEN DO;
H = LINK(H);
GOTO SETLINK;
END;
DO WHILE (HKEY(R) > .); *** find empty node;
R +-1;
END;
LINK(H) = R;
H = R;
END;
HKEY (H) = SKEY; *** Insert key;
LINK (H) = 0; *** and mark node as occupied;
END;
*** Perform hash search;
DO UNTIL (MEND);
SET LARGE END=MEND;
FOUND = .;
H = MOD(LKEY, &HSIZE) + 1; *** hash;
*** if node not empty search list;
IF HKEY (H) > . THEN DO;
SRCHLIST:
IF HKEY(H) = LKEY THEN FOUND = 1;
ELSE IF LINK(H) NE 0 THEN DO;
H = LINK(H);
GOTO SRCHLIST;
END;
END;
IF FOUND THEN OUTPUT;
END;
RUN;
An attentive SAS-Ler will note that the two arrays, HKEY and LINK, could be
combined into one 2-dimensional array. As it results in a slightly slower
execution, I left the things intact. However, I decided to make it easier on
myself in the future by encapsulating the code into 3 autocall macros: one
for determining the hash table size, one for loading the table, and one for
looking it up.
%MACRO HSIZE (DATA=, LOAD=.5);
%GLOBAL HSIZE;
DATA _NULL_;
P = CEIL(SIZE/&LOAD);
DO UNTIL (J = U + 1);
P + 1;
U = CEIL(SQRT(P));
DO J=2 TO U;
IF MOD(P,J) = 0 THEN LEAVE;
END;
END;
CALL SYMPUT('HSIZE',COMPRESS(PUT(P,32.)));
STOP;
SET &DATA NOBS=SIZE;
RUN;
%MEND HSIZE;
%MACRO HLOAD (DATA=, KEY=);
%LOCAL E R T;
%LET E = E%SUBSTR(%SYSFUNC(RANUNI(0)),3,7);
%LET R = R%SUBSTR(%SYSFUNC(RANUNI(0)),3,7);
%LET T = T%SUBSTR(%SYSFUNC(RANUNI(0)),3,7);
%GLOBAL H I L Z;
%LET H = H%SUBSTR(%SYSFUNC(RANUNI(0)),3,7);
%LET I = I%SUBSTR(%SYSFUNC(RANUNI(0)),3,7);
%LET L = L%SUBSTR(%SYSFUNC(RANUNI(0)),3,7);
%LET Z = Z%SUBSTR(%SYSFUNC(RANUNI(0)),3,7);
DROP &I &R;
ARRAY &H (0:&HSIZE) _TEMPORARY_;
ARRAY &L (0:&HSIZE) _TEMPORARY_;
&R = &HSIZE;
DO UNTIL (&E);
SET &DATA (KEEP=&KEY) END=&E;
&I = MOD (&KEY, &HSIZE) + 1;
IF &H (&I) > . THEN DO;
&T:
IF &KEY = &H (&I) THEN CONTINUE;
IF &L (&I) NE 0 THEN DO;
&I = &L (&I);
GOTO &T;
END;
DO WHILE ( &H (&R) > .);
&R +- 1;
END;
&L (&I) = &R;
&I = &R;
END;
&H (&I) = &KEY;
&L (&I) = 0;
END;
%MEND HLOAD;
%MACRO HSEARCH (KEY=, FOUND=FOUND);
DROP &FOUND;
&FOUND = .;
&I = MOD (&KEY, &HSIZE) + 1;
IF &H (&I) > . THEN DO;
&Z:
IF &H (&I) = &KEY THEN &FOUND = 1;
ELSE IF &L (&I) NE 0 THEN DO;
&I = &L (&I);
GOTO &Z;
END;
END;
%MEND HSEARCH;
That having been done, the solution to the matching problem boils down to
the following:
%HSIZE (DATA=SMALL,LOAD=.5);
DATA MATCH;
%HLOAD (DATA=SMALL,KEY=SKEY);
DO UNTIL (END);
SET LARGE END=END;
%HSEARCH (KEY=LKEY, FOUND=MATCH);
IF MATCH THEN OUTPUT;
END;
RUN;
In this final shape, the technique was benchmarked against formatting for
&NSMALL = 1E4, 1E5, 4.5E5, 1E6, and 2E6 with load factors 0.5, 0.75, and 1.
The "strange" number 450,000 is simply the largest lookup table size that
System OS/390 permitted me to feed into a format before the memory supply
was shut off at about 65 MB, so the larger numbers of &NSMALL pertain to
hashing only. The formatting approach was tested in the following form:
*** FORMATTED MATCH ***;
PROC SORT DATA=SMALL OUT=FMT NODUPKEY; BY SKEY; RUN;
DATA FMT (DROP=SKEY);
FMTNAME = 'KEYS'; TYPE = 'N';
DO UNTIL(DEND);
SET FMT END=DEND;
START = SKEY; LABEL = ' ';
OUTPUT;
END;
LABEL = '1'; HLO = 'O';
OUTPUT;
RUN;
PROC FORMAT CNTLIN=FMT; RUN;
DATA MATCH;
SET LARGE;
IF PUT(LKEY,KEYS.) = ' ';
RUN;
The testing (done in batch) resulted in the following performance figures,
with CPU time in seconds and memory usage in kilobytes:
*** FORMATTING ***
10,000 100,000 450,000 1,000,000
----------- ------------ ------------- ------------
TIME MEM TIME MEM TIME MEM TIME MEM
------------------- ----- ------ ----- ------- ----- ------ -----
LOAD 1.16 4649 11.12 11321 50.65 38870 . .
------------------- ----- ------ ----- ------- ----- ------ -----
SEARCH 105.33 5166 136.61 16958 166.94 63579 . .
------------------- ------ ------- ------
RUN 106.49 147.73 217.59 .
-----------------------------------------------------------------
*** COALESCING LIST HASHING ***
10,000 100,000 450,000 1,000,000 2,000,000
LOAD ----------- ----------- ----------- ---------- -----------
FACTOR STAGE TIME MEM TIME MEM TIME MEM TIME MEM TIME MEM
------------- -----------------------------------------------------------
0.50 LOAD 0.10 3538 0.70 6346 3.11 17282 6.96 34474 14.23 65722
------ ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
SEARCH 49.90 3676 53.31 6514 55.02 17450 55.63 34642 56.65 65890
------ ----- ----- ----- ----- -----
RUN 50.00 54.01 58.13 62.59 70.88
------ ------ ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
0.75 LOAD 0.10 3434 0.77 5306 3.39 12494 7.55 24058 15.21 44890
------ ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
SEARCH 55.82 3602 59.65 5474 63.64 12762 65.16 24226 66.76 45058
------ ----- ----- ----- ----- -----
RUN 55.92 60.42 67.03 72.71 81.97
------ ------ ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
1.00 LOAD 0.11 3386 0.84 4786 3.83 10250 8.38 18018 16.96 34474
------ ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
SEARCH 67.55 3554 72.49 4954 78.63 10418 80.71 18850 81.67 34642
------ ----- ----- ----- ----- -----
RUN 67.66 73.33 82.46 89.09 98.63
-------------------------------------------------------------------------
Based on these numbers, we can make a number of interesting conclusions:
*** The lower is the load factor, i.e. the sparser is the hash table, the
faster is the hash search, but it consumes proportionally more memory.
However, if the hash table is kept relatively sparse (with the load factor
less than 0.7), it possesses a fascinating property: Practically, LOOKUP
TIME DOES NOT DEPEND ON THE NUMBER OF KEYS IN THE LOOKUP TABLE. Indeed, for
the load factor 0.5, the search times range from 49.90 to 56.65 for 10000 <=
&NSMALL <= 2000000. In a separate test, I found out that all the difference
is due to the additional time spent on the memory allocation for the bigger
arrays. This property may explain the following facts.
*** Given the same amount of data to search and the same amount of memory to
spare, hashing gets the job done 2.5 to 3.7 times faster than formatting.
The larger is the set of lookup values, the greater the advantage.
Given the same period of time to complete the task, hashing is capable of
searching a lookup table containing at least 300 times more values than
formatting, the exact upper limit being determined by the available memory.
In fact, with the load factor 0.5 (hash table is half full), hashing
searches 2-million set of values 33 percent faster (70.88 CPU seconds) than
formatting does mere 10 thousand (106.49).
Given the same amount of memory, even the emptiest (i.e. the fastest) hash
table above (with load factor 0.5) can accommodate 4.4 times more entries
than the fullest format. Conversely, the same amount of search data can be
processed by hashing using significantly less memory than by formatting. Or,
if the memory is at stake, one may decide to sacrifice a little bit of
performance by setting the load factor to a higher value and still be able
to create a hash table with the desired number if items.
*** Hashing uses virtually the same amount of memory in both load and lookup
phases, since both stages reside in the same DATA step. Formatting consumes
significantly more memory (11 to 63 percent) being on call than while being
loaded, the gap increasing along with the number of search keys. So, even if
the format has compiled successfully, it by no means guarantees that it will
not run out of memory while being used for a subsequent search.
*** Formatting requires removing duplicate keys from a CNTLIN= dataset,
which is normally done by sorting. Contrary to that, none of the files have
to be sorted before hashing. Should a duplicate key occur, it is eliminated
automatically in the process of loading the hash table.
Before the "kind regards", two questions remain.
1) What if there is an extra variable XVAR in the dataset SMALL that you
would like to drag along in the matching process? First, the datasets SMALL
and MATCH, being "small", could be simply sorted and merged after
subsetting. Second, if the memory is plentiful, XVAR can also be added to
the hash table by augmenting the latter with another temporary array of the
proper type and length, say, HXTR(0:&HSIZE), and populating it with XVAR
values right after the line inserting SKEY. For instance, if XVAR is a
4-character string, then:
.....
ARRAY HXTR (0 : &HSIZE) $4 _TEMPORARY_;
.....
HKEY (H) = SKEY; *** Insert key;
HXTR (H) = XVAR; *** Insert extra variable;
.....
In the searching portion of the code, the line
IF HKEY(H) = LKEY THEN FOUND = 1;
will, naturally, have to be replaced by
IF HKEY(H) = LKEY THEN DO;
FOUND = 1;
XVAR = HXTR (H);
END;
2) What if SKEY and LKEY are not non-negative integers? If they are signed
floating-point numbers, the answer is obvious: convert them to integers by
rescaling and shifting as necessary, and plug that expression into the MOD
function instead of the corresponding key. In SAS, it works fast. If the
keys are strings, especially long ones, the situation is less attractive.
However, the enormous performance advantage offered by hashing may make the
efforts to convert the strings to integers worthwhile. Sometimes, it is
simple. Don Stanley has mentioned that he has his keys in the form
A83424652, where the first byte is A to Z; he converts the latter to an
integer by using the RANK function. In principle, this operation can be
applied to every non-digital byte of the key. If keys are less than 16 bytes
long, it might be faster, though, to convert them to integers through the
hexadecimal representation, that is, replace the key in the MOD function by
the expression
INPUT(PUT(KEY,$HEX16.),HEX16.).
It should not overflow if the bytes in the string do not concentrate in the
high digits of the 256 radix, and usually, they do not. Don's method is not
fully protected from the overflow, either. However, overflows can be dealt
with, anyway. A hint: All we actually need is not the whole number, but only
the remainder modulo &HSIZE. Anyone?
Conclusion: A fifty or so lines of DATA step code seems like a pretty cheap
price for being able to subset 10 million records by 2 million in about a
minute, in all.
Happy Holiday, everyone!
Kind regards,
Paul
++++++++++++++++++++++++++++++++
Paul M. Dorfman
Citibank Universal Card Services
Decision Support Systems
Jacksonville, FL
++++++++++++++++++++++++++++++++
|