LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (December 1998, week 3)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Tue, 15 Dec 1998 21:15:40 -0500
Reply-To:     pdorfma@FL6612MAILEX4.UCS.ATT.COM
Sender:       "SAS(r) Discussion" <SAS-L@UGA.CC.UGA.EDU>
From:         pdorfma@FL6612MAILEX4.UCS.ATT.COM
Subject:      XMAS SASTip: Quick Table Lookup by Hashing
Comments: To: "don_stanley@ibm.net" <don_stanley@ibm.net>
Content-Type: text/plain; charset="iso-8859-1"

Dear SAS-Lers,

It is about time to exchange SAS gifts. Here is mine to you.

Consider the following (toooo very common) problem:

One file, SMALL, contains a variable SKEY. Another file, LARGE, contains a variable LKEY and maybe other variables, for instance, SMTHELSE. Within the limits of SAS, what is the most efficient way to match SMALL and LARGE by SKEY and LKEY?

For certainty, assume that the number of records in LARGE is &N_LARGE = 10,000,000 and that the number of records in SMALL, &N_SMALL, may vary from 1,000 to 2,000,000. Let us also limit the case to KEYs being non-negative integers. (In actuality, it is not a limitation, since any key, numeric or character, can be converted to a natural number. We will discuss it at the end). The situation can be simulated, for instance, like that:

%LET NSMALL = 1E5; %LET NLARGE = 1E7; %LET NRANGE = 1E9;

DATA SMALL (KEEP=SKEY) LARGE (KEEP=LKEY SMTHELSE); DO I=1 TO &NSMALL; SKEY = CEIL(RANUNI(1)*&NRANGE); OUTPUT SMALL; END; DO I=1 TO &NLARGE; LKEY = CEIL(RANUNI(2)*&NRANGE); SMTHELSE = 'SMTHELSE'; OUTPUT LARGE; END; RUN;

Hence, KEY can assume all values from 0 to &NRANGE. If we reject methods not based on storing a lookup object entirely in memory, it will leave us with either using a format loaded with the distinct keys from SMALL, or a binary search on a sorted array loaded with the keys from SMALL. Binary search is faster in the load phase and easier on memory, but it is about two times slower in the lookup phase. Is there a method that would search faster than format search, load on par with binary search, and be good enough on memory to accommodate a lookup table with 2 million plus entries?

The answer is YES, and it is based on extending the following idea. Imagine that KEY contains no more than 5 digits, that is &NRANGE is 1E5. In this case, we can allocate a temporary array with 100000 entries and index it with the KEY values from SMALL. Then, every time a KEY value LARGE is read, we will immediately know whether the search was successful or not by simply examining the bucket whose number is the KEY value:

DATA MATCH; ARRAY HKEY (0:100000) _TEMPORARY_; DO UNTIL (SEND); SET SMALL END=SEND; HKEY (SKEY) = 1; END; DO UNTIL (LEND); SET LARGE END=LEND; IF HKEY (LKEY) THEN OUTPUT; END; RUN;

No other method can beat the simplicity and speed of this, "key-indexed", search in its domain of applicability: it makes no more than 1 comparison per search, success or failure. Too bad it is limited to short keys only! Indeed, a 9-digit key, for example, would require creating a 1-billion array which is practically impossible; besides, most of its space would be wasted.

Fortunately, there is a way around this problem. Let us allocate a temporary array HKEY(0:&HSIZE) with &HSIZE equal to some prime number greater than &NSMALL, and call the ratio &NSMALL/&HSIZE "load factor". If we then divide each SKEY from SMALL by &HSIZE, the remainder, i.e. H = MOD(SKEY,&HSIZE), will only range from 0 to &HSIZE-1. Now, we can try to use the value of H to index array HKEY, in other words, to HASH keys to only as many as &HSIZE locations. It is VERY probable that some keys will yield the same remainder and, therefore, hash to the same HKEY(H) address, and it is not too good. But as &HSIZE is a prime number, it is VERY improbable that many keys will hash to one particular location, and it is very good! We will have to figure out, though, how to resolve these collisions. One way would be to store a colliding key in some empty node of the array HKEY and create a link pointing to the node whence the key would have come if the node had been empty. Thus, we will maintain &HSIZE linked lists, one for each possible hash code H, with all lookup keys residing entirely inside the array HKEY; another array, LINK(0:&HSIZE) will be needed to store the links. Then, if the file LARGE supplies an item LKEY to search, the latter could be carried out as follows:

1. H = MOD(LKEY,&HSIZE). 2. If HKEY (H) is empty, the search failed. 3. Else do a serial search on the list whose head begins at H.

The AVERAGE list length is equal to load factor, that is, less than 1! Therefore, it could expected intuitively that making the table sparser will result in a faster search, with linear increase in the memory usage. Now all we have to do is express the above in the SAS language:

*** Pick some load factor less than 1 ;

%LET LOAD = .5;

*** Find first prime number greater than lookup; *** dataset size divided by load factor;

DATA _NULL_; P = CEIL(SIZE / &LOAD); DO UNTIL (J = U + 1); P + 1; U = CEIL(SQRT(P)); DO J=2 TO U; IF MOD(P,J) = 0 THEN LEAVE; END; END; CALL SYMPUT('HSIZE',LEFT(PUT(P,BEST.))); STOP; SET SMALL NOBS=SIZE; RUN;

*** Matching by coalesced list hashing;

DATA MATCH (KEEP=KEY SMTHELSE); ARRAY HKEY (0 : &HSIZE) _TEMPORARY_; *** Hash table; ARRAY LINK (0 : &HSIZE) _TEMPORARY_; *** Link table;

R = &HSIZE; *** Counter to help find empty nodes;

*** Load hash table; DO UNTIL (LEND); SET SMALL END=LEND; H = MOD(SKEY, &HSIZE) + 1; *** Hash the key; IF HKEY (H) > . THEN DO; *** If node not empty; SETLINK: IF SKEY = HKEY(H) THEN CONTINUE; *** duplicate key!; IF LINK(H) NE 0 THEN DO; H = LINK(H); GOTO SETLINK; END; DO WHILE (HKEY(R) > .); *** find empty node; R +-1; END; LINK(H) = R; H = R; END; HKEY (H) = SKEY; *** Insert key; LINK (H) = 0; *** and mark node as occupied; END;

*** Perform hash search; DO UNTIL (MEND); SET LARGE END=MEND; FOUND = .; H = MOD(LKEY, &HSIZE) + 1; *** hash;

*** if node not empty search list; IF HKEY (H) > . THEN DO; SRCHLIST: IF HKEY(H) = LKEY THEN FOUND = 1; ELSE IF LINK(H) NE 0 THEN DO; H = LINK(H); GOTO SRCHLIST; END; END; IF FOUND THEN OUTPUT; END; RUN;

An attentive SAS-Ler will note that the two arrays, HKEY and LINK, could be combined into one 2-dimensional array. As it results in a slightly slower execution, I left the things intact. However, I decided to make it easier on myself in the future by encapsulating the code into 3 autocall macros: one for determining the hash table size, one for loading the table, and one for looking it up.

%MACRO HSIZE (DATA=, LOAD=.5); %GLOBAL HSIZE; DATA _NULL_; P = CEIL(SIZE/&LOAD); DO UNTIL (J = U + 1); P + 1; U = CEIL(SQRT(P)); DO J=2 TO U; IF MOD(P,J) = 0 THEN LEAVE; END; END; CALL SYMPUT('HSIZE',COMPRESS(PUT(P,32.))); STOP; SET &DATA NOBS=SIZE; RUN; %MEND HSIZE;

%MACRO HLOAD (DATA=, KEY=); %LOCAL E R T; %LET E = E%SUBSTR(%SYSFUNC(RANUNI(0)),3,7); %LET R = R%SUBSTR(%SYSFUNC(RANUNI(0)),3,7); %LET T = T%SUBSTR(%SYSFUNC(RANUNI(0)),3,7); %GLOBAL H I L Z; %LET H = H%SUBSTR(%SYSFUNC(RANUNI(0)),3,7); %LET I = I%SUBSTR(%SYSFUNC(RANUNI(0)),3,7); %LET L = L%SUBSTR(%SYSFUNC(RANUNI(0)),3,7); %LET Z = Z%SUBSTR(%SYSFUNC(RANUNI(0)),3,7); DROP &I &R; ARRAY &H (0:&HSIZE) _TEMPORARY_; ARRAY &L (0:&HSIZE) _TEMPORARY_; &R = &HSIZE; DO UNTIL (&E); SET &DATA (KEEP=&KEY) END=&E; &I = MOD (&KEY, &HSIZE) + 1; IF &H (&I) > . THEN DO; &T: IF &KEY = &H (&I) THEN CONTINUE; IF &L (&I) NE 0 THEN DO; &I = &L (&I); GOTO &T; END; DO WHILE ( &H (&R) > .); &R +- 1; END; &L (&I) = &R; &I = &R; END; &H (&I) = &KEY; &L (&I) = 0; END; %MEND HLOAD;

%MACRO HSEARCH (KEY=, FOUND=FOUND); DROP &FOUND; &FOUND = .; &I = MOD (&KEY, &HSIZE) + 1; IF &H (&I) > . THEN DO; &Z: IF &H (&I) = &KEY THEN &FOUND = 1; ELSE IF &L (&I) NE 0 THEN DO; &I = &L (&I); GOTO &Z; END; END; %MEND HSEARCH;

That having been done, the solution to the matching problem boils down to the following:

%HSIZE (DATA=SMALL,LOAD=.5);

DATA MATCH; %HLOAD (DATA=SMALL,KEY=SKEY); DO UNTIL (END); SET LARGE END=END; %HSEARCH (KEY=LKEY, FOUND=MATCH); IF MATCH THEN OUTPUT; END; RUN;

In this final shape, the technique was benchmarked against formatting for &NSMALL = 1E4, 1E5, 4.5E5, 1E6, and 2E6 with load factors 0.5, 0.75, and 1. The "strange" number 450,000 is simply the largest lookup table size that System OS/390 permitted me to feed into a format before the memory supply was shut off at about 65 MB, so the larger numbers of &NSMALL pertain to hashing only. The formatting approach was tested in the following form:

*** FORMATTED MATCH ***;

PROC SORT DATA=SMALL OUT=FMT NODUPKEY; BY SKEY; RUN; DATA FMT (DROP=SKEY); FMTNAME = 'KEYS'; TYPE = 'N'; DO UNTIL(DEND); SET FMT END=DEND; START = SKEY; LABEL = ' '; OUTPUT; END; LABEL = '1'; HLO = 'O'; OUTPUT; RUN; PROC FORMAT CNTLIN=FMT; RUN; DATA MATCH; SET LARGE; IF PUT(LKEY,KEYS.) = ' '; RUN;

The testing (done in batch) resulted in the following performance figures, with CPU time in seconds and memory usage in kilobytes:

*** FORMATTING ***

10,000 100,000 450,000 1,000,000 ----------- ------------ ------------- ------------ TIME MEM TIME MEM TIME MEM TIME MEM ------------------- ----- ------ ----- ------- ----- ------ ----- LOAD 1.16 4649 11.12 11321 50.65 38870 . . ------------------- ----- ------ ----- ------- ----- ------ ----- SEARCH 105.33 5166 136.61 16958 166.94 63579 . . ------------------- ------ ------- ------ RUN 106.49 147.73 217.59 . -----------------------------------------------------------------

*** COALESCING LIST HASHING ***

10,000 100,000 450,000 1,000,000 2,000,000 LOAD ----------- ----------- ----------- ---------- ----------- FACTOR STAGE TIME MEM TIME MEM TIME MEM TIME MEM TIME MEM ------------- ----------------------------------------------------------- 0.50 LOAD 0.10 3538 0.70 6346 3.11 17282 6.96 34474 14.23 65722 ------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- SEARCH 49.90 3676 53.31 6514 55.02 17450 55.63 34642 56.65 65890 ------ ----- ----- ----- ----- ----- RUN 50.00 54.01 58.13 62.59 70.88 ------ ------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- 0.75 LOAD 0.10 3434 0.77 5306 3.39 12494 7.55 24058 15.21 44890 ------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- SEARCH 55.82 3602 59.65 5474 63.64 12762 65.16 24226 66.76 45058 ------ ----- ----- ----- ----- ----- RUN 55.92 60.42 67.03 72.71 81.97 ------ ------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- 1.00 LOAD 0.11 3386 0.84 4786 3.83 10250 8.38 18018 16.96 34474 ------ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- SEARCH 67.55 3554 72.49 4954 78.63 10418 80.71 18850 81.67 34642 ------ ----- ----- ----- ----- ----- RUN 67.66 73.33 82.46 89.09 98.63 -------------------------------------------------------------------------

Based on these numbers, we can make a number of interesting conclusions:

*** The lower is the load factor, i.e. the sparser is the hash table, the faster is the hash search, but it consumes proportionally more memory. However, if the hash table is kept relatively sparse (with the load factor less than 0.7), it possesses a fascinating property: Practically, LOOKUP TIME DOES NOT DEPEND ON THE NUMBER OF KEYS IN THE LOOKUP TABLE. Indeed, for the load factor 0.5, the search times range from 49.90 to 56.65 for 10000 <= &NSMALL <= 2000000. In a separate test, I found out that all the difference is due to the additional time spent on the memory allocation for the bigger arrays. This property may explain the following facts.

*** Given the same amount of data to search and the same amount of memory to spare, hashing gets the job done 2.5 to 3.7 times faster than formatting. The larger is the set of lookup values, the greater the advantage.

Given the same period of time to complete the task, hashing is capable of searching a lookup table containing at least 300 times more values than formatting, the exact upper limit being determined by the available memory. In fact, with the load factor 0.5 (hash table is half full), hashing searches 2-million set of values 33 percent faster (70.88 CPU seconds) than formatting does mere 10 thousand (106.49).

Given the same amount of memory, even the emptiest (i.e. the fastest) hash table above (with load factor 0.5) can accommodate 4.4 times more entries than the fullest format. Conversely, the same amount of search data can be processed by hashing using significantly less memory than by formatting. Or, if the memory is at stake, one may decide to sacrifice a little bit of performance by setting the load factor to a higher value and still be able to create a hash table with the desired number if items.

*** Hashing uses virtually the same amount of memory in both load and lookup phases, since both stages reside in the same DATA step. Formatting consumes significantly more memory (11 to 63 percent) being on call than while being loaded, the gap increasing along with the number of search keys. So, even if the format has compiled successfully, it by no means guarantees that it will not run out of memory while being used for a subsequent search.

*** Formatting requires removing duplicate keys from a CNTLIN= dataset, which is normally done by sorting. Contrary to that, none of the files have to be sorted before hashing. Should a duplicate key occur, it is eliminated automatically in the process of loading the hash table.

Before the "kind regards", two questions remain.

1) What if there is an extra variable XVAR in the dataset SMALL that you would like to drag along in the matching process? First, the datasets SMALL and MATCH, being "small", could be simply sorted and merged after subsetting. Second, if the memory is plentiful, XVAR can also be added to the hash table by augmenting the latter with another temporary array of the proper type and length, say, HXTR(0:&HSIZE), and populating it with XVAR values right after the line inserting SKEY. For instance, if XVAR is a 4-character string, then: ..... ARRAY HXTR (0 : &HSIZE) $4 _TEMPORARY_; ..... HKEY (H) = SKEY; *** Insert key; HXTR (H) = XVAR; *** Insert extra variable; .....

In the searching portion of the code, the line

IF HKEY(H) = LKEY THEN FOUND = 1;

will, naturally, have to be replaced by

IF HKEY(H) = LKEY THEN DO; FOUND = 1; XVAR = HXTR (H); END;

2) What if SKEY and LKEY are not non-negative integers? If they are signed floating-point numbers, the answer is obvious: convert them to integers by rescaling and shifting as necessary, and plug that expression into the MOD function instead of the corresponding key. In SAS, it works fast. If the keys are strings, especially long ones, the situation is less attractive. However, the enormous performance advantage offered by hashing may make the efforts to convert the strings to integers worthwhile. Sometimes, it is simple. Don Stanley has mentioned that he has his keys in the form A83424652, where the first byte is A to Z; he converts the latter to an integer by using the RANK function. In principle, this operation can be applied to every non-digital byte of the key. If keys are less than 16 bytes long, it might be faster, though, to convert them to integers through the hexadecimal representation, that is, replace the key in the MOD function by the expression

INPUT(PUT(KEY,$HEX16.),HEX16.).

It should not overflow if the bytes in the string do not concentrate in the high digits of the 256 radix, and usually, they do not. Don's method is not fully protected from the overflow, either. However, overflows can be dealt with, anyway. A hint: All we actually need is not the whole number, but only the remainder modulo &HSIZE. Anyone?

Conclusion: A fifty or so lines of DATA step code seems like a pretty cheap price for being able to subset 10 million records by 2 million in about a minute, in all.

Happy Holiday, everyone!

Kind regards, Paul

++++++++++++++++++++++++++++++++ Paul M. Dorfman Citibank Universal Card Services Decision Support Systems Jacksonville, FL ++++++++++++++++++++++++++++++++


Back to: Top of message | Previous page | Main SAS-L page