Date: Wed, 20 Dec 2000 21:47:32 -0000 sashole@bellsouth.net "SAS(r) Discussion" Paul Dorfman Re: Q: How to efficiently look up and classify permutations? To: koen.vyverman@FID-INTL.COM text/plain; format=flowed

"Vyverman, Koen" <koen.vyverman@FID-INTL.COM> wrote, in part:

>Paul opened his bag of magic hashing tricks, and offered a piece of >code that makes a start at solving part 2 of the problem by listing >for each WORD (observation) the other WORDs having the same >rank-sorted >value, and stuffing these into extra variables. At >least, that's what I >understand the routine should do. I'm puzzled >however by the fact that the >input word-list (A, in Paul's code) >does not correspond to the output >word-list (the WORD varble on >FOUND). In fact, only a fraction of the >original set of words shows >up again in FOUND. Paul, if this is correct, >then I need to admit >that I don't understand the meaning of your set of >F-variables, and >how they relate to WORD in the output data set ...

Koen,

You understand the meaning of F-vars properly. The trouble is due to 2 errors: (1) my failure to erase the previous f-list before the next one is populated (talk about common errors in DP!) (2) using w(i) (wrong!) instead of the w(x(i,j)) (right!) for the root WORD:

do j=1 to hbound(h); *scroll thru hash array; do i=1 to &np; f(i) = ''; end; *CLEAN UP!!!; if x(0,j) then do i=1 to x(0,j); *occupied nodes only; n = 0; word = w(x(i,j)); *root sibling - NOT w(i)!!!; do k=1 to x(0,j); if i=k then continue; *skip the root; n ++ 1; f(n) = w(x(k,j)); *get next sibling; end; output; end; end;

In the version I posted yesterday, the CLEAN UP!!! line was missing, and of course w(i) produced garbage :-(. In order to avoid dirty patches, the entire correct version is given after the sig line.

Now as far as the 'magic' of hashing goes, there is none, it is just a fast way to store and search things. Here is a bit of write up on how this actually works in this case. Suppose the input comprises just 17 records, with a 5-byte WORD each:

data a; informat word \$5.; input word @@; cards; XYZWV AEBCD \\\\\ ECABD WVZXY KLOMN ZYXVW NMOLK ZVWYX PTQSR ONLKM ----- TQSRP CADBE VYWZX JIHGF BACDE ; run;

If I did not know the data ahead of the time, I would select HS as the first prime > 17*2, i.e. 37, but since I know that the number of unique *sorted* words is just 7, I will choose HS=17 to make the following more concise.

Since WL=5, I select the hash substring length (and the width of the corresponding PIB informat) at 3. It means that to hash a (sorted) word, I grab 3 consecutive bytes from its middle (starting at position HBEG=2 in this case).

Why middle? It deserves a good explanation. Of course, in this demo-5-byte-WORD situation, I could hash the entire thing using input(word,pib5.). If I hashed *unsorted* 51-byte WORDs, I would simply clip a HL-byte chunk from the beginning and convert it to a large number as input(word, pib&hl..). (The informat width hl=6 is guaranteed to work on all machines. Under NT 8 works as well. OS/390 accepts 7, and HP-UNIX only 6. At longer widths, the MOD function spits out garbage - a remainder greater than the divisor). However, with *sorted* words the situation is radically different. Because of the ordering, it is extremely likely that the first HL bytes will be the same, hashing the majority of keys to the same node, producing an unacceptable number of collisions, and thus effectively rendering the hashing process serial-search degenerate. Surely the best way out would be to pick HL characters from anywhere in WORD randomly, string them together, and PIB the result. It can be proven that such *universal* hash function results in the lowest number of collisions *regardless* of the input. However, in SAS, substringing and concatenation are slow. By selecting HL *consecutive* bytes from the middle, I pay a little of the hash function's uniformity to buy a faster hash function. And it is programmatically simpler to boot.

In other words, for the purposes of the demo, the parms are selected as follows:

%let wl = 5; * word length; %let nw = 17; * num of words in file; %let np = 5; * max num of permutes for a word; %let hs = 17; * hash table size; %let hl = 3; * hash substr length;

After loading, the array W represents an exact image of the input:

w(01) XYZWV w(02) AEBCD w(03) \\\\\ w(04) ECABD w(05) WVZXY w(06) KLOMN w(07) ZYXVW w(08) NMOLK w(09) ZVWYX w(10) PTQSR w(11) ONLKM w(12) ----- w(13) TQSRP w(14) CADBE w(15) VYWZX w(16) JIHGF w(17) BACDE

At the same time as W is loaded, each word is sorted by the link subroutine, hashed and stored in the hash table H. Each time a duplicate *sorted* entry is found it obviously comes from a sibling word. The counter in the respective j-dimension of the array X(0,j) gets incremented, and the number poining to the word in the array W is thus stored in the next occurrence of the respective j-dimension of X. (Of course, the words themselves could be stored, too, and then the array W would not even be needed. With WL=5, it would be even more economical. But in the real world, with WL=51, storing say 10 200003-element numeric, i.e., 8-byte pointer rows in memory is better than doing the same with 51 actual data bytes per element, even at the expense of the array W.) Here is how the loaded H and parallel X tables actually look like (actual word values from W ar given next to the pointers into W in parentheses):

X(0,j) <------------ X(1-5, j) ----------------> H(01)= . . . . . . H(02)= . . . . . . H(03)= . . . . . . H(04)= . . . . . . H(05)=\\\\\ 1 3(\\\\\) H(06)= . . . . . . H(07)= . . . . . . H(08)= . . . . . . H(09)=PQRST 2 10(PTQSR) 13(TQSRP) . . . H(10)=VWXYZ 5 1(XYZWV) 5(WVZXY) 7(ZYXVW) 9(ZVWYX) 15(VYWZX) H(11)=KLMNO 3 6(KLOMN) 8(NMOLK) 11(ONLKM) H(12)= . . . . . . H(13)=FGHIJ 1 16(JIHGF) . . . . H(14)= . . . . . . H(15)=ABCDE 4 2(AEBCD) 4(ECABD) 14(CADBE) 17(BACDE) H(16)= . . . . . . H(17)=----- 1 12(-----)

Note that a missing counter value X(0,j) corresponds to the absence of a sorted key in the hash table H, which is why I use X(0,j) to check for an empty node in the hashing algorithm

do j=1+mod(input(substr(word,hbeg),pib&hl..),&hs) by +1 until (x(0,j)=. or h(j)=word); if j > &hs then j = 1; end; if x(0,j) = . then h(j) = word;

I use x(0,j)=. rather than h(j) = '' here and in the rest of the program because a numeric comparison is much faster than a 51-byte non-blank character comparison; plus, this way, blanks in the input (should any happen) need not be specially treated.

The table being ready, we only need to recycle the word lists in the non-empty nodes. Let us, for example, look at

X(0,15) X(1,15) X(2,15) X(3,15) X(4,15) --------------------------------------------------------- H(15)=ABCDE 4 2(AEBCD) 4(ECABD) 14(CADBE) 17(BACDE)

By the nature of the algorithm, *all* these words (corresponding to the sorted "common denominator" ABCDE) occur in the separate records (2, 4, 14, and 17 respectively) of the input file (check the array W to see it plainly). So, for AEBCD from the 2nd input record, we need to assign it as the actual value of WORD (thus making it the recycle root), and scroll through the rest of the siblings sticking them into the occurrences of F(n) one by one. Then ECABD becomes the root WORD, and the array is scanned again to pick up the rest. The condition

if i = k then continue;

ensures that the root itself, already assigned to WORD, is skipped. As a result, we have exactly as many output records as we had them in the input, only the F-siblings end up attached to the root words:

WORD F1 F2 F3 F4 F5 =============================================== \\\\\ PTQSR TQSRP TQSRP PTQSR XYZWV WVZXY ZYXVW ZVWYX VYWZX WVZXY XYZWV ZYXVW ZVWYX VYWZX ZYXVW XYZWV WVZXY ZVWYX VYWZX ZVWYX XYZWV WVZXY ZYXVW VYWZX VYWZX XYZWV WVZXY ZYXVW ZVWYX KLOMN NMOLK ONLKM NMOLK KLOMN ONLKM ONLKM KLOMN NMOLK JIHGF AEBCD ECABD CADBE BACDE ECABD AEBCD CADBE BACDE CADBE AEBCD ECABD BACDE BACDE AEBCD ECABD CADBE ----- ==============================================

Because of the hashing nature, the records do NOT repeat their input sequence but the sibling records end up being grouped together.

Kind regards, ======================== Paul M. Dorfman Jacksonville, Fl ========================

%let lcase = %lowcase(abcdefghijklmnopqrstuvwxyz); %let ucase = %upcase (abcdefghijklmnopqrstuvwxyz); %let chset = &lcase.&ucase.%str (%'/-\); %let lch = %length(&chset);

%let wl = 51; * word length; %let nw = 100000; * num of words in file; %let np = 10; * max num of permutes for a word;

%let hs = 200003; * hash table size; %let hl = 6; * hash substr length;

* test file; data a (keep=word); length word \$&wl lc rc \$1; do until (n = &nw); do i=1 to &wl; substr(word,i,1) = substr("&chset", ceil(ranuni(1)*&lch),1); end; do j=1 to ceil(ranuni(1) * &np) until (n = &nw); l = ceil(ranuni(1) * &wl); r = ceil(ranuni(1) * &wl); lc = substr(word,l); rc = substr(word,r); substr(word,l,1) = rc; substr(word,r,1) = lc; output; n ++ 1; end; end; run;

data found (keep=word f:); array w ( &nw) \$ &wl _temporary_; array h ( &hs) \$ &wl _temporary_; array x (0:&np, &hs) _temporary_; * x(0,j)=j-count; array f ( &np) \$ &wl;

do n=1 by +1 until (eof); set a end=eof; w(n) = word; * store word in array W;

link wordsort; * sort word in place;

do j=1+mod(input(substr(word,hbeg),pib&hl..),&hs) by +1 until (x(0,j)=. or h(j)=word); * hash; if j > &hs then j = 1; end; if x(0,j) = . then h(j) = word; * store word as key;

x(0,j) ++ 1; * enumerate next entry in list; x(x(0,j),j) = n; * store pointer into W in list; end;

do j=1 to hbound(h); * scroll thru hash array; do i=1 to &np; f(i) = ''; end; *CLEAN UP!!!; if x(0,j) then do i=1 to x(0,j); * occupied nodes only; n = 0; word = w(x(i,j)); * root sibling; do k=1 to x(0,j); if i=k then continue; * skip the root; n ++ 1; f(n) = w(x(k,j)); * get next sibling from W-array; end; output; end; end; stop;

init_rtn: * compute rank end-points; array z (0:255) _temporary_; lr = 255; hr = 0; do j=1 to &lch; lr = min (lr, rank(substr("&chset",j))); hr = max (hr, rank(substr("&chset",j))); end; * hash substr will begin at; hbeg = 1 + floor(&wl-&hl)/2; return;

wordsort: * sort word by key-indexing; do j=1 to &wl; z(rank(substr(word,j))) ++ 1; end; i = 0; do j=lr to hr; do while (z(j)); i ++ 1; z(j) +- 1; substr(word,i,1) = byte(j); end; end; return; run;