|
> From: David L. Cassell [mailto:Cassell.David@EPAMAIL.EPA.GOV]
>
> Paul Dorfman wrote in part:
> > 47 data hash_int (drop = _:) ;
> > 48 retain _key 0 ;
> > 49 declare associativearray hh (hashexp: 10) ;
> > 50 _rc = hh.defineKey ('_key') ;
> > 51 _rc = hh.defineData ('_key') ;
> > 52 _rc = hh.defineDone ( ) ;
> > 53
> > 54 do p = nobs to 1 by -1 ;
> > 55 set w point = p nobs = nobs ;
> > 56 _key = b ;
> > 57 if hh.find () ne 0 then output ;
> > 58 _key = a ;
> > 59 _rc = hh.add () ;
> > 60 end ;
> > 61 stop ;
> > 62 run ;
>
> Ooh! Ooh! Now I can't wait for the V9 presentation!
I understand.
> Seriously, how is the speed of the lookup on the associative
> arrays?
From what I have seen so far (and I have not seen too much yet), the speed
is about 50-70 per cent of what can be achieved with hand-coded hash for a
*numeric* key. Have not tried a long character key yet (not all at once,
plus work happens), but suspect that the speed of AA will be same of even
higher than that of hand-coding.
> Does it look like O(1)?
Yes.
> How much space do they allocate for each bucket?
A good question. To begin with, the above-inidcated relative speed is
achieved achieved at the level of memory usage almost 100 times lower than
that of the hand-coding hash - and that without satellites. For example, the
code above used only 134K working with 1,500,000 keys, as opposed to the
hand-coded 13M. As you realize, it can only happen if not all keys at once
are memory-resident.
Combining a line of information from a good birdie with the results I have
seen so far leads me to believe that AA operates in the following manner:
1) AA declaration establishes table structure taking into account the
combined key/data length and number of buckets (internal, depends on HASHEXP
specification)
2) Insertion:
Hash the key with the result pointing to a bucket organised as an AVL tree
Ascend the bucket to memory
Search the tree
If found, set RC>0 for AA.ADD and terminate
Insert the key
Rebalance the tree if needed
Descend the bucket
2) Search:
Hash the key with the result pointing to a bucket organised as an AVL tree
Ascend the bucket to memory
Search the tree
If not found, set RC>0 for AA.FIND and terminate
Descend the bucket
> Do you have to pre-guess the number of buckets,
Rather, you have to preguess the value of HASHEXP. AA will work with any,
but you cannot go wrong by specifying more - it does not cause memory usage
to grow beyond a certain threshold. For instance, in the code above,
changing HASHEXP from 10 to 20 did not change a thing, either in the speed
or memory department, while decreasing it to 5 cut both speed and memory
usage in half.
> or does it expand the associative array as you add elements?
I think that the number of buckets is fixed by HASHEXP, but they are allowed
to overflow (equivalent of having the load factor > 1).
> How well does it split the elements among buckets?
How can I tell directly? Indirectly, I can only run a test against a
severely non-uniform key distribution and see how well it performs.
> Inquiring minds want to know,
To what they already know, I should add that I do not like the formal term
'associative array'. Just
like in Perl, folks will call it simply hash. To me,
declare hash hh (hashexp: 10) ;
feels much more natural than
declare associativearray hh (hashexp: 10) ;
And it would be consistent with the keyword HASHEXP as well. Plus, I would
like to have a lever other than HASHEXP (or maybe add this functionality to
HASHEXP) that would allow a user to trade speed for memory. I reckon no one
would object if by specifying, say, HASHMEM=13M, one could search several
times faster than with the default 134K.
Kind regards,
===================
Paul M. Dorfman
Jacksonville, FL
===================
Blue Cross Blue Shield of Florida, Inc., and its subsidiary and
affiliate companies are not responsible for errors or omissions in this e-mail message. Any personal comments made in this e-mail do not reflect the views of Blue Cross Blue Shield of Florida, Inc.
|