LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous (more recent) messageNext (less recent) messagePrevious (more recent) in topicNext (less recent) in topicPrevious (more recent) by same authorNext (less recent) by same authorPrevious page (April 2002, week 2)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:   Tue, 9 Apr 2002 17:02:01 -0400
Reply-To:   "Dorfman, Paul" <Paul.Dorfman@BCBSFL.COM>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   "Dorfman, Paul" <Paul.Dorfman@BCBSFL.COM>
Subject:   Re: Key-indexing, hash, and MORE (was: a interesting question).
Comments:   To: "Cassell.David@EPAMAIL.EPA.GOV" <Cassell.David@EPAMAIL.EPA.GOV>
Content-Type:   text/plain; charset=iso-8859-1

> From: David L. Cassell [mailto:Cassell.David@EPAMAIL.EPA.GOV] > > Paul Dorfman wrote in part: > > 47 data hash_int (drop = _:) ; > > 48 retain _key 0 ; > > 49 declare associativearray hh (hashexp: 10) ; > > 50 _rc = hh.defineKey ('_key') ; > > 51 _rc = hh.defineData ('_key') ; > > 52 _rc = hh.defineDone ( ) ; > > 53 > > 54 do p = nobs to 1 by -1 ; > > 55 set w point = p nobs = nobs ; > > 56 _key = b ; > > 57 if hh.find () ne 0 then output ; > > 58 _key = a ; > > 59 _rc = hh.add () ; > > 60 end ; > > 61 stop ; > > 62 run ; > > Ooh! Ooh! Now I can't wait for the V9 presentation!

I understand.

> Seriously, how is the speed of the lookup on the associative > arrays?

From what I have seen so far (and I have not seen too much yet), the speed is about 50-70 per cent of what can be achieved with hand-coded hash for a *numeric* key. Have not tried a long character key yet (not all at once, plus work happens), but suspect that the speed of AA will be same of even higher than that of hand-coding.

> Does it look like O(1)?

Yes.

> How much space do they allocate for each bucket?

A good question. To begin with, the above-inidcated relative speed is achieved achieved at the level of memory usage almost 100 times lower than that of the hand-coding hash - and that without satellites. For example, the code above used only 134K working with 1,500,000 keys, as opposed to the hand-coded 13M. As you realize, it can only happen if not all keys at once are memory-resident. Combining a line of information from a good birdie with the results I have seen so far leads me to believe that AA operates in the following manner:

1) AA declaration establishes table structure taking into account the combined key/data length and number of buckets (internal, depends on HASHEXP specification)

2) Insertion: Hash the key with the result pointing to a bucket organised as an AVL tree Ascend the bucket to memory Search the tree If found, set RC>0 for AA.ADD and terminate Insert the key Rebalance the tree if needed Descend the bucket

2) Search: Hash the key with the result pointing to a bucket organised as an AVL tree Ascend the bucket to memory Search the tree If not found, set RC>0 for AA.FIND and terminate Descend the bucket

> Do you have to pre-guess the number of buckets,

Rather, you have to preguess the value of HASHEXP. AA will work with any, but you cannot go wrong by specifying more - it does not cause memory usage to grow beyond a certain threshold. For instance, in the code above, changing HASHEXP from 10 to 20 did not change a thing, either in the speed or memory department, while decreasing it to 5 cut both speed and memory usage in half.

> or does it expand the associative array as you add elements?

I think that the number of buckets is fixed by HASHEXP, but they are allowed to overflow (equivalent of having the load factor > 1).

> How well does it split the elements among buckets?

How can I tell directly? Indirectly, I can only run a test against a severely non-uniform key distribution and see how well it performs.

> Inquiring minds want to know,

To what they already know, I should add that I do not like the formal term 'associative array'. Just like in Perl, folks will call it simply hash. To me,

declare hash hh (hashexp: 10) ;

feels much more natural than

declare associativearray hh (hashexp: 10) ;

And it would be consistent with the keyword HASHEXP as well. Plus, I would like to have a lever other than HASHEXP (or maybe add this functionality to HASHEXP) that would allow a user to trade speed for memory. I reckon no one would object if by specifying, say, HASHMEM=13M, one could search several times faster than with the default 134K.

Kind regards, =================== Paul M. Dorfman Jacksonville, FL ===================

Blue Cross Blue Shield of Florida, Inc., and its subsidiary and affiliate companies are not responsible for errors or omissions in this e-mail message. Any personal comments made in this e-mail do not reflect the views of Blue Cross Blue Shield of Florida, Inc.


Back to: Top of message | Previous page | Main SAS-L page