LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (October 2009, week 3)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:   Tue, 20 Oct 2009 13:13:42 -0400
Reply-To:   Paul Dorfman <sashole@BELLSOUTH.NET>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   Paul Dorfman <sashole@BELLSOUTH.NET>
Subject:   Re: can has hash manys?
Comments:   To: gzuckier@SNAIL-MAIL.NET

Gerald,

At the underlying software level, the situation is addressed in 9.2 allowing for multi-hashes, with corresponding methods for harvesting data attached to the same key. In earlier versions, one of which you are running, here are manual workarounds.

Every one of them is based first on creating a unique composite key by adding a discriminating enumerator to the main hash keys. The difference is how the duplicate key entries are populated and harvested. One approach I had developed for my talk at SF SUGI is to create an extra hash table with all main hash keys, each containing its number of occurrences as its data. Muthia has shown you in this thread how it is done.

A simpler approach (or more complex, depending on how to look at it) is to eschew the second hash table in favor of incrementing the discriminator key up by a unity until no match is found. Here is an example, but some sample data first:

data large ; do id = 1 to 11 ; output ; end ; run ;

data small ; input id data ; cards ; 1 1.1 1 1.2 3 3.1 3 3.2 3 3.3 5 5.1 7 7.1 7 7.2 11 11.1 ; run ;

Now if in the output you need to only keep the keys found in the small table (an equivalent of an equi-join), consider:

data combine ; if _n_ = 1 then do ; dcl hash h(ordered: "a") ; h.definekey ("id", "_n_") ; h.definedata ("data") ; h.definedone () ; do until (z) ; set small end = z ; do _n_ = 1 by 1 until (h.check() ne 0) ; end ; h.add() ; end ; end ;

set large ;

do _n_ = 1 by 1 while (h.find() = 0) ; output ; end ; run ;

Note how the empty loop construct

do _n_ = 1 by 1 until (h.check() ne 0) ; end ; h.add() ;

is used to populate the hash table and add the discriminator at the same time and then how the construct

do _n_ = 1 by 1 while (h.find() = 0) ; output ; end ;

is used for harvesting the data corresponding to each occurrence of a given matching main key in H.

If you need the keep all the keys from LARGE with missing values for DATA for the keys having no match in SMALL (an equivalent of a LARGE left join on SMALL), then the harvesting logic gets a bit more convoluted:

data combine ; if _n_ = 1 then do ; dcl hash h(ordered: "a") ; h.definekey ("id", "_n_") ; h.definedata ("data") ; h.definedone () ; do until (z) ; set small end = z ; do _n_ = 1 by 1 until (h.check() ne 0) ; end ; h.add() ; end ; end ;

set large ;

call missing (data) ; do _n_ = 1 by 1 ; if h.find() ne 0 then do ; if _n_ = 1 then output ; leave ; end ; output ; end ; run ;

Of course, if you have more than one data variable in SMALL, you will have to include all those variables in call missing routine's argument list. If the list is quite long, it is somewhat inconvenient and amounts to a good deal of hard coding. In such cases, I usually clone a 1-row data set with all null values after SMALL, such as:

data small_null ; output ; stop ; set small (drop = ID) ; run ;

and then instead of CALL MISSING, simply code:

p = 1 ; set small_null point = p ;

which auto-nulls all host data variables coming from SMALL.

The extra hash table approach is apparently more efficient in the hash load cycle, especially if the hash is very large and contains lots of duplicate keys. The trade-off, especially in the latter case, is extra memory usage.

And of course, as Sigurd has already duly noted, for an equi-join using SQL with enough buffer size, say buffersize=1000000, is less fuzzy and will likely work faster via the invocation of the inner SQL hash method sqxjhsh.

Kind regards ------------ Paul Dorfman Jax, FL ------------

On Sun, 18 Oct 2009 18:39:58 -0700, z <gzuckier@SNAIL-MAIL.NET> wrote:

>I'm new to hash functions, so maybe I'm missing something. >I have a big matching: a table of 90 million transactions and a table >of 1,000 members. Each member in the little table has at least one >transaction; likely, more than one. The task, of course, is to pull >out all transactions related to the thousand members, matching on >member_id. I'm testing it just pulling the transaction date. Simple >enough, but the trouble is that it runs over 24 hours on our system >without reaching an endpoint, even using an index on the big table. > >Now the fun: using a hash runs in like 10 minutes; but it only >retrieves precisely one transaction date for each of the members, and >there have to be members with more than one transaction. > >Anyway, here's the code, slightly streamlined, if anybody can tell me >where i'm going wrong: > > >Data Results(Drop=Rc); >/* Define the variables to be used in the hash table declaration */ >format >Mbr_id $17. >event_date ddmmyy8.; > >/* Build the hash table */ >If _N_ = 1 Then >Do; >Declare Hash MbrLkup(HashExp:16,Dataset:'transactions'); >MbrLkup.DefineKey('Mbr_id'); >MbrLkup.DefineData('event_date'); >MbrLkup.DefineDone(); >Call Missing(Mbr_Id,event_date); >End; >Set testguys2009; >rc=mbrlkup.find(key:mbr_id); > if rc=0 then output; >run; > > >for output, i get a nice table of one thousand mbr_ids with one >event_date per member, when I'm expecting more than one. > >this is sas 9.1.3 in AIX, by the way. > >TIA.


Back to: Top of message | Previous page | Main SAS-L page