| Date: | Tue, 20 Oct 2009 13:13:42 -0400 |
| Reply-To: | Paul Dorfman <sashole@BELLSOUTH.NET> |
| Sender: | "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU> |
| From: | Paul Dorfman <sashole@BELLSOUTH.NET> |
| Subject: | Re: can has hash manys? |
|
Gerald,
At the underlying software level, the situation is addressed in 9.2 allowing
for multi-hashes, with corresponding methods for harvesting data attached to
the same key. In earlier versions, one of which you are running, here are
manual workarounds.
Every one of them is based first on creating a unique composite key by
adding a discriminating enumerator to the main hash keys. The difference is
how the duplicate key entries are populated and harvested. One approach I
had developed for my talk at SF SUGI is to create an extra hash table with
all main hash keys, each containing its number of occurrences as its data.
Muthia has shown you in this thread how it is done.
A simpler approach (or more complex, depending on how to look at it) is to
eschew the second hash table in favor of incrementing the discriminator key
up by a unity until no match is found. Here is an example, but some sample
data first:
data large ;
do id = 1 to 11 ;
output ;
end ;
run ;
data small ;
input id data ;
cards ;
1 1.1
1 1.2
3 3.1
3 3.2
3 3.3
5 5.1
7 7.1
7 7.2
11 11.1
;
run ;
Now if in the output you need to only keep the keys found in the small table
(an equivalent of an equi-join), consider:
data combine ;
if _n_ = 1 then do ;
dcl hash h(ordered: "a") ;
h.definekey ("id", "_n_") ;
h.definedata ("data") ;
h.definedone () ;
do until (z) ;
set small end = z ;
do _n_ = 1 by 1 until (h.check() ne 0) ;
end ;
h.add() ;
end ;
end ;
set large ;
do _n_ = 1 by 1 while (h.find() = 0) ;
output ;
end ;
run ;
Note how the empty loop construct
do _n_ = 1 by 1 until (h.check() ne 0) ;
end ;
h.add() ;
is used to populate the hash table and add the discriminator at the same
time and then how the construct
do _n_ = 1 by 1 while (h.find() = 0) ;
output ;
end ;
is used for harvesting the data corresponding to each occurrence of a given
matching main key in H.
If you need the keep all the keys from LARGE with missing values for DATA
for the keys having no match in SMALL (an equivalent of a LARGE left join on
SMALL), then the harvesting logic gets a bit more convoluted:
data combine ;
if _n_ = 1 then do ;
dcl hash h(ordered: "a") ;
h.definekey ("id", "_n_") ;
h.definedata ("data") ;
h.definedone () ;
do until (z) ;
set small end = z ;
do _n_ = 1 by 1 until (h.check() ne 0) ;
end ;
h.add() ;
end ;
end ;
set large ;
call missing (data) ;
do _n_ = 1 by 1 ;
if h.find() ne 0 then do ;
if _n_ = 1 then output ;
leave ;
end ;
output ;
end ;
run ;
Of course, if you have more than one data variable in SMALL, you will have
to include all those variables in call missing routine's argument list. If
the list is quite long, it is somewhat inconvenient and amounts to a good
deal of hard coding. In such cases, I usually clone a 1-row data set with
all null values after SMALL, such as:
data small_null ;
output ;
stop ;
set small (drop = ID) ;
run ;
and then instead of CALL MISSING, simply code:
p = 1 ; set small_null point = p ;
which auto-nulls all host data variables coming from SMALL.
The extra hash table approach is apparently more efficient in the hash load
cycle, especially if the hash is very large and contains lots of duplicate
keys. The trade-off, especially in the latter case, is extra memory usage.
And of course, as Sigurd has already duly noted, for an equi-join using SQL
with enough buffer size, say buffersize=1000000, is less fuzzy and will
likely work faster via the invocation of the inner SQL hash method sqxjhsh.
Kind regards
------------
Paul Dorfman
Jax, FL
------------
On Sun, 18 Oct 2009 18:39:58 -0700, z <gzuckier@SNAIL-MAIL.NET> wrote:
>I'm new to hash functions, so maybe I'm missing something.
>I have a big matching: a table of 90 million transactions and a table
>of 1,000 members. Each member in the little table has at least one
>transaction; likely, more than one. The task, of course, is to pull
>out all transactions related to the thousand members, matching on
>member_id. I'm testing it just pulling the transaction date. Simple
>enough, but the trouble is that it runs over 24 hours on our system
>without reaching an endpoint, even using an index on the big table.
>
>Now the fun: using a hash runs in like 10 minutes; but it only
>retrieves precisely one transaction date for each of the members, and
>there have to be members with more than one transaction.
>
>Anyway, here's the code, slightly streamlined, if anybody can tell me
>where i'm going wrong:
>
>
>Data Results(Drop=Rc);
>/* Define the variables to be used in the hash table declaration */
>format
>Mbr_id $17.
>event_date ddmmyy8.;
>
>/* Build the hash table */
>If _N_ = 1 Then
>Do;
>Declare Hash MbrLkup(HashExp:16,Dataset:'transactions');
>MbrLkup.DefineKey('Mbr_id');
>MbrLkup.DefineData('event_date');
>MbrLkup.DefineDone();
>Call Missing(Mbr_Id,event_date);
>End;
>Set testguys2009;
>rc=mbrlkup.find(key:mbr_id);
> if rc=0 then output;
>run;
>
>
>for output, i get a nice table of one thousand mbr_ids with one
>event_date per member, when I'm expecting more than one.
>
>this is sas 9.1.3 in AIX, by the way.
>
>TIA.
|