Date: Fri, 29 Dec 2000 20:16:22 -0000
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Paul Dorfman <paul_dorfman@HOTMAIL.COM>
Subject: Re: sorting efficiency
This is a topic good enough for a book...which is a long story, but to make
it short, here is the gist. "True" sortless file-matching methods ("untrue"
ones still sort behind the scenes) are usually based on the following
principle. You grab one of the files (usually the smaller one) and store its
keys in some kind of look-up table. Such a table can be organized in a
number of different ways. In particular, it can be an in-memory hash table.
From the standpoint of the DATA step, a hash table is simply a pretty sparse
temporary array (i.e. having more slots than the number of distinct keys
inserted into it). One of several special hashing algorithms is used to insert
the keys into the table. A very similar algorithm can then be used to search the
table given a search key. Hashing algorithms are basically very simple and
can be easily coded in the DATA step (I have implemented at least four of
them; take a look at the attached paper if you are interested). Both hash
insertion and search are very fast; suffice it to say that they vastly
out-perform SAS formats.
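The attached paper gives the actual DATA step implementations; as a language-neutral sketch of one such algorithm (division hashing with linear probing over a sparse array; all names and numbers here are invented for illustration), the insert and search routines might look like this:

```python
# Minimal open-addressing hash table: division hashing + linear probing.
# Illustrative sketch only; the table must stay sparser than the key count,
# or the probe loop below could cycle forever.

HSIZE = 17                  # table size, comfortably larger than the key count
table = [None] * HSIZE      # the "sparse temporary array"

def hash_insert(key):
    """Insert key at its home slot, probing forward on collision."""
    h = key % HSIZE
    while table[h] is not None:
        if table[h] == key:
            return                      # key already present
        h = (h + 1) % HSIZE             # linear probe to the next slot
    table[h] = key

def hash_search(key):
    """Return True if key is in the table, retracing the same probe path."""
    h = key % HSIZE
    while table[h] is not None:
        if table[h] == key:
            return True
        h = (h + 1) % HSIZE
    return False                        # hit an empty slot: key is absent

for k in (101, 205, 33, 518):
    hash_insert(k)
```

Note that the search retraces exactly the probe sequence of the insertion, which is why hitting an empty slot proves the key was never inserted.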
If the file being hashed contains non-key variables (otherwise termed
"satellite information"), they can also be stored in memory - in arrays
parallel to the "key" hash table array. At search time, if the search
key is found, the satellite information can be retrieved from the
corresponding array cell(s).
The look-up medium having been created, you can read the other file and for
each record, search for its key in the hash table. If the key is found, you
retrieve the satellites (if any), write the stuff out, or else do whatever
your processing needs dictate.
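The whole sortless match can thus be sketched end to end (again a hedged illustration, not the paper's DATA step code; the two file contents and all names are invented):

```python
# Sortless file match: load the smaller file's keys (plus satellites) into a
# hash table with a parallel satellite array, then probe it once per record
# of the larger file. Hypothetical data stand in for two SAS data sets.

HSIZE = 31
keys = [None] * HSIZE       # key hash table
sats = [None] * HSIZE       # parallel array holding satellite information

small_file = [(12, "A"), (47, "B"), (9, "C")]       # (key, satellite)
big_file   = [(47, 3.1), (8, 2.7), (9, 1.4), (50, 0.9)]

def slot(key, insert):
    """Probe for key; return its slot, or the first free slot when inserting."""
    h = key % HSIZE
    while keys[h] is not None and keys[h] != key:
        h = (h + 1) % HSIZE             # linear probe on collision
    return h if (insert or keys[h] == key) else None

for key, sat in small_file:             # build the look-up medium
    h = slot(key, insert=True)
    keys[h], sats[h] = key, sat

matches = []
for key, value in big_file:             # probe once per record
    h = slot(key, insert=False)
    if h is not None:                   # key found: retrieve the satellite
        matches.append((key, sats[h], value))
```

Only the smaller file's keys and satellites ever occupy memory; the larger file streams through record by record, which is the point of the scheme.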
In a nutshell, that is it. Now to the central point of the post. Dragging
satellite(s) through memory comes at the expense of a rather steep
increase in memory usage - the more severe, the longer the satellites are.
That is why it seemed attractive from the very beginning (harking back
precisely to this very time in 1998, when all the hashing "monstrosity", as a
poster to SAS-L christened it, broke loose) to handle satellites, especially
long ones, in a shrewder fashion. Namely, instead of storing all of them in
memory, one could use a single array parallel to the key hash table to store
the observation numbers as satellites. At search time, if the key (for
instance, coming from the other file) has been found, we retrieve the
satellite into a variable P and unleash POINT=P to fetch the long satellites
from disk. Of course, the concept here is nothing more than good old
indexing, only instead of storing the index in a disk-based B-tree, it is
stored in a memory-resident hash table.
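A sketch of that hash-index idea (in SAS the fetch would be SET ... POINT=P against the disk-resident data set; here a plain list plays the disk file, and every name is hypothetical):

```python
# Hash table as an explicit index: keep only keys and record numbers in
# memory, and fetch the long satellites from "disk" on demand by record
# number - the role POINT= plays in the DATA step.

HSIZE = 31
keys   = [None] * HSIZE
recnos = [None] * HSIZE     # observation numbers play the satellite role

# Stand-in for a disk-resident file with long satellite records
disk_file = [(12, "long satellite 1"),
             (47, "long satellite 2"),
             (9,  "long satellite 3")]

# Indexing pass: hash each key with its 1-based observation number P
for p, (key, _) in enumerate(disk_file, start=1):
    h = key % HSIZE
    while keys[h] is not None:
        h = (h + 1) % HSIZE             # linear probe on collision
    keys[h], recnos[h] = key, p

def fetch(key):
    """If key is indexed, retrieve P and 'POINT=' into the file for its satellite."""
    h = key % HSIZE
    while keys[h] is not None:
        if keys[h] == key:
            p = recnos[h]
            return disk_file[p - 1][1]  # random access by record number
        h = (h + 1) % HSIZE
    return None                         # key not indexed
```

Memory now holds two small parallel arrays regardless of how long the satellites are; the price is one random disk access per successful search, which is where the x-paging question below comes in.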
Using POINT= in such a manner raises the question of x-paging, since the
observations with the keys that have been found are retrieved in no
particular order with respect to the record number. For instance, in the
well-known binary search performed against a disk-resident SAS data file,
this presents a real danger. In principle, the phenomenon is capable of
hindering performance badly enough to render a method based on the POINT=
access practically inoperative. That is why, when I tested hash indexing
for the first time and achieved rather dismal performance, I decided that
x-paging was an unconquerable culprit and wrote an e-mail to that effect to
Karsten Self. (Karsten was the first to grasp the potential of hashing in
SAS, and at the time we had been on a month-long e-mail
hashing spree. The latter culminated in the message I received from Karsten
at work 1/2 hour before the end of 1998. Karsten had just hashed out -
literally - some mainframe file match problem and entitled the message "Hash
rocks, Dude!" - the motto I have been using in all my "hash" presentations
as the final slide ever since. Thanks, Dude!) It also seemed to be in sync
with my theoretical understanding of POINT='s pros and cons. However, it
turned out in subsequent, much more recent tests that I had originally gotten
something wrong: in fact, POINT= works just fine for the purpose. Yes, the
observations are fetched out of order, but, unlike in the case of binary (or
even interpolation) search, each record is read once.
This is, basically, what I meant by the POINT= note addressed to Karsten.
Above, I mentioned a paper; attached is its MS Word version (already
somewhat "improved" since SESUG'00 - making me recall my Mom's wisdom about
the better being the worst enemy of the good), for everyone who has nothing
better to do than delve into the "deviant" SAS code, coupled with a feeble
attempt to explain how it might work for those who, by some bizarre twist of
fortune, might actually need it.
Happy New Year!
Paul M. Dorfman
>From: "Lewis, Dan BGI SF" <Dan.Lewis@barclaysglobal.com>
>To: "'paul_dorfman@HOTMAIL.COM'" <paul_dorfman@HOTMAIL.COM>
>Subject: Re: sorting efficiency
>Date: Thu, 28 Dec 2000 14:07:52 -0800
>I, for one, would much appreciate any material you could supply that would
>de-mystify this and other searching/sorting techniques.
> >> Of course, I do not have to tell you how to implement
> >> the scheme, but if someone thinks it is not self-evident and would like
> >> an explanation, sample code, etc., I will gladly toss it in.
> >In the case of 5m x 52 variables, hashing methods are likely out of the
> >question, but if there's a smaller table or fewer fields, it should be
> >possible to construct a 5m x 2 element hash table in about 72 MB, if I'm
> >doing my math right. I was squeezing 40m elements into roughly 5 GB on a
> >Sun Ultra4 a couple years back. Amazing what a little memory can buy
> >in process time.
>You might recall that back circa 1998 we discussed the possibility of
>solving this problem by using a hash table comprising 2 parallel arrays to
>store just the keys and record pointers, so that later the (long)
>satellites could be retrieved by means of POINT= access. I tested the
>idea then and concluded that performance was unsatisfactory, perhaps due to
>the cross-pagination. Since then, I have returned to the idea and re-tested
>it. I do not know what I got wrong the first time around in 1998, but after
>the recent tests I was amazed at how efficiently it worked.
>The idea is hardly new and basically boils down to creating an explicit
>index instead of an implicit B-tree index, resulting in huge gains in both
>speed and RAM efficiency. Of course, I do not have to tell you how to implement
>the scheme, but if someone thinks it is not self-evident and would like an
>explanation, sample code, etc., I will gladly toss it in.
>Happy New Year!
>Paul M. Dorfman
Table Lookup By Direct Addressing.doc [application/msword]