LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous (more recent) messageNext (less recent) messagePrevious (more recent) in topicNext (less recent) in topicPrevious (more recent) by same authorNext (less recent) by same authorPrevious page (December 2000, week 5)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:   Fri, 29 Dec 2000 20:16:22 -0000
Reply-To:   sashole@bellsouth.net
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   Paul Dorfman <paul_dorfman@HOTMAIL.COM>
Subject:   Re: sorting efficiency
Comments:   To: Dan.Lewis@barclaysglobal.com
Content-Type:   multipart/mixed;

Dan,

This is a topic good enough for a book...which is a long story, but to make it short, here is the gist. "True" sortless file-matching methods ("untrue" ones still sort behind the scenes) are usually based on the following principle. You grab one of the files (usually the smaller one) and store its keys in some kind of look-up table. Such a table can be organized in a number of different ways. In particular, it can be an in-memory hash table.

From the standpoint of the DATA step, a hash table is simply a pretty sparse temporary array (i.e. having more slots than the number of distinct keys inserted into it). One of special hashing algorithms is used to insert the keys in the table. A very similar algorithm can then be used to search the table given a search key. Hashing algorithms are basically very simple and can be easily coded in the DATA step (I have implemented at least four of them; take a look at the attached paper if you are interested). Both hash insertion and search are very fast; suffices it is say that they vastly out-perform SAS formats.

If the file being hashed contains non-key variables (otherwise termed "satellite information"), they can be also stored in memory - in arrays parallel to the "key" hash table array. At the search time, if the search key is found, the satellite information can be retrieved from the corresponding array cell(s).

The look-up medium having been created, you can read the other file and for each record, search for its key in the hash table. If the key is found, you retrieve the satellites (if any), write the stuff out, or else do whatever your processing needs dictate.

In the nutshell, that is it. Now to the central point of the post. Dragging satellite(s) through the memory comes at the expense of rather steep increase in memory usage - the more severe, the longer the satellites are. That is why it seemed lucrative from the very beginning (harking precisely to this very time in 1998, when all the hashing "monstrosity", as poster to SAS-L christened it, broke loose) to handle satellites, especially long ones, in a shrewder fashion. Namely, instead of storing all of them in memory, one could use a single array parallel to the key hash table to store the observation numbers as satellites. At the search time, if the key (for instance, coming from the other file) has been found, we retrieve the satellite into a variable P and unleash POINT=P to fetch the long satellites from disk. Of course, the concept here is nothing more than old good indexing,only instead of storing the index in a disk-based B-tree, it is stored in a memory-resident hash table.

Using POINT= in such a manner raises the question of x-paging, since the observations with the keys that have been found are retrieved in no particular order with respect to the record number. For instance, in the well-known binary search performed against a disk-resident SAS data file, this presents a real danger. In principle, the phenomenon is capable of hindering performance badly enough to render a method based on the POINT= access practically inoperative. That is why, when I had tested it with the hash indexing for the first time and achieved rather dismal performance, I decided that x-paging was a non-conquerable culprit and wrote an e-mail to that effect to Karsten Self. (Karsten was the first to grasp the potential of hashing in SAS, and at the time we had been being on a month-long e-mail hashing spree. The latter culminated in the message I received from Karsten at work 1/2 hour before the end of 1998. Karsten had just hashed out - literally - some mainframe file match problem and entitled the message "Hash rocks, Dude!" - the motto I have been using in all my "hash" presentations as the final slide ever since. Thanks, Dude!) It also seemed to be in sync with my theoretical understanding of POINT='s pros and cons. However, it has turned out in subsequent, much more recent, tests, that originally I somehow got it wrong, and in fact, POINT= works just fine for the purpose. Yes, the observations are fetched out of order, but, unlike in the case of binary (or even interpolation) search, each record is read once.

This is, basically, what I meant by the POINT= note addressed to Karsten. Above, I have mentioned the paper whose MSW version of the paper (already somewhat "improved" since SESUG'00 - making me recall my Mom's wisdom about the better being the worst enemy of good) for everyone who has nothing better to do but delve into the "deviant" SAS code coupled with a feeble attempt to explain how it might work for those who, by some bizarre twist of fortune, might actually somehow need it.

Happy New Year!

Kind regards, ================== Paul M. Dorfman Jacksonville Fl ==================

>From: "Lewis, Dan BGI SF" <Dan.Lewis@barclaysglobal.com> >To: "'paul_dorfman@HOTMAIL.COM'" <paul_dorfman@HOTMAIL.COM> >Subject: Re: sorting efficiency >Date: Thu, 28 Dec 2000 14:07:52 -0800 > >I, for one,would much appreciate any material you could supply that would >de-mystify this and other searching/sorting techniques. > >Thanks Much, >Dan Lewis > > >> Of course, I do not have to tell you how to >implement > >> the scheme, but if someone thinks it is not self-evident and would like >an > >> explanation, sample code, etc., I will gladly toss it in. > > >From: kmself@IX.NETCOM.COM > > >In the case of 5m x 52 variables, hashing methods are likely out of >the > >question, but if there's a smaller table or fewer fields, it >should be > >possible to construct a 5m x 2 element hash table in about >72 MB, if I'm > >doing my math right. I was squeezing 40m elements >into roughly 5 GB on >a > >Sun Ultra4 a couple years back. Amazing what >a little memory can buy >you > >in process time. > >Karsten, > >You might recall that back circa 1998 we discussed the possibility of >solving this problem by using a hash table comprising 2 parallel arrays to >store just the keys and record pointers, so that later at the time of >search >(long) satellites could be retrieved by means of POINT= access. I tested >the >idea then and concluded that performance was unsatisfactory, perhaps due to >the cross-pagination. Since, I have returned to the idea and re-tested it. >I >do not know what I got wrong the first time around in 1998, but after the >recent subsequent tests I was amazed how efficiently it worked. > >The idea is hardly new and basically boils down to creating an explicit >hash >index instead of implicit B-tree index, resulting in huge gains in both >time >and RAM efficiency. Of course, I do not have to tell you how to implement >the scheme, but if someone thinks it is not self-evident and would like an >explanation, sample code, etc., I will gladly toss it in. > >Happy New Year! > >Kind regards, >=================== >Paul M. Dorfman >Jacksonville, Fl >=================== >_

_________________________________________________________________ Get your FREE download of MSN Explorer at http://explorer.msn.com


Table Lookup By Direct Addressing.doc [application/msword]


Back to: Top of message | Previous page | Main SAS-L page