LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (December 1999, week 3)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Tue, 21 Dec 1999 18:09:09 -0500
Reply-To:     "Dorfman, Paul" <pdorfma@CITICORP.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         "Dorfman, Paul" <pdorfma@CITICORP.COM>
Subject:      Re: Data Merging
Comments: cc: "abdu.elnagheeb@BANKOFAMERICA.COM"
          <abdu.elnagheeb@BANKOFAMERICA.COM>
Content-Type: text/plain; charset="iso-8859-1"

If you wish to reply privately, please write to <sashole@earthlink.net> -----------------------------------------------------------------------

Abdu Elnagheeb, in part, wrote:

>I have a data set (say A) which has only one variable (say X) sorted. The data set has about 150,000 obs. I >want to merge to another data set (say B) by X. Data B has about 8 millions obs. and 140 variables. I use >MVS SAS. The way I did it was: > >* Read the first 500,000 from B, sort, then merge to A (If in A and B) >* Read the next 500,000 from B, sort, then merge to A (If in A and B) >* Repeat step 2 till I read the 8 millions. > >I want to see if some people have a more efficient way to do so?

Abdu,

The reason you are resorting to sorting (no pun intended) the file B in chunks is, as I understand, the difficulty of sorting B with 8E6 records and 140 variables as a whole. The truth is, however, that you absolutely DO NOT HAVE TO SORT ANYTHING, even the file A, to accomplish the task. Efficiency has several facets, including, but not limited to, programming time, execution time, and memory footprint. The latter is quite important, as the lack of memory may render certain methods executing rapidly when the memory is abundant very slow or downright inoperable.

From the standpoint of programming time, nothing beats this:

PROC SQL BUFFERSIZE=%EVAL(150000*3) _METHOD; CREATE TABLE AB AS SELECT B.* FROM A, B WHERE A.X = B.X ; QUIT;

The value supplied to BUFFERSIZE= here is of paramount importance: It determines whether or not SQL optimizer will decide to choose sqljhsh execution method over the combination of sqxjm, sqxsort, and sqxsrc. By making the BUFFERSIZE value sufficiently large, you are telling SQL optimizer that it has enough memory to select sqljhsh, that is, store all the keys from A in an in-memory hash table and look the table up for every record read from B outputting matches only. This way, it need not sort either A or B. However, if you let the value of BUFFERSIZE drop below certain threshold, the optimizer will decide to sort the files and merge them in the background, with all the consequences you are facing when doing it explicitly. The factor 3 in the %EVAL() represents such a threshold which I have determined purely experimentally by running the query under OS/390, just like you do. Of course, you can simply use 450000 instead of the %EVAL expression (I used it solely for the convenience of adjusting the factor).

If SQL decides to hash, the method is quite good from the standpoint of execution time as well. In particular, it runs about 50% faster than a format compiled with all the keys from A will, has about 50% smaller memory footprint, and, unlike the format, can swallow X-duplicates in A without choking - not to mention the fact that the time required to compile a format comprising 150000 keys is about 80% of the matching time itself.

For the above reasons, I assert that using 'large formats' in your situation is not a good idea. I am sure you are going to receive many replies suggesting just the opposite, so you will probably have to implement the suggestions with your data and judge for yourself.

If you could put up with a bit more of a programming effort in exchange for ultimate speed, both SQL (even with sqljhsh) and formats fade in comparison with hand-coded hashing. Beginning about a year ago, I posted a number of fairly verbose descriptions of the method including ready-to-go code, etc., using an example precisely mirroring your situation. Subsequently, Karsten Self, whose eloquence and command of the English language far exceeds mine, has posted several articles also explaining the essence of the technique. Finally, owing to Ian Whitlock and Joe Kelley, I presented a more systematic overview of direct addressing methods, including key-indexing and hashing, as an Advanced Tutorial at SESUG'99, so it is available in the Conference Proceedings under the title 'Array Lookup Techniques, Part 2: From Key-indexing to Hashing'. If you have no access to Proceedings but would like to, please drop me a note at <sashole@earthlink.net>, and I will send you a copy in Word. (Those who attended SESUG had a unique chance to absorb a brilliant Ian Whitlock's interpretation of hashing in SAS given ad-hoc by the Master of the Universe.)

Lastly, keep in mind that the percentage figures given above were obtained without a heavy satellite tail in file B. Since yours is quite heavy (140 variables), the run-time differences between one or another method can be very well masked by inefficient I/O, for instance, if the data set B has a wimpy BUFSIZE (i.e. page size) and/or BUFNO is overly small. Everything a mortal can possibly learn about the way these options affect SAS performance can be found in the immortal 'Tuning SAS Applications in the MVS Environment' by Michael A. Raithel. Needless to say, since you work under OS/390, the book resides right on your desk at all times.

Have a happy holiday season!

Kind regards, ======================== Paul M. Dorfman CitiCorp Universal Card Jacksonville, FL ========================


Back to: Top of message | Previous page | Main SAS-L page