Date: Tue, 21 Dec 1999 18:09:09 -0500
Reply-To: "Dorfman, Paul" <pdorfma@CITICORP.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Dorfman, Paul" <pdorfma@CITICORP.COM>
Subject: Re: Data Merging
Content-Type: text/plain; charset="iso-8859-1"
If you wish to reply privately, please write to <firstname.lastname@example.org>
Abdu Elnagheeb, in part, wrote:
>I have a data set (say A) which has only one variable (say X) sorted. The
data set has about 150,000 obs. I >want to merge to another data set (say
B) by X. Data B has about 8 millions obs. and 140 variables. I use >MVS
SAS. The way I did it was:
>* Read the first 500,000 from B, sort, then merge to A (If in A and B)
>* Read the next 500,000 from B, sort, then merge to A (If in A and B)
>* Repeat step 2 till I read the 8 millions.
>I want to see if some people have a more efficient way to do so?
The reason you are resorting to sorting (no pun intended) the file B in
chunks is, as I understand, the difficulty of sorting B with 8E6 records and
140 variables as a whole. The truth is, however, that you absolutely DO NOT
HAVE TO SORT ANYTHING, even the file A, to accomplish the task. Efficiency
has several facets, including, but not limited to, programming time,
execution time, and memory footprint. The latter is quite important, as the
lack of memory may render certain methods executing rapidly when the memory
is abundant very slow or downright inoperable.
From the standpoint of programming time, nothing beats this:
PROC SQL BUFFERSIZE=%EVAL(150000*3) _METHOD;
CREATE TABLE AB AS
FROM A, B
WHERE A.X = B.X
The value supplied to BUFFERSIZE= here is of paramount importance: It
determines whether or not SQL optimizer will decide to choose sqljhsh
execution method over the combination of sqxjm, sqxsort, and sqxsrc. By
making the BUFFERSIZE value sufficiently large, you are telling SQL
optimizer that it has enough memory to select sqljhsh, that is, store all
the keys from A in an in-memory hash table and look the table up for every
record read from B outputting matches only. This way, it need not sort
either A or B. However, if you let the value of BUFFERSIZE drop below
certain threshold, the optimizer will decide to sort the files and merge
them in the background, with all the consequences you are facing when doing
it explicitly. The factor 3 in the %EVAL() represents such a threshold which
I have determined purely experimentally by running the query under OS/390,
just like you do. Of course, you can simply use 450000 instead of the %EVAL
expression (I used it solely for the convenience of adjusting the factor).
If SQL decides to hash, the method is quite good from the standpoint of
execution time as well. In particular, it runs about 50% faster than a
format compiled with all the keys from A will, has about 50% smaller memory
footprint, and, unlike the format, can swallow X-duplicates in A without
choking - not to mention the fact that the time required to compile a format
comprising 150000 keys is about 80% of the matching time itself.
For the above reasons, I assert that using 'large formats' in your situation
is not a good idea. I am sure you are going to receive many replies
suggesting just the opposite, so you will probably have to implement the
suggestions with your data and judge for yourself.
If you could put up with a bit more of a programming effort in exchange for
ultimate speed, both SQL (even with sqljhsh) and formats fade in comparison
with hand-coded hashing. Beginning about a year ago, I posted a number of
fairly verbose descriptions of the method including ready-to-go code, etc.,
using an example precisely mirroring your situation. Subsequently, Karsten
Self, whose eloquence and command of the English language far exceeds mine,
has posted several articles also explaining the essence of the technique.
Finally, owing to Ian Whitlock and Joe Kelley, I presented a more systematic
overview of direct addressing methods, including key-indexing and hashing,
as an Advanced Tutorial at SESUG'99, so it is available in the Conference
Proceedings under the title 'Array Lookup Techniques, Part 2: From
Key-indexing to Hashing'. If you have no access to Proceedings but would
like to, please drop me a note at <email@example.com>, and I will send
you a copy in Word. (Those who attended SESUG had a unique chance to absorb
a brilliant Ian Whitlock's interpretation of hashing in SAS given ad-hoc by
the Master of the Universe.)
Lastly, keep in mind that the percentage figures given above were obtained
without a heavy satellite tail in file B. Since yours is quite heavy (140
variables), the run-time differences between one or another method can be
very well masked by inefficient I/O, for instance, if the data set B has a
wimpy BUFSIZE (i.e. page size) and/or BUFNO is overly small. Everything a
mortal can possibly learn about the way these options affect SAS performance
can be found in the immortal 'Tuning SAS Applications in the MVS
Environment' by Michael A. Raithel. Needless to say, since you work under
OS/390, the book resides right on your desk at all times.
Have a happy holiday season!
Paul M. Dorfman
CitiCorp Universal Card