LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (December 2000, week 4)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Thu, 28 Dec 2000 00:51:15 -0800
Reply-To:     kmself@IX.NETCOM.COM
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         kmself@IX.NETCOM.COM
Subject:      Re: sorting efficiency
In-Reply-To:  <C75579A077CD4D115A160005B8ACC242@regivm.indya.com>; from
              regivm@INDYA.COM on Thu, Dec 28, 2000 at 10:10:07AM +0530
Content-Type: multipart/signed; micalg=pgp-sha1;
protocol="application/pgp-signature";

on Thu, Dec 28, 2000 at 10:10:07AM +0530, regivm@INDYA.COM (regivm@INDYA.COM) wrote:

> > From: He-Man[SMTP:douglas_19@HOTMAIL.COM] > > Sent: Wednesday, December 27, 2000 3:54 PM > > Subject: sorting efficiencies > > > > Hi there, > > > > I am currently using some large longitudinal datasets of about 5.8 million > > obs with about 52 variables. I find that sorting these datasets apprears to > > take forever and I was wondering if there are any better ways to perform a > > sort. The usual sort that I do is on cust_id and extr_dte as the sorting is > > preparation for merging with other related records (usually partner and > > child records) to create income unit record data. > > > > Each obs contains and ID (non unique- cust_id) and an extraction date > > (extr_dte) and details such as income, family composition, social security > > info, housing details etc. The first ID (say ID# 81) has about 110 > > occurances with fortnightly extraction dates from 1995 to 2000. > > > > The time for sorting is quite long and using this code I get > > > > 81 proc sort data=lds.sample; > > 82 by cust_id; > > 83 run; > > > > NOTE: There were 5858672 observations read from the dataset LDS.SAMPLE. > > NOTE: The data set LDS.SAMPLE has 5858672 observations and 52 variables. > > NOTE: PROCEDURE SORT used: > > real time 41:42.74 > > > > I have heard that one can use something like "sharebuffers" but I have no > > idea how to use that in this type of sort. Would an index would help much? > > The sorting variable has about 50,000 different values. > > Hi SAS-Lers,

> Doug was sorting the dataset in preparation for merging with other > datasets. Why not use Proc SQL for merging and avoid the need for > sorting ?

You don't.

More precisely, you may or may not avoid a sort, but it's highly unlikely in this case.

Paul Kent of SI has written several very worthwhile SUGI papers on SAS SQL optimization. Briefly (and with a modicum of accuracy), the SQL query optimizer will do one of the following in a table join. Not in any particular order:

- Hold one table in memory, if sufficiently small, and join against the larger table from disk.

- Hash the smaller table, given certain memory constraints, which are ludicrously low given current hardware profiles, and join against the larger table from disk. This performance is tunable. A hand-written hash a'la Paul Dorfman is still more efficient, but you've got to code it.

- Use an existing index to join tables.

- Sort one or both tables and perform what's effectively a data set merge.

The behavior can be observed with one or more undocumented PROC SQL options, I believe the high-level overview is given with the /method option.

In Doug's case, SQL would almost certainly sort and merge the two tables (datasets), for a roughly equivalent performance profile to what I can only presume (he hasn't shown us any code) was a PROC SORT + Data step merge.

In the case of 5m x 52 variables, hashing methods are likely out of the question, but if there's a smaller table or fewer fields, it should be possible to construct a 5m x 2 element hash table in about 72 MB, if I'm doing my math right. I was squeezing 40m elements into roughly 5 GB on a Sun Ultra4 a couple years back. Amazing what a little memory can buy you in process time. Too bad about the NT graphics overhead, GNU/Linux would eat that right up.

-- Karsten M. Self <kmself@ix.netcom.com> http://kmself.home.netcom.com/ Evangelist, Zelerate, Inc. http://www.zelerate.org What part of "Gestalt" don't you understand? There is no K5 cabal http://gestalt-system.sourceforge.net/ http://www.kuro5hin.org


[application/pgp-signature]


Back to: Top of message | Previous page | Main SAS-L page