Date: Thu, 28 Dec 2000 00:51:15 -0800
Reply-To: kmself@IX.NETCOM.COM
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: kmself@IX.NETCOM.COM
Subject: Re: sorting efficiency
In-Reply-To: <C75579A077CD4D115A160005B8ACC242@regivm.indya.com>; from
regivm@INDYA.COM on Thu, Dec 28, 2000 at 10:10:07AM +0530
Content-Type: multipart/signed; micalg=pgp-sha1;
protocol="application/pgp-signature";
on Thu, Dec 28, 2000 at 10:10:07AM +0530, regivm@INDYA.COM (regivm@INDYA.COM) wrote:
> > From: He-Man[SMTP:douglas_19@HOTMAIL.COM]
> > Sent: Wednesday, December 27, 2000 3:54 PM
> > Subject: sorting efficiencies
> >
> > Hi there,
> >
> > I am currently using some large longitudinal datasets of about 5.8 million
> > obs with about 52 variables. I find that sorting these datasets apprears to
> > take forever and I was wondering if there are any better ways to perform a
> > sort. The usual sort that I do is on cust_id and extr_dte as the sorting is
> > preparation for merging with other related records (usually partner and
> > child records) to create income unit record data.
> >
> > Each obs contains and ID (non unique- cust_id) and an extraction date
> > (extr_dte) and details such as income, family composition, social security
> > info, housing details etc. The first ID (say ID# 81) has about 110
> > occurances with fortnightly extraction dates from 1995 to 2000.
> >
> > The time for sorting is quite long and using this code I get
> >
> > 81 proc sort data=lds.sample;
> > 82 by cust_id;
> > 83 run;
> >
> > NOTE: There were 5858672 observations read from the dataset LDS.SAMPLE.
> > NOTE: The data set LDS.SAMPLE has 5858672 observations and 52 variables.
> > NOTE: PROCEDURE SORT used:
> > real time 41:42.74
> >
> > I have heard that one can use something like "sharebuffers" but I have no
> > idea how to use that in this type of sort. Would an index would help much?
> > The sorting variable has about 50,000 different values.
> > Hi SAS-Lers,
> Doug was sorting the dataset in preparation for merging with other
> datasets. Why not use Proc SQL for merging and avoid the need for
> sorting ?
You don't.
More precisely, you may or may not avoid a sort, but it's highly
unlikely in this case.
Paul Kent of SI has written several very worthwhile SUGI papers on SAS
SQL optimization. Briefly (and with a modicum of accuracy), the SQL
query optimizer will do one of the following in a table join. Not in
any particular order:
- Hold one table in memory, if sufficiently small, and join against
the larger table from disk.
- Hash the smaller table, given certain memory constraints, which are
ludicrously low given current hardware profiles, and join against
the larger table from disk. This performance is tunable. A
hand-written hash a'la Paul Dorfman is still more efficient, but
you've got to code it.
- Use an existing index to join tables.
- Sort one or both tables and perform what's effectively a data set
merge.
The behavior can be observed with one or more undocumented PROC SQL
options, I believe the high-level overview is given with the /method
option.
In Doug's case, SQL would almost certainly sort and merge the two tables
(datasets), for a roughly equivalent performance profile to what I can
only presume (he hasn't shown us any code) was a PROC SORT + Data step
merge.
In the case of 5m x 52 variables, hashing methods are likely out of the
question, but if there's a smaller table or fewer fields, it should be
possible to construct a 5m x 2 element hash table in about 72 MB, if I'm
doing my math right. I was squeezing 40m elements into roughly 5 GB on
a Sun Ultra4 a couple years back. Amazing what a little memory can buy
you in process time. Too bad about the NT graphics overhead, GNU/Linux
would eat that right up.
--
Karsten M. Self <kmself@ix.netcom.com> http://kmself.home.netcom.com/
Evangelist, Zelerate, Inc. http://www.zelerate.org
What part of "Gestalt" don't you understand? There is no K5 cabal
http://gestalt-system.sourceforge.net/ http://www.kuro5hin.org
[application/pgp-signature]