Date: Fri, 29 Dec 2000 13:09:45 -0800
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
Subject: Re: sorting efficiency
On Thu, Dec 28, 2000 at 04:55:13PM -0000, Paul Dorfman (paul_dorfman@HOTMAIL.COM) wrote:
> From: kmself@IX.NETCOM.COM
> > In the case of 5m x 52 variables, hashing methods are likely out of
> > the question, but if there's a smaller table or fewer fields, it
> > should be possible to construct a 5m x 2 element hash table in about
> > 72 MB, if I'm doing my math right. I was squeezing 40m elements into
> > roughly 5 GB on a Sun Ultra4 a couple years back. Amazing what a
> > little memory can buy you in process time.
> You might recall that back circa 1998 we discussed the possibility of
> solving this problem by using a hash table comprising 2 parallel
> arrays to store just the keys and record pointers, so that later at
> the time of search (long) satellites could be retrieved by means of
> POINT= access. I tested the idea then and concluded that performance
> was unsatisfactory, perhaps due to the cross-pagination. Since then,
> I have returned to the idea and re-tested it. I do not know what I
> got wrong the first time around in 1998, but after the recent tests I
> was amazed at how efficiently it worked.
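Paul's scheme, as described above, keeps only the keys and record pointers in memory and defers retrieval of the long satellite variables to a direct read. A minimal sketch of that parallel-array idea, in Python rather than a SAS DATA step (table size, hash function, and key names here are arbitrary illustrative choices, not Paul's actual code):

```python
# Two parallel arrays: one for keys, one for record numbers.
# Satellite data stays on disk; a successful lookup returns only the
# record number, which would then drive a direct (POINT=-style) read.

HASH_SIZE = 11  # small prime for the demo; real tables are sized well above the key count

keys = [None] * HASH_SIZE   # parallel array 1: the keys
recnos = [0] * HASH_SIZE    # parallel array 2: the record pointers

def load(key, recno):
    """Insert a key/record-number pair, resolving collisions by linear probing."""
    h = hash(key) % HASH_SIZE
    while keys[h] is not None:
        h = (h + 1) % HASH_SIZE
    keys[h] = key
    recnos[h] = recno

def find(key):
    """Return the record number stored for key, or 0 if the key is absent."""
    h = hash(key) % HASH_SIZE
    while keys[h] is not None:
        if keys[h] == key:
            return recnos[h]
        h = (h + 1) % HASH_SIZE
    return 0

# Load a few hypothetical keys with their record numbers.
for n, k in enumerate(["A01", "B17", "C03"], start=1):
    load(k, n)
```

The point of the two-array layout is exactly the memory saving Paul describes: only key-plus-pointer pairs live in the table, so the footprint is independent of how wide the satellite records are.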
Not sure what you mean by "cross-pagination", but IIRC the issue was the
SAS read-ahead buffer, which was restricted to some pagesize on the
order of 4-8K, meaning that in a large dataset with a large number of
randomly-accessed reads, you were reading large chunks of data only to
discard them virtually immediately on reading another large chunk.
Read-ahead is great when you're doing long, sequential scans. It sucks
massively when you're doing hunt-and-peck, in which case you want to
*minimize* it.
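The cost of read-ahead under random access is easy to put numbers on. A back-of-the-envelope sketch, using purely illustrative assumptions (an 8 KB read-ahead page, a 100-byte record, one million random single-record reads), not measured SAS figures:

```python
# Each random POINT=-style read drags in a whole read-ahead page, but
# only one record from that page is wanted before the next seek
# discards it.

PAGE_BYTES = 8 * 1024   # assumed read-ahead page size (8 KB)
RECORD_BYTES = 100      # assumed record length
lookups = 1_000_000     # assumed number of random reads

bytes_read = lookups * PAGE_BYTES     # what the I/O layer actually moves
bytes_used = lookups * RECORD_BYTES   # what the program actually consumes
waste_ratio = bytes_read / bytes_used # ~82x more I/O than needed
```

Under these assumptions roughly 82 bytes are read for every byte used, which is why shrinking (or bypassing) the read-ahead buffer pays off so dramatically for hunt-and-peck access.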
Not sure what's changed; possibilities include SI optimizing the DATA
step (or a particular read buffer) for such access by identifying steps
in which POINT= is the primary (or only) data access method, increased
memory (hence: disk caching) on your system(s), or improved I/O drivers
at the system or SI level. I've noticed that several of our criticisms
of SAS performance in hash methods have more-or-less silently been
incorporated into v7/8 of SAS, including the ability to create arrays
with elements sized in chunks other than 8-byte offsets.
Haven't played with this at all in the past year or so, but would be
interested to know what's changed.
Karsten M. Self <email@example.com> http://kmself.home.netcom.com/
Evangelist, Zelerate, Inc. http://www.zelerate.org
What part of "Gestalt" don't you understand? There is no K5 cabal