Jeff Morison <jmt_mtf@YAHOO.COM> wrote:
> I have a dataset containing 16 million observations
> of account numbers plus 10 other variables, I need to
> sort it by acct# and remove the duplicate account#
> What is fastest way to do this?.
> PROC SORT with NODUPKEY is draining all my resources.
> Any other efficient ways to do this.
 16 million records isn't all that large, by today's standards,
but it could easily be swamping your available memory. SAS
likes to have roughly 3-4 times the size of the data set free
on your hard drive for doing that sort, so you could indeed be
choking your machine. So the *fastest* way to do this is to move
the data set to a faster machine with more RAM and bigger, faster
hard drives, and run the program there. Call that option #1.
 Option #1 never happens. No one ever tells me, "Oh, by the
way, David, I'll be out of my office for a few weeks, so feel free
to use my hand-crafted 2000-node Beowulf cluster while I'm gone."
So my next option would be, in your case, to see if adding the option
TAGSORT to the PROC SORT statement does the trick. It is likely
to slow down the sort, but it will use a lot less memory. My SUGI 26
paper has details on this:

Paper 121-26: "A Sort of a Mess - Sorting Large Datasets on Multiple Keys",
David L. Cassell, http://www2.sas.com/proceedings/sugi26/p121-26.pdf
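A minimal sketch of the TAGSORT approach, assuming your data set is called MYDATA and the account-number variable is ACCT (both names are placeholders):

```sas
/* TAGSORT sorts only the BY-variable values plus observation tags,
   so the sort workspace shrinks dramatically; NODUPKEY then keeps
   the first observation for each ACCT value. */
proc sort data=mydata out=deduped nodupkey tagsort;
   by acct;
run;
```

The trade-off is I/O: TAGSORT re-reads the original data set to retrieve the non-key variables, which is why it tends to run slower while using far less work space.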
 My paper also has alternatives, with some advice on when to use them,
including use of indexing (through PROC DATASETS, if you want) and
other techniques. You begin to give up programmer efficiency for
other efficiencies as you move to these more complex methods.
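To make the indexing route concrete, here is a hedged sketch (data set and variable names are placeholders, not from the original post):

```sas
/* Build a simple index on the account number. */
proc datasets library=work nolist;
   modify mydata;
   index create acct;
quit;

/* With the index in place, BY-group processing works without a
   physical sort; keep only the first observation per ACCT. */
data deduped;
   set mydata;
   by acct;
   if first.acct then output;
run;
```

SAS uses the index to satisfy the BY statement, so you never pay for a full sort, though index creation and indexed retrieval carry their own I/O costs.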
 Rather than sorting the file at all, if the records really are duplicates,
you could use some of the hashing techniques of Paul Dorfman to maintain
an associative array of already-found account numbers and ditch anything
already in your hash. Paul has written extensively on this, both on SAS-L
and in SUGI papers. This would save a *lot* of time, since you could do
this as a one-pass solution (assuming the records really are true duplicates
and can be chucked or passed into a different data set without any sorting).
Call that option #4.
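A simplified cousin of Paul's hashing is key-indexing, which applies when the account numbers are integers in a known, modest range. A sketch under those assumptions (all names are placeholders; real account numbers may need a true hash with collision handling):

```sas
/* Direct addressing: one flag per possible account number.
   Assumes ACCT is an integer in the range 0-999999. */
data deduped;
   array seen {0:999999} _temporary_;
   set mydata;
   if seen{acct} ne 1 then do;
      seen{acct} = 1;   /* mark this account as found */
      output;           /* keep first occurrence only  */
   end;
run;
```

This is one pass over the data with O(1) lookups, which is where the big savings over a 16-million-record sort come from.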
 If option #4 sounds appealing and you have SAS 9, you can use the
data step hashes now available (thanks to lots of whinging for hashes in the
data step by people such as Paul) to do option #4 without coding the hashing
by hand. This would also get you out of any constraint of maintaining
the entire hash in RAM, as well as avoiding the fun of testing your code to
make sure it really does load and search an associative array properly.
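The SAS 9 hash object version of the one-pass dedup looks roughly like this (data set and variable names are placeholders):

```sas
/* One-pass dedup with the SAS 9 data step hash object. */
data deduped;
   if _n_ = 1 then do;
      declare hash h(hashexp: 16);  /* larger hashexp = more buckets */
      h.defineKey('acct');
      h.defineDone();
   end;
   set mydata;
   if h.check() ne 0 then do;  /* acct not seen yet */
      rc = h.add();
      output;
   end;
   drop rc;
run;
```

CHECK returns 0 when the key is already in the hash, so only the first observation for each ACCT makes it to the output data set.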
David Cassell, CSC
Senior computing specialist