Zhonghe Li <zli@HSPH.HARVARD.EDU> wrote:
> I am sort a 3.7 GB dataset on a computer with 16 free GB hard drive, and 1
> GB of RAM. So i run the tagsort. It has been 6 hours already.
>
> Can any one tell me how long it may take?
Ooh. That's not good.
Since Paul has tossed my name out there, I'll go ahead and stick my
nose in. First off, TAGSORT may *not* be all that helpful. The
TAGSORT option is really good when you have a really 'wide' data set
(lots of variables and/or some really long strings) and a fairly
'narrow' key or set of keys you're sorting on. If you are sorting
on keys that take up most of the 'width' of the data set, then TAGSORT
may take a lot longer than doing an ordinary sort.
Second, your free space is more than 4 times the size of the data set.
So you should be able to do a straightforward sort without TAGSORT.
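
To make the comparison concrete, here is a minimal sketch of the two forms of the sort. The libref MYLIB, the data set name BIG, and the key variable ID are made up for illustration; substitute your own, and assume the LIBNAME has already been assigned.

```sas
/* Hypothetical names: MYLIB.BIG is the large data set, ID is the sort key. */

/* Ordinary sort: whole observations pass through the sort work files. */
proc sort data=mylib.big out=mylib.big_sorted;
  by id;
run;

/* TAGSORT: only the BY variables and observation numbers are sorted;   */
/* the full observations are then retrieved in the sorted order.        */
/* This saves temporary space but usually costs extra processing time.  */
proc sort data=mylib.big out=mylib.big_sorted2 tagsort;
  by id;
run;
```

Run one or the other, not both. With the free space described, the ordinary form is probably the one to try first.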
Third, how long does it take to do a read through the data set?
If you end up generating a lot of network traffic, the time needed
when working with the data set may be enormous, no matter what you do.
Network traffic and/or disk I/O are often painful bottlenecks when
working with large data sets. Try to avoid both.
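
One rough way to time a plain read pass is a DATA _NULL_ step with FULLSTIMER turned on. This is only a sketch; MYLIB.BIG is a made-up name standing in for your data set.

```sas
/* Hypothetical name: MYLIB.BIG stands in for the 3.7 GB data set. */
options fullstimer;   /* write detailed real-time and CPU-time notes to the log */

data _null_;          /* _NULL_ reads every observation but writes no output */
  set mylib.big;
run;
```

If the real time in the log is far larger than the CPU time, the step is probably waiting on disk or network I/O rather than on computation.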
Fourth, you should probably try to re-design your process so you
don't *need* to do sorting (or you only need to sort once). Indexing
can save you lots of work when you are going to be pulling out pieces
of the data, and/or you need lots of different re-orderings of the
data. I once had a data set and a statistical algorithm that looked
like it would need 13 consecutive sort-step-and-data-step pieces.
Intensive examination of the underlying process ultimately gave us
a different approach, requiring one DATA step and then a single
indexing step (using PROC DATASETS). It took plenty of time to work out
the new algorithmic structure, but it was more than worth it in the
long run.
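
As an illustration of the indexing idea (not the actual code from that project), here is a minimal sketch. The names MYLIB, BIG, and ID are hypothetical, and ID is assumed to be numeric.

```sas
/* Hypothetical names: MYLIB.BIG is the large data set, ID a numeric key. */

/* Build a simple index on ID in place, without rewriting the data set. */
proc datasets library=mylib nolist;
  modify big;
  index create id;
quit;

/* A WHERE clause on the indexed variable can pull out a piece of the */
/* data without sorting it or reading every observation.              */
data work.subset;
  set mylib.big;
  where id = 12345;   /* made-up key value */
run;

/* BY-group processing can also use the index when the data set is    */
/* not physically sorted by ID.                                       */
data work.first_per_id;
  set mylib.big;
  by id;
  if first.id;        /* keep the first observation of each ID group */
run;
```

Reading an entire large data set in index order can be slower than reading a sorted copy, so the index pays off mostly when you pull subsets or need several different orderings.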
HTH,
David
--
David Cassell, CSC
Cassell.David@epa.gov
Senior computing specialist
mathematical statistician