|
On mainframes I was used to avoid sorting with SAS "native", but using a
external sort utility (DFSORT, SYNCSORT, ...). SAS supports and suggests
this.
I think there are several utilities for UNIX and Win also. Maybe it is
worth to try some of them and sort big files externally?
If it's possible it is always best to avoid sorts. I don't mean replacing
it with an index! A index is useful if you have a rather static dataset
where you frequently select records from, preferably only a few of a big
amount. Selecting them all, eg. as input for a PROC or DATA-step might be
not a good idea. And for each update you have to rebuild the index.
What I mean to avoid sorts is, if a big dataset IS once sorted, it must
not be sorted again if any adding of records keeps the sort order by using
BY ...;
If you are adding some records (or also delete some) you are told that
it's very fast to use PROC APPEND. Ok, true, but the sort order is
destroyed. Using a DATA-step with BY is slower, but you don't need a PROC
SORT to get it into the right order. That might be faster all in all.
If you are deleting records, why not storing a "active" flag and set it to
0 instead of deleting? That lets you easily select the not-deleted records
and you have the chance to reactivate them if you want. "Deleting" a
record simply means to UPDATE in a DATA-step with another dataset which
contains the record-keys to delete. That might be faster than physically
deleting records by DATA ..., or PROC SORT. And you could define a VIEW
with "WHERE active=1" to make that transparent for users.
I agree, that is not possible very often. Most of the times you need it
sorted in many different orders, so it is no big win to have a "major"
sort. In that case, try some external (optimized) sort tools.
On mainframes I had good results with that.
Gerhard
On Wed, 11 Feb 2009 10:11:21 -0800, Nordlund, Dan (DSHS/RDA)
<NordlDJ@DSHS.WA.GOV> wrote:
>> -----Original Message-----
>> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On
>> Behalf Of Joe Matise
>> Sent: Wednesday, February 11, 2009 8:54 AM
>> To: SAS-L@LISTSERV.UGA.EDU
>> Subject: Re: How to improve sorting efficency
>>
>> First of all ... sort it locally.
>> proc sort data=lib.main out=work.main;
>> ...
>> run;
>>
>> Sorting without the local OUT statement generates the sort
>> file (equal in
>> size to the original file) on the network drive - meaning
>> slow performance.
>> I've found that sorting it locally and then just copying it
>> back tends to be
>> faster (counterintuitively), but 99% of the time when I do a
>> sort of that
>> nature, I want it for local purposes anyway.
>>
>> Also consider using an index - if it's something you can
>> easily index by, it
>> will be faster than outputting the entire dataset again. Of
>> course, that
>> doesn't usually work with id numbers, unless people occur
>> many times in the
>> dataset, but it's something to keep in mind.
>>
>> -Joe
>>
>> On Wed, Feb 11, 2009 at 9:32 AM, Yu Zhang <zhangyu05@gmail.com> wrote:
>>
>> > All,
>> >
>> > I know Sorting questions have been asking very frequently
>> in the past. also
>> > I have looked at some old posts related to my question.
>> However, after I
>> > tried different things in my program, the one sorting step
>> still take more
>> > than 50 min to finish.
>> >
>> > i have a SAS dataset resides on a networking drive. it has
>> 170,000 rows and
>> > 600+ variables. we originally use proc sort:
>> >
>> > proc sort lib.main (keep=id_num,q1: q5:) out=main ;
>> > by id_num;
>> > run;
>> >
>> > 15 variables were kept and sort by Id_num. this step will
>> run at least 54
>> > mins.
>> >
>> > I modified the program using HASH object and hope to get it
>> done quicker.
>> > but It still needs 50 mins.
>> >
>> > data _null_;
>> > if 0 set LIB.main;
>> > if_n_=1 then do;
>> > dcl hash _M(ordered:"Yes");
>> > rc=_m.definekey("id_num","_n_");
>> > rc=_m.definedate("id_num","q1_1"............,"q5_5");
>> > rc=_m.definedone();
>> > end;
>> > do until(end);
>> > set lib.main (keep=id_num,q1: q5:) end=end;
>> > _n_+1;
>> > _m.add();
>> > end;
>> > rc=_m.output(dataset:"main");
>> > run;
>> >
>> > i am guessing our slow network traffic is the culprit. My
>> computer is
>> > pretty
>> > new and has 200G disk space.
>> >
>> > can anyone improve this?
>> >
>> > Thank you for your suggestions.
>> >
>> > Yu
>> >
>>
>
>How is the data going to be used after the sort? If you will be reading
through the whole file in sort order (not just going after small portions
of it), I wouldn't rely on indexing alone. My experience is that reading
an indexed (but not sorted) file, in index order, is a nightmare when it
comes to performance. You might be able to tune performance by changing
the size and/or number of buffers, but if you will need to read through
the file in index/sort order at least once in a while, I would sort it and
maybe index it as well.
>
>Maybe Michael Raithel will chime in here if he has time. I probably
should add his book on indexes to the buy list at work. :-)
>
>Dan
>
>Daniel J. Nordlund
>Washington State Department of Social and Health Services
>Planning, Performance, and Accountability
>Research and Data Analysis Division
>Olympia, WA 98504-5204
|