LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (February 2009, week 2)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:   Wed, 11 Feb 2009 13:50:13 -0500
Reply-To:   Gerhard Hellriegel <gerhard.hellriegel@T-ONLINE.DE>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   Gerhard Hellriegel <gerhard.hellriegel@T-ONLINE.DE>
Subject:   Re: How to improve sorting efficency

On mainframes I was used to avoid sorting with SAS "native", but using a external sort utility (DFSORT, SYNCSORT, ...). SAS supports and suggests this. I think there are several utilities for UNIX and Win also. Maybe it is worth to try some of them and sort big files externally?

If it's possible it is always best to avoid sorts. I don't mean replacing it with an index! A index is useful if you have a rather static dataset where you frequently select records from, preferably only a few of a big amount. Selecting them all, eg. as input for a PROC or DATA-step might be not a good idea. And for each update you have to rebuild the index. What I mean to avoid sorts is, if a big dataset IS once sorted, it must not be sorted again if any adding of records keeps the sort order by using BY ...;

If you are adding some records (or also delete some) you are told that it's very fast to use PROC APPEND. Ok, true, but the sort order is destroyed. Using a DATA-step with BY is slower, but you don't need a PROC SORT to get it into the right order. That might be faster all in all. If you are deleting records, why not storing a "active" flag and set it to 0 instead of deleting? That lets you easily select the not-deleted records and you have the chance to reactivate them if you want. "Deleting" a record simply means to UPDATE in a DATA-step with another dataset which contains the record-keys to delete. That might be faster than physically deleting records by DATA ..., or PROC SORT. And you could define a VIEW with "WHERE active=1" to make that transparent for users.

I agree, that is not possible very often. Most of the times you need it sorted in many different orders, so it is no big win to have a "major" sort. In that case, try some external (optimized) sort tools. On mainframes I had good results with that. Gerhard

On Wed, 11 Feb 2009 10:11:21 -0800, Nordlund, Dan (DSHS/RDA) <NordlDJ@DSHS.WA.GOV> wrote:

>> -----Original Message----- >> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On >> Behalf Of Joe Matise >> Sent: Wednesday, February 11, 2009 8:54 AM >> To: SAS-L@LISTSERV.UGA.EDU >> Subject: Re: How to improve sorting efficency >> >> First of all ... sort it locally. >> proc sort data=lib.main out=work.main; >> ... >> run; >> >> Sorting without the local OUT statement generates the sort >> file (equal in >> size to the original file) on the network drive - meaning >> slow performance. >> I've found that sorting it locally and then just copying it >> back tends to be >> faster (counterintuitively), but 99% of the time when I do a >> sort of that >> nature, I want it for local purposes anyway. >> >> Also consider using an index - if it's something you can >> easily index by, it >> will be faster than outputting the entire dataset again. Of >> course, that >> doesn't usually work with id numbers, unless people occur >> many times in the >> dataset, but it's something to keep in mind. >> >> -Joe >> >> On Wed, Feb 11, 2009 at 9:32 AM, Yu Zhang <zhangyu05@gmail.com> wrote: >> >> > All, >> > >> > I know Sorting questions have been asking very frequently >> in the past. also >> > I have looked at some old posts related to my question. >> However, after I >> > tried different things in my program, the one sorting step >> still take more >> > than 50 min to finish. >> > >> > i have a SAS dataset resides on a networking drive. it has >> 170,000 rows and >> > 600+ variables. we originally use proc sort: >> > >> > proc sort lib.main (keep=id_num,q1: q5:) out=main ; >> > by id_num; >> > run; >> > >> > 15 variables were kept and sort by Id_num. this step will >> run at least 54 >> > mins. >> > >> > I modified the program using HASH object and hope to get it >> done quicker. >> > but It still needs 50 mins. >> > >> > data _null_; >> > if 0 set LIB.main; >> > if_n_=1 then do; >> > dcl hash _M(ordered:"Yes"); >> > rc=_m.definekey("id_num","_n_"); >> > rc=_m.definedate("id_num","q1_1"............,"q5_5"); >> > rc=_m.definedone(); >> > end; >> > do until(end); >> > set lib.main (keep=id_num,q1: q5:) end=end; >> > _n_+1; >> > _m.add(); >> > end; >> > rc=_m.output(dataset:"main"); >> > run; >> > >> > i am guessing our slow network traffic is the culprit. My >> computer is >> > pretty >> > new and has 200G disk space. >> > >> > can anyone improve this? >> > >> > Thank you for your suggestions. >> > >> > Yu >> > >> > >How is the data going to be used after the sort? If you will be reading through the whole file in sort order (not just going after small portions of it), I wouldn't rely on indexing alone. My experience is that reading an indexed (but not sorted) file, in index order, is a nightmare when it comes to performance. You might be able to tune performance by changing the size and/or number of buffers, but if you will need to read through the file in index/sort order at least once in a while, I would sort it and maybe index it as well. > >Maybe Michael Raithel will chime in here if he has time. I probably should add his book on indexes to the buy list at work. :-) > >Dan > >Daniel J. Nordlund >Washington State Department of Social and Health Services >Planning, Performance, and Accountability >Research and Data Analysis Division >Olympia, WA 98504-5204


Back to: Top of message | Previous page | Main SAS-L page