LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (November 2006, week 3)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Tue, 14 Nov 2006 22:36:11 -0800
Reply-To:     David L Cassell <davidlcassell@MSN.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         David L Cassell <davidlcassell@MSN.COM>
Subject:      Re: Sorting a huge huge dataset
In-Reply-To:  <1163505947.349264.219960@k70g2000cwa.googlegroups.com>
Content-Type: text/plain; format=flowed

ckalisetty@GMAIL.COM wrote back: > >Thanks for all the wonderful messages. This group simply rock. >Most of the messages are pointing to hash. I have to confess that I >never used this and can use little help. Can somebody please give a >brief introduction and point me toward a SUGI paper where I can read >further? > >cu8sfan wrote: > > You may also break up the original dataset without sorting by size and > > then putting it together using the by-statement. > > > > data temp1.bigfile_1 > > temp1.bigfile_2 > > temp1.bigfile_3 > > ; > > set imp_flat.bigfile; > > if _N_ <= 1000000 then output temp1.bigfile_1; > > else if _N_ <= 2000000 then output temp1.bigfile_2; > > else output temp1.bigfile_3; > > run; > > > > Now sort bigfile_1 - bigfile_3 by key_variable. > > > > Now put them back together in sorted order: > > > > data research.sortedfile; > > set temp1.bigfile_1 > > temp1.bigfile_2 > > temp1.bigfile_3 > > ; > > by key_variable; > > run; > > > > Of course with a billion records you'd use a little bit more than just > > three smaller datasets. I'd produce them using macro code. > > > > Richard Reeves schrieb: > > > > > I have similar sized data and I have to break the file into smaller > > > parts by the primary sort field. Remember to use the if/else command > > > sequence. This helps because the file only needs to be read one time. > > > I ran into space issues trying to sort the larger file which made > > > breaking it up necessary. > > > > > > Then sort those smaller files and use the proc append or proc /append > > > function to add them back together. Rich > > > > > > Here is an example with untested code (I chose not to do a macro on >the > > > front of this because of my newness to macros but did use one on the > > > backend): > > > > > > DATA TEMP1.bigfile_1 > > > TEMP1.bigfile_2 > > > TEMP1.bigfile_3 > > > TEMP1.bigfile_4 > > > TEMP1.bigfile_5 > > > ; > > > SET IMP_FLAT.bigfile; > > > IF bigfile_key3 LE 6000000 THEN OUTPUT > > > TEMP1.bigfile_1; > > > ELSE IF bigfile_key3 LE 12000000 THEN OUTPUT > > > TEMP1.bigfile_2; > > > ELSE IF bigfile_key3 LE 18000000 THEN OUTPUT > > > TEMP1.bigfile_3; > > > ELSE IF bigfile_key3 LE 24000000 THEN OUTPUT > > > TEMP1.bigfile_4; > > > ELSE IF bigfile_key3 LE 30000000 THEN OUTPUT > > > TEMP1.bigfile_5; > > > RUN; > > > QUIT; > > > DATA RESEARCH.sortedfile; > > > SET TEMP1.TEMP1.bigfile_1; > > > RUN; > > > %macro APP(FILES); > > > %do i=2 %to &FILES; > > > PROC APPEND DATA=TEMP1.bigfile_&i BASE=permlib.sortedfile; > > > PROC SQL; > > > DROP TABLE TEMP1.bigfile_&i; > > > QUIT; > > > %end; > > > %mend APP; > > > %APP(5) > > > RUN; > > > -----Original Message----- > > > From: SAS(r) Discussion [mailto:SAS-L@listserv.vt.edu] On Behalf Of > > > Sekhar > > > Sent: Monday, November 13, 2006 2:08 PM > > > To: SAS-L@LISTSERV.VT.EDU > > > Subject: Sorting a huge huge dataset > > > > > > Hi > > > I am trying to merge two datasets one having 1.7 billion records and > > > the other one having 35 million records. The first datset has accoutn > > > number and the second datset has opeinign date. The merge is on >account > > > number. So far so good.But sorting the first file on account number is > > > taking almost 8 hours.Any other alternatives for doing this or any > > > other methods for getting the opening date in to the first file. Tag > > > sort, Proc SQl, Indexing?? Any suggestions? Should cahnge the system > > > options like memsize,sortsize etc.. I am more concerned about cutting > > > down the time than space issues.

In addition to the papers already cited, let me point you at Gregg Snell's excellent tutorial "Think FAST! Use Memory Tables (Hashing) for Faster Merging". I think this is exactly what you're looking for.

http://www2.sas.com/proceedings/sugi31/244-31.pdf

HTH, David -- David L. Cassell mathematical statistician Design Pathways 3115 NW Norwood Pl. Corvallis OR 97330

_________________________________________________________________ Use your PC to make calls at very low rates https://voiceoam.pcs.v2s.live.com/partnerredirect.aspx


Back to: Top of message | Previous page | Main SAS-L page