LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous (more recent) messageNext (less recent) messagePrevious (more recent) in topicNext (less recent) in topicPrevious (more recent) by same authorNext (less recent) by same authorPrevious page (March 2008, week 1)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Wed, 5 Mar 2008 11:04:40 -0500
Reply-To:     Tom White <tw2@MAIL.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Tom White <tw2@MAIL.COM>
Subject:      Re: Sorting many small data sets
Comments: To: Peter Crawford <peter.crawford@BLUEYONDER.CO.UK>
Content-Type: text/plain; charset="iso-8859-1"

Thanks Peter.

This is fascinating stuff.

I need to study Paul's response and yours sometime today.

Thanks.

T

----- Original Message ----- From: "Peter Crawford" To: SAS-L@LISTSERV.UGA.EDU Subject: Re: Sorting many small data sets Date: Tue, 4 Mar 2008 16:47:28 -0500

Tom

I'd like to extend the process of handling all these datasets.

%macro mloop( max=, pattern= out$ ) ; %local i ; %do i = 1 %to &max ; %sysfunc( tranwrd( &pattern, $, &i )) %end ; %mend mloop ;

*That variation of the macro would be used like ; data b.food ; merge %mloop( max= 3 ) ; by PROVIDER_ID ; run ;

*The reason for the tranwrd() is that in these situations I often want to use &i more than once. The best example of "why" is to enable some simple code in the data step to provide a definition of the originating datasets in each data step iteration..... ........something like this extension, which creates a $1 variable that supports up to 8 sources ;

I've tested this trick on 7 tables (prepared earlier) :::>

option mprint; %let joins = 7 ; data b.food&joins ; merge %mloop( max= &joins, pattern= out$( in=_in$) ) ; by PROVIDER_ID ; _in_ = input( cats( of _in: ), binary&joins.. ); format _in_ binary&joins..; run ;

This created a nice column looking like _in_ 0000011 1111100 0000010 0000011 1111100 1111100 1000000 1000000 0000011 0000011

Since you are keen on a merge of the 150, this "_IN_" feature suggested by "data _null_" in the last few days, need to be dropped in favour of something that will indicate 150 sources and I think that the "binary." format won't go that far.

Of course, Ed points out: the 150 counts need to be considered ~ and there are more than 8 datasets.

So, lets use these counts to inticate the sources. Instead of deriving a new column, rename the existing counts. That pattern= macro parameter works for us, again !

data b.food&joins ; merge %mloop( max= &joins, pattern= out$( rename= count= count$) ) ; by PROVIDER_ID ; run ;

Here is log from joining just 7 279 data b.food&joins ; 280 merge %mloop( max= &joins, pattern= out$( rename= count= count$) ) MPRINT(MLOOP): out1( rename= count= count1) out2( rename= count= count2) out3( rename= count= count3) out4( rename= count= count4) out5( rename= count= count5) out6( rename= count= count6) out7( rename 281 ; by PROVIDER_ID ; 282 run ;

NOTE: There were 11 observations read from the data set WORK.OUT1. NOTE: There were 12 observations read from the data set WORK.OUT2. NOTE: There were 11 observations read from the data set WORK.OUT3. NOTE: There were 9 observations read from the data set WORK.OUT4. NOTE: There were 8 observations read from the data set WORK.OUT5. NOTE: There were 9 observations read from the data set WORK.OUT6. NOTE: There were 14 observations read from the data set WORK.OUT7. NOTE: The data set B.FOOD7 has 18 observations and 8 variables. NOTE: DATA statement used (Total process time): real time 0.03 seconds cpu time 0.04 seconds

Once it is built, deriving the number of contributing tables is simply n( of count: ) The other stats are as easily available.

However, it might wise to consider some alternative data handling because for 150 tables in the join, my merge created just 19 obs, and took 1.15 secs (instead of the demo above which took just .03 of one second). It is likely the the "runtime to number of tables and rows" relationship is not linear.

How long might a SAS data step merge take for a few thousand rows in each of the 150 tables ? If you want to try, here is the code I used to build my 150 small tables. It was convenient to use that %mloop() again. I expect you could simplify this.

data out0( keep= provider_id ); set sashelp.class( rename= name= provider_id ); run; proc sort ; by provider_id ; run; %mloop( max=150, pattern= %str( data out$; set out0; if ranuni(&i) >.5 ; count= int( ranuni(1)* 100 ); run; /* end of %str */ ) /* end of macro call */ )

PeterC

On Tue, 4 Mar 2008 15:45:44 -0500, Ed Heaton wrote:

> WAIT!!!! > > Your earlier email, Tom, said that you have only two variables - ID and > COUNT - in each of these 150 datasets. Now, I presume ID is really > PROVIDER_ID. So, your only other variable is these datasets exist in > every one of them!!! > > RULE #1: Never perform a merge with like-named variables that are not > listed in the BY statement. > > Suppose OUT1 and OUT2 each have a record where PROVIDER_ID=1001. For > that record, let's assume OUT1 has COUNT=1 and OUT2 has COUNT=2. Then > your output dataset will have a record for PROVIDER_ID=1001 with either > COUNT=1 or COUNT=2. Which value do you want in COUNT? - the value from > OUT1 or the value from OUT2? It might be hard to predict from your > code. > > Now, you are merging 150 datasets!!! What value do you want in COUNT > for your output dataset? - the value from OUT1 or the value from OUT2 or > the value from OUT3 or the value from OUT4 or ... ? > > What, exactly, are you trying to do? > > Ed > > Edward Heaton, Senior Systems Analyst, > Westat (An Employee-Owned Research Corporation), > 1650 Research Boulevard, TB-286, Rockville, MD 20850-3195 > Voice: (301) 610-4818 Fax: (301) 294-2085 > mailto:EdHeaton@Westat.com http://www.Westat.com > > > -----Original Message----- > From: owner-sas-l@listserv.uga.edu [mailto:owner-sas-l@listserv.uga.edu] > On Behalf Of Tom White > Sent: Tuesday, March 04, 2008 2:50 PM > To: data _null_,; SAS-L@listserv.uga.edu > Subject: Re: Re: Sorting many small data sets > > > Hello all again, Following the above post, now that I have sorted my > datasets,I would like to merge them and save the resualt to dataset FOO. > I try to modify Dennis' code but doesn't work: > > %macro mergeloop(max=); > > data b.foo; > > %do I = 1 %to &max; > > merge Out&I; > > by PROVIDER_ID; > > run; > > %end; > > %mend; > > %mergeloop(max=2); Help please. T > > ----- Original Message ----- > From: "data _null_," > To: SAS-L@LISTSERV.UGA.EDU > Subject: Re: Sorting many small data sets > Date: Tue, 4 Mar 2008 14:28:37 -0500 > > > There might be some benefit to using the data from the dictionary > tables. > > data t1 t2 t100 t125 t130; > do id = 1 to 10; > output; output; > end; > run; > data _null_; > set sashelp.vmember(keep=libname memname > where=(libname eq 'WORK' and memname eq: 'T')); > sort = catx(' ','proc sort nodupkey data=', > libname,'.',memname,'out=',libname,cats('.S',memname),'; by ID; > run;'); > call execute(strip(sort)); > put (_all_)(=); > run; > > On Tue, Mar 4, 2008 at 2:06 PM, Tom White wrote: > > This one works. Thanks Dennis. > > > > Thanks everyone for responding to this so fast! > > > > T > > > > > > > > > > ----- Original Message ----- > > From: "Diskin, Dennis" > > To: "Tom White" , SAS-L@LISTSERV.UGA.EDU > > Subject: RE: Sorting many small data sets > > Date: Tue, 4 Mar 2008 14:00:36 -0500 > > > > > > %macro sortloop(max=); > > %do I = 1 %to &max; > > proc sort data=t&I nodupkey out=out&I; > > by ID; > > run; > > %end; > > %mend; > > > > %sortloop(max=150); > > > > -----Original Message----- > > From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf > Of Tom > > White > > Sent: Tuesday, March 04, 2008 1:32 PM > > To: SAS-L@LISTSERV.UGA.EDU > > Subject: Sorting many small data sets > > > > Hello Everyone, > > > > I have many small (about 100K obs and 2 fields: ID and COUNT) > datasets I > > need to sort. > > > > Their names are: T1, T2, T3, ..., T150. > > > > I would like to do this: > > > > proc sort data=T1 NoDupKey out=out1; > > by ID; > > run; > > > > proc sort data=T2 NoDupKey out=out2; > > by ID; > > run; > > > > ... > > ... > > > > proc sort data=T150 NoDupKey out=out150; by ID; run; > > > > How can I do this without all this work? > > > > Thank you. > > > > T

-- Want an e-mail address like mine? Get a free e-mail account today at www.mail.com!


Back to: Top of message | Previous page | Main SAS-L page