Date: Wed, 5 Mar 2008 11:04:40 -0500
Reply-To: Tom White <tw2@MAIL.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Tom White <tw2@MAIL.COM>
Subject: Re: Sorting many small data sets
Content-Type: text/plain; charset="iso-8859-1"
Thanks Peter.
This is fascinating stuff.
I need to study Paul's response and yours sometime today.
Thanks.
T
----- Original Message -----
From: "Peter Crawford"
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: Sorting many small data sets
Date: Tue, 4 Mar 2008 16:47:28 -0500
Tom
I'd like to extend the process of handling all these datasets.
%macro mloop( max=, pattern= out$ ) ;
%local i ;
%do i = 1 %to &max ;
%sysfunc( tranwrd( &pattern, $, &i ))
%end ;
%mend mloop ;
*That variation of the macro would be used like ;
data b.food ;
merge %mloop( max= 3 ) ;
by PROVIDER_ID ;
run ;
*The reason for the tranwrd() is that in these situations I often
want to use &i more than once. The best example of "why" is to enable
some simple code in the data step to provide a definition of the
originating datasets in each data step iteration.....
........something like this extension, which creates a $1 variable
that supports up to 8 sources
;
I've tested this trick on 7 tables (prepared earlier) :::>
option mprint;
%let joins = 7 ;
data b.food&joins ;
merge %mloop( max= &joins, pattern= out$( in=_in$) )
; by PROVIDER_ID ;
_in_ = input( cats( of _in: ), binary&joins.. );
format _in_ binary&joins..;
run ;
This created a nice column looking like
_in_
0000011
1111100
0000010
0000011
1111100
1111100
1000000
1000000
0000011
0000011
Since you are keen on a merge of the 150, this "_IN_" feature suggested
by "data _null_" in the last few days, need to be dropped in favour of
something that will indicate 150 sources and I think that the "binary."
format won't go that far.
Of course, Ed points out: the 150 counts need to be considered ~ and there
are more than 8 datasets.
So, lets use these counts to inticate the sources. Instead of deriving a
new column, rename the existing counts. That pattern= macro parameter
works for us, again !
data b.food&joins ;
merge %mloop( max= &joins, pattern= out$( rename= count= count$) )
; by PROVIDER_ID ;
run ;
Here is log from joining just 7
279 data b.food&joins ;
280 merge %mloop( max= &joins, pattern= out$( rename= count= count$) )
MPRINT(MLOOP): out1( rename= count= count1) out2( rename= count= count2)
out3( rename= count= count3) out4( rename= count= count4)
out5( rename= count= count5) out6( rename= count= count6) out7( rename
281 ; by PROVIDER_ID ;
282 run ;
NOTE: There were 11 observations read from the data set WORK.OUT1.
NOTE: There were 12 observations read from the data set WORK.OUT2.
NOTE: There were 11 observations read from the data set WORK.OUT3.
NOTE: There were 9 observations read from the data set WORK.OUT4.
NOTE: There were 8 observations read from the data set WORK.OUT5.
NOTE: There were 9 observations read from the data set WORK.OUT6.
NOTE: There were 14 observations read from the data set WORK.OUT7.
NOTE: The data set B.FOOD7 has 18 observations and 8 variables.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.04 seconds
Once it is built, deriving the number of contributing tables is simply
n( of count: )
The other stats are as easily available.
However, it might wise to consider some alternative data handling because
for 150 tables in the join, my merge created just 19 obs, and took 1.15
secs (instead of the demo above which took just .03 of one second).
It is likely the the "runtime to number of tables and rows" relationship
is not linear.
How long might a SAS data step merge take for a few thousand rows in each
of the 150 tables ?
If you want to try, here is the code I used to build my 150 small tables.
It was convenient to use that %mloop() again.
I expect you could simplify this.
data out0( keep= provider_id );
set sashelp.class( rename= name= provider_id );
run;
proc sort ; by provider_id ;
run;
%mloop( max=150, pattern= %str( data out$; set out0; if ranuni(&i) >.5 ;
count= int( ranuni(1)* 100 ); run; /* end of %str */ )
/* end of macro call */ )
PeterC
On Tue, 4 Mar 2008 15:45:44 -0500, Ed Heaton wrote:
> WAIT!!!!
>
> Your earlier email, Tom, said that you have only two variables - ID and
> COUNT - in each of these 150 datasets. Now, I presume ID is really
> PROVIDER_ID. So, your only other variable is these datasets exist in
> every one of them!!!
>
> RULE #1: Never perform a merge with like-named variables that are not
> listed in the BY statement.
>
> Suppose OUT1 and OUT2 each have a record where PROVIDER_ID=1001. For
> that record, let's assume OUT1 has COUNT=1 and OUT2 has COUNT=2. Then
> your output dataset will have a record for PROVIDER_ID=1001 with either
> COUNT=1 or COUNT=2. Which value do you want in COUNT? - the value from
> OUT1 or the value from OUT2? It might be hard to predict from your
> code.
>
> Now, you are merging 150 datasets!!! What value do you want in COUNT
> for your output dataset? - the value from OUT1 or the value from OUT2 or
> the value from OUT3 or the value from OUT4 or ... ?
>
> What, exactly, are you trying to do?
>
> Ed
>
> Edward Heaton, Senior Systems Analyst,
> Westat (An Employee-Owned Research Corporation),
> 1650 Research Boulevard, TB-286, Rockville, MD 20850-3195
> Voice: (301) 610-4818 Fax: (301) 294-2085
> mailto:EdHeaton@Westat.com http://www.Westat.com
>
>
> -----Original Message-----
> From: owner-sas-l@listserv.uga.edu [mailto:owner-sas-l@listserv.uga.edu]
> On Behalf Of Tom White
> Sent: Tuesday, March 04, 2008 2:50 PM
> To: data _null_,; SAS-L@listserv.uga.edu
> Subject: Re: Re: Sorting many small data sets
>
>
> Hello all again, Following the above post, now that I have sorted my
> datasets,I would like to merge them and save the resualt to dataset FOO.
> I try to modify Dennis' code but doesn't work:
>
> %macro mergeloop(max=);
>
> data b.foo;
>
> %do I = 1 %to &max;
>
> merge Out&I;
>
> by PROVIDER_ID;
>
> run;
>
> %end;
>
> %mend;
>
> %mergeloop(max=2); Help please. T
>
> ----- Original Message -----
> From: "data _null_,"
> To: SAS-L@LISTSERV.UGA.EDU
> Subject: Re: Sorting many small data sets
> Date: Tue, 4 Mar 2008 14:28:37 -0500
>
>
> There might be some benefit to using the data from the dictionary
> tables.
>
> data t1 t2 t100 t125 t130;
> do id = 1 to 10;
> output; output;
> end;
> run;
> data _null_;
> set sashelp.vmember(keep=libname memname
> where=(libname eq 'WORK' and memname eq: 'T'));
> sort = catx(' ','proc sort nodupkey data=',
> libname,'.',memname,'out=',libname,cats('.S',memname),'; by ID;
> run;');
> call execute(strip(sort));
> put (_all_)(=);
> run;
>
> On Tue, Mar 4, 2008 at 2:06 PM, Tom White wrote:
> > This one works. Thanks Dennis.
> >
> > Thanks everyone for responding to this so fast!
> >
> > T
> >
> >
> >
> >
> > ----- Original Message -----
> > From: "Diskin, Dennis"
> > To: "Tom White" , SAS-L@LISTSERV.UGA.EDU
> > Subject: RE: Sorting many small data sets
> > Date: Tue, 4 Mar 2008 14:00:36 -0500
> >
> >
> > %macro sortloop(max=);
> > %do I = 1 %to &max;
> > proc sort data=t&I nodupkey out=out&I;
> > by ID;
> > run;
> > %end;
> > %mend;
> >
> > %sortloop(max=150);
> >
> > -----Original Message-----
> > From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf
> Of Tom
> > White
> > Sent: Tuesday, March 04, 2008 1:32 PM
> > To: SAS-L@LISTSERV.UGA.EDU
> > Subject: Sorting many small data sets
> >
> > Hello Everyone,
> >
> > I have many small (about 100K obs and 2 fields: ID and COUNT)
> datasets I
> > need to sort.
> >
> > Their names are: T1, T2, T3, ..., T150.
> >
> > I would like to do this:
> >
> > proc sort data=T1 NoDupKey out=out1;
> > by ID;
> > run;
> >
> > proc sort data=T2 NoDupKey out=out2;
> > by ID;
> > run;
> >
> > ...
> > ...
> >
> > proc sort data=T150 NoDupKey out=out150; by ID; run;
> >
> > How can I do this without all this work?
> >
> > Thank you.
> >
> > T
--
Want an e-mail address like mine?
Get a free e-mail account today at www.mail.com!