| Date: | Thu, 11 Aug 2011 15:44:53 -0700 |
| Reply-To: | Jack Hamilton <jfh@STANFORDALUMNI.ORG> |
| Sender: | "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU> |
| From: | Jack Hamilton <jfh@STANFORDALUMNI.ORG> |
| Subject: | Re: Architectural approach for high performance transpose |
|
| In-Reply-To: | <CAEZCysvfxJEMWs4B75UWOaG82yJLMe7PVRSKHa6i6Byh+Ry-3Q@mail.gmail.com> |
| Content-Type: | text/plain; charset=UTF-8 |
Without commenting on the rest of the code, I have a suggestion for making the creation of the format faster.
Instead of
data control;
retain fmtname 'VI' type 'I' hlo 'UJ';
set id;
start = id;
label = _level_;
run;
use
data control (sortedby=id) / view = control;
retain fmtname 'VI' type 'I' hlo 'UJ';
set id;
rename start = id;
rename label = _level_;
run;
These changes reduce I/O, and telling PROC FORMAT that the input is sorted lets it build its tree more efficiently.
--
jfh
"Data _null_;" <iebupdte@GMAIL.COM> wrote:
After thinking about Nat's comment I realized that the program needs
to reset on first.(lastbyvar) and output on last.(lastbyvar). Using
Scott's test data KEY2 and ID are the same so you get the same result
if you reset and output on every obs or (lastbyvar). The are the
same.
Here is the corrected code. This type of mistake is a good reason to
bite the performance bullet and use the double transpose.
options fullstimer=1;
data source;
length key1 key2 8 id $1 var1 var2 8;
do key1=1 to 1e6; * be careful with this.
do key2=1 to 6;
do _n_ = 1 to 6;
id=byte(64+_n_);
/* id=byte(64+key2);*/
var1=int(ranuni(1)*1000);
var2=int(ranuni(1)*1000);
output;
end;
end;
end;
run;
* set by and var variables for proc transpose ;
* these will be used later for post-processing ;
%let by=key1 key2;
%let var=var1 var2;
proc summary data=source(keep=id) nway;
class id;
output out=id(drop=_type_ _freq_) / levels;
run;
data control;
retain fmtname 'VI' type 'I' hlo 'UJ';
set id;
start = id;
label = _level_;
run;
proc format cntlin=control;
run;
proc transpose data=source(obs=0) out=vars;
var &var;
run;
proc sql;
create table trnames as select _name_,id, . as dot from vars,id order by id;
run;
proc transpose data=trnames out=arrays(drop=_name_) delimiter=_;
id _name_ id;
var dot;
run;
filename FT67F001 temp;
data _null_;
file FT67F001;
set vars;
put +3 'array v' _n_ '[*] ' _name_ +(-1) '_:;';
put +3 "if first.%scan(&by,-1) then call missing(of v" _n_ +(-1) '[*]);';
put +3 'v' _n_ '[i] = ' _name_ +(-1) ';';
run;
data codegen;
if 0 then set source(keep=&by);
drop &var;
if 0 then set arrays;
set source;
by &by;
i = input(id,vi.);
%inc FT67F001 / source;
if last.%scan(&by,-1) then output;
drop i id;
run;
proc transpose data=source out=flip;
by &by id;
var &var;
run;
proc transpose data=flip out=doubletranspose(drop=_name_) DELIMITER=_;
by &by;
id _name_ id;
var col1;
run;
proc compare base=doubletranspose compare=codegen note list listequalvars;
run;
785 options fullstimer=1;
786 data source;
787 length key1 key2 8 id $1 var1 var2 8;
788 do key1=1 to 1e6;
789 do key2=1 to 6;
790 do _n_ = 1 to 6;
791 id=byte(64+_n_);
792 /* id=byte(64+key2);*/
793 var1=int(ranuni(1)*1000);
794 var2=int(ranuni(1)*1000);
795 output;
796 end;
797 end;
798 end;
799 run;
NOTE: The data set WORK.SOURCE has 36000000 observations and 5 variables.
NOTE: DATA statement used (Total process time):
real time 33.47 seconds
user cpu time 8.45 seconds
system cpu time 6.34 seconds
Memory 189k
OS Memory 8824k
Timestamp 8/11/2011 12:12:02 AM
800 * set by and var variables for proc transpose ;
801 * these will be used later for post-processing ;
802
803 %let by=key1 key2;
804 %let var=var1 var2;
805
806 proc summary data=source(keep=id) nway;
807 class id;
808 output out=id(drop=_type_ _freq_) / levels;
809 run;
NOTE: There were 36000000 observations read from the data set WORK.SOURCE.
NOTE: The data set WORK.ID has 6 observations and 2 variables.
NOTE: PROCEDURE SUMMARY used (Total process time):
real time 37.97 seconds
user cpu time 9.09 seconds
system cpu time 2.16 seconds
Memory 8777k
OS Memory 16824k
Timestamp 8/11/2011 12:12:40 AM
810 data control;
811 retain fmtname 'VI' type 'I' hlo 'UJ';
812 set id;
813 start = id;
814 label = _level_;
815 run;
NOTE: There were 6 observations read from the data set WORK.ID.
NOTE: The data set WORK.CONTROL has 6 observations and 7 variables.
NOTE: DATA statement used (Total process time):
real time 0.01 seconds
user cpu time 0.01 seconds
system cpu time 0.00 seconds
Memory 184k
OS Memory 8824k
Timestamp 8/11/2011 12:12:40 AM
816 proc format cntlin=control;
NOTE: Informat VI has been output.
817 run;
NOTE: PROCEDURE FORMAT used (Total process time):
real time 0.04 seconds
user cpu time 0.00 seconds
system cpu time 0.00 seconds
Memory 105k
OS Memory 8824k
Timestamp 8/11/2011 12:12:40 AM
NOTE: There were 6 observations read from the data set WORK.CONTROL.
818
819 proc transpose data=source(obs=0) out=vars;
820 var &var;
821 run;
NOTE: There were 0 observations read from the data set WORK.SOURCE.
NOTE: The data set WORK.VARS has 2 observations and 1 variables.
NOTE: PROCEDURE TRANSPOSE used (Total process time):
real time 0.06 seconds
user cpu time 0.00 seconds
system cpu time 0.00 seconds
Memory 2145k
OS Memory 10876k
Timestamp 8/11/2011 12:12:40 AM
822 proc sql;
823 create table trnames as select _name_,id, . as dot from
vars,id order by id;
NOTE: The execution of this query involves performing one or more
Cartesian product joins that can not be optimized.
NOTE: Table WORK.TRNAMES created, with 12 rows and 3 columns.
824 run;
NOTE: PROC SQL statements are executed immediately; The RUN statement
has no effect.
NOTE: PROCEDURE SQL used (Total process time):
real time 0.51 seconds
user cpu time 0.00 seconds
system cpu time 0.01 seconds
Memory 3469k
OS Memory 10884k
Timestamp 8/11/2011 12:12:40 AM
825 proc transpose data=trnames out=arrays(drop=_name_) delimiter=_;
826 id _name_ id;
827 var dot;
828 run;
NOTE: There were 12 observations read from the data set WORK.TRNAMES.
NOTE: The data set WORK.ARRAYS has 1 observations and 12 variables.
NOTE: PROCEDURE TRANSPOSE used (Total process time):
real time 0.01 seconds
user cpu time 0.01 seconds
system cpu time 0.00 seconds
Memory 2142k
OS Memory 10876k
Timestamp 8/11/2011 12:12:40 AM
829 filename FT67F001 temp;
830 data _null_;
831 file FT67F001;
832 set vars;
833 put +3 'array v' _n_ '[*] ' _name_ +(-1) '_:;';
834 put +3 "if first.%scan(&by,-1) then call missing(of v" _n_
+(-1) '[*]);';
835 put +3 'v' _n_ '[i] = ' _name_ +(-1) ';';
836 run;
NOTE: The file FT67F001 is:
Filename=C:\Users\zz\AppData\Local\Temp\SAS Temporary
Files\_TD4984\#LN00041,
RECFM=V,LRECL=256,File Size (bytes)=0,
Last Modified=11Aug2011:00:12:40,
Create Time=11Aug2011:00:12:40
NOTE: 6 records were written to the file FT67F001.
The minimum record length was 17.
The maximum record length was 45.
NOTE: There were 2 observations read from the data set WORK.VARS.
NOTE: DATA statement used (Total process time):
real time 0.04 seconds
user cpu time 0.01 seconds
system cpu time 0.00 seconds
Memory 230k
OS Memory 8824k
Timestamp 8/11/2011 12:12:40 AM
837
838 data codegen;
839 if 0 then set source(keep=&by);
840 drop &var;
841 if 0 then set arrays;
842 set source;
843 by &by;
844 i = input(id,vi.);
845 %inc FT67F001 / source;
NOTE: %INCLUDE (level 1) file FT67F001 is file
C:\Users\zz\AppData\Local\Temp\SAS Temporary Files\_TD4984\#LN00041.
846 + array v1 [*] var1_:;
847 + if first.key2 then call missing(of v1[*]);
848 + v1 [i] = var1;
849 + array v2 [*] var2_:;
850 + if first.key2 then call missing(of v2[*]);
851 + v2 [i] = var2;
NOTE: %INCLUDE (level 1) ending.
852 if last.%scan(&by,-1) then output;
853 drop i id;
854 run;
NOTE: There were 36000000 observations read from the data set WORK.SOURCE.
NOTE: The data set WORK.CODEGEN has 6000000 observations and 14 variables.
NOTE: DATA statement used (Total process time):
real time 21.93 seconds
user cpu time 14.14 seconds
system cpu time 3.41 seconds
Memory 303k
OS Memory 8824k
Timestamp 8/11/2011 12:13:02 AM
855
856 proc transpose data=source out=flip;
857 by &by id;
858 var &var;
859 run;
NOTE: There were 36000000 observations read from the data set WORK.SOURCE.
NOTE: The data set WORK.FLIP has 72000000 observations and 5 variables.
NOTE: PROCEDURE TRANSPOSE used (Total process time):
real time 3:09.30
user cpu time 1:00.06
system cpu time 17.95 seconds
Memory 2164k
OS Memory 10876k
Timestamp 8/11/2011 12:16:12 AM
860 proc transpose data=flip out=doubletranspose(drop=_name_) DELIMITER=_;
861 by &by;
862 id _name_ id;
863 var col1;
864 run;
NOTE: There were 72000000 observations read from the data set WORK.FLIP.
NOTE: The data set WORK.DOUBLETRANSPOSE has 6000000 observations and
14 variables.
NOTE: PROCEDURE TRANSPOSE used (Total process time):
real time 8:08.48
user cpu time 4:37.29
system cpu time 20.56 seconds
Memory 2172k
OS Memory 10876k
Timestamp 8/11/2011 12:24:20 AM
865
866 proc compare base=doubletranspose compare=codegen note list listequalvars;
867 run;
NOTE: No unequal values were found. All values compared are exactly equal.
NOTE: The data sets WORK.DOUBLETRANSPOSE and WORK.CODEGEN are equal in
all respects.
NOTE: There were 6000000 observations read from the data set
WORK.DOUBLETRANSPOSE.
NOTE: There were 6000000 observations read from the data set WORK.CODEGEN.
NOTE: PROCEDURE COMPARE used (Total process time):
real time 43.55 seconds
user cpu time 4.33 seconds
system cpu time 0.87 seconds
Memory 274k
OS Memory 8824k
Timestamp 8/11/2011 12:25:04 AM
|