Date: Fri, 12 Jul 2002 01:41:06 +0100
Reply-To: Peter Crawford <peter.crawford@DB.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Peter Crawford <peter.crawford@DB.COM>
Subject: Re: Q concerning SQL -> SAS
Content-type: text/plain; charset=iso-8859-1
Hi Sig
my earlier prejudice of summary v datastep was based
on experience using sum(), max() and min() in a datastep.
Not the min and max operators we have here.
Since seeing Paul D's evidence that the operator is faster
than the function I haven't made any comparisons. I expect
this 2 second margin may have been absorbed in the 2e6
function calls.
My earlier posting showed runtimes on a big-ish aix box.
What I didn't post was some stats from my sas v8.1 on winNT, ....
625 DATA Year_DATA2;
626 do while( not done );
627 SET your.data(KEEP=Year) END = DONE;
628 Begin = Begin >< Year;
629 End = End <> Year;
630 end;
631 output;
632 stop;
633 RUN;
NOTE: There were 2000000 observations read from the data set YOUR.DATA.
NOTE: The data set WORK.YEAR_DATA2 has 1 observations and 3 variables.
NOTE: DATA statement used:
real time 13.21 seconds
cpu time 3.60 seconds
634 PROC SQL;
635 CREATE TABLE YEARS2 AS
636 SELECT MIN(Year) as begin, Max(Year) as ends
637 FROM your.data;
NOTE: Table WORK.YEARS2 created, with 1 rows and 2 columns.
638 QUIT;
NOTE: PROCEDURE SQL used:
real time 13.42 seconds
cpu time 4.10 seconds
639 proc summary missing data=your.data;
640 var year ;
641 output out=stats min(year)= begin max(year)=end ;
642 run;
NOTE: There were 2000000 observations read from the data set YOUR.DATA.
NOTE: The data set WORK.STATS has 1 observations and 4 variables.
NOTE: PROCEDURE SUMMARY used:
real time 13.43 seconds
cpu time 3.04 seconds
But this is only showing very small time differences
between summary, sql and datastep
Running things a few hours later, the times have increased a little
656 DATA Year_DATA3; * with functions not operators !;
657 do while( not done );
658 SET your.data(KEEP=Year) END = DONE;
659 Begin = min( Begin, Year);
660 End = max( End, Year );
661 end;
662 output;
663 stop;
664 RUN;
NOTE: There were 2000000 observations read from the data set YOUR.DATA.
NOTE: The data set WORK.YEAR_DATA3 has 1 observations and 3 variables.
NOTE: DATA statement used:
real time 14.96 seconds
cpu time 2.65 seconds
665 PROC SQL;
666 CREATE TABLE YEARS2 AS
667 SELECT MIN(Year) as begin, Max(Year) as ends
668 FROM your.data;
NOTE: Table WORK.YEARS2 created, with 1 rows and 2 columns.
669 QUIT;
NOTE: PROCEDURE SQL used:
real time 14.79 seconds
cpu time 4.81 seconds
679 proc summary missing data=your.data;
680 var year ;
681 output out=stats min(year)= begin max(year)=end ;
682 run;
NOTE: There were 2000000 observations read from the data set YOUR.DATA.
NOTE: The data set WORK.STATS has 1 observations and 4 variables.
NOTE: PROCEDURE SUMMARY used:
real time 14.71 seconds
cpu time 3.01 seconds
Still very little to choose a _best_ alternative, because I think
runtime is swamped by I/O
I introduced min/max functions to investigate that
possible reasons for my earlier prejudice...
It seems too minor to be the "probable cause" !
(other aspects of my earlier comparisons:
the dataset had >200 numeric variables;
the datastep was using arrays to reduce syntax
where proc summary might be more efficient
with internal management of arrays; .......
but I don't feel inclined to go back there again !
see later )
Here, the cpu time relationship hints at more significant
differences ... but that is very likely to change in the v9 future of
...... "multi-threading for better elapse time" !
later
Peter
Ok I couldn't leave it until I tried summarising
a dataset with 100 numeric variables !!
Proc summary remains my favourite because
it seems fastest when handlng many numeric vars.
Although proc sql is almost as fast, proc summary
needs no preliminary preparation of "wallpaper" syntax
for each variable/statistic
My old prejudice seems to be explained by the slow data step
which used arrays to avoid "wallpaper" syntax. But that results
in much more processing. When the array processing
is replaced with sql-generated wallpaper syntax in
macro variables, the run time fell from 22/20 (elapse/cpu)
down to 13/12. It is very much closer to the sql and summary
times, but not as good as either
method elapse cpu time
poor datastep 22.21 / 20.39
proc sql 11.34 / 7.28
proc summary 10.57 / 5.48
wallpaper datastep 12.73 / 12.06
1092 data your.dat;
1093 retain c1 - c10 '12345' ;
1094 array years year1-year100; ;
1095 do _n_ = 1 to 2e5;
1096 do over years ;
1097 years = 1960 + ranuni(1) * 43 ;
1098 end;
1099 output;
1100 end;
1101 run;
NOTE: The data set YOUR.DAT has 200000 observations and 110 variables.
NOTE: DATA statement used:
real time 25.68 seconds
cpu time 19.87 seconds
1102 DATA Year_DAT;
1103 SET your.dat END = DONE;
1104 array years year: ;
1105 array mins(100) ;
1106 array maxs(100) ;
1107 do over years;
1108 mins(_I_) = years min mins(_I_) ;
1109 maxs(_I_) = years max maxs(_I_) ;
1110 end;
1111 IF DONE ;
1112 drop year: ;
1113 RUN;
NOTE: There were 200000 observations read from the data set YOUR.DAT.
NOTE: The data set WORK.YEAR_DAT has 1 observations and 210 variables.
NOTE: DATA statement used:
real time 22.21 seconds
cpu time 20.39 seconds
1114 PROC SQL noprint ;
1115 select 'min( ' !! trim(name) !! ' ) as min_' !! trim(name)
1116 , 'max( ' !! trim(name) !! ' ) as max_' !! trim(name)
1117 into :mins separated by ', '
1118 , :maxs separated by ', '
1119 from dictionary.columns
1120 where libname = 'YOUR' and memname = 'DAT' and type = 'num'
1121 order by varnum ;
NOTE: The query as specified involves ordering by an item that doesn't appear in its
SELECT clause.
1122 CREATE TABLE YEARSX AS
1123 SELECT &mins, &maxs, count(*) as _freq_
SYMBOLGEN: Macro variable MINS resolves to min( year1 ) as min_year1, min( year2 )
...... many lines removed
min( year100 ) as min_year100
SYMBOLGEN: Macro variable MAXS resolves to max( year1 ) as max_year1, max( year2 )
...... many lines removed
max( year100 ) as max_year100
1124 from your.dat
1125 ;
NOTE: Table WORK.YEARSX created, with 1 rows and 201 columns.
1126 QUIT;
NOTE: PROCEDURE SQL used:
real time 11.34 seconds
cpu time 7.28 seconds
1127 proc summary missing data=your.dat;
1128 var _numeric_ ;
1129 output out=stats min()= max()= /autoname ;
1130 run;
NOTE: There were 200000 observations read from the data set YOUR.DAT.
NOTE: The data set WORK.STATS has 1 observations and 202 variables.
NOTE: PROCEDURE SUMMARY used:
real time 10.57 seconds
cpu time 5.48 seconds
1131 PROC SQL noprint ;
1132 select 'min_' !! trim(name) !! ' = min_' !! trim(name) !! ' min ' !!
1132! trim(name)
1133 , 'max_' !! trim(name) !! ' = max_' !! trim(name) !! ' max ' !!
1133! trim(name)
1134 into :mins separated by '; '
1135 , :maxs separated by '; '
1136 from dictionary.columns
1137 where libname = 'YOUR' and memname = 'DAT' and type = 'num'
1138 order by varnum ;
NOTE: The query as specified involves ordering by an item that doesn't appear in its
SELECT clause.
1139 quit;
NOTE: PROCEDURE SQL used:
real time 0.04 seconds
cpu time 0.04 seconds
1140 data wallpaper;
1141 do until( end_dat );
1142 set your.dat end=end_dat ;
1143 &mins ;
SYMBOLGEN: Macro variable MINS resolves to min_year1 = min_year1 min year1;
...... many lines removed
min_year100 = min_year100 min year100
1144 &maxs;
SYMBOLGEN: Macro variable MAXS resolves to max_year1 = max_year1 max year1;
...... many lines removed
max_year100 = max_year100 max year100
1145 end;
1146 output;
1147 stop;
1148 run;
NOTE: There were 200000 observations read from the data set YOUR.DAT.
NOTE: The data set WORK.WALLPAPER has 1 observations and 310 variables.
NOTE: DATA statement used:
real time 12.73 seconds
cpu time 12.06 seconds
Datum: 11.07.2002 22:50
An: Peter Crawford/Zentrale/DeuBaExt@Zentrale
SAS-L@LISTSERV.UGA.EDU
Betreff: RE: Q concerning SQL -> SAS
Nachrichtentext:
In my tests under MS Windows Version 8.2 the better data step and SQL
solutions post almost identical times, while the SUMMARY lags by a couple of
seconds. The consistency of the better data step and SQL solutions also held
true in other tests. I wonder if it takes less time to open a WORK library
file for output ???
Sig
34
35 data data;
36 retain c1 - c20 '12345' ; * just padding ;
37 do _n_ = 1 to 2e6;
38 year = year( ranuni(1) * "&sysdate9"d ) ;
39 output;
40 end;
41 run;
NOTE: The data set WORK.DATA has 2000000 observations and 21 variables.
NOTE: DATA statement used:
real time 2:27.26
42
43 proc summary missing data=data;
44 var year ;
45 output out=stats min(year)= begin max(year)=end ;
46 run;
NOTE: There were 2000000 observations read from the data set WORK.DATA.
NOTE: The data set WORK.STATS has 1 observations and 4 variables.
NOTE: PROCEDURE SUMMARY used:
real time 17.78 seconds
47 PROC SQL;
48 CREATE TABLE YEARS AS
49 SELECT MIN(Year), Max(Year)
50 FROM (SELECT DISTINCT Year FROM data);
NOTE: Table WORK.YEARS created, with 1 rows and 2 columns.
51 QUIT;
NOTE: PROCEDURE SQL used:
real time 31.85 seconds
52 DATA Year_DATA;
53 SET data(KEEP=Year) END = DONE;
54 RETAIN Begin 0 END 0;
55 Begin = Begin >< Year;
56 End = End <> Year;
NOTE: The "<>" operator is interpreted as "MAX".
57 IF DONE THEN OUTPUT;
58 RUN;
NOTE: There were 2000000 observations read from the data set WORK.DATA.
NOTE: The data set WORK.YEAR_DATA has 1 observations and 3 variables.
NOTE: DATA statement used:
real time 15.98 seconds
59
60 * I hoped simplifying the sql might speed it up;
61
62 PROC SQL;
63 CREATE TABLE YEARS2 AS
64 SELECT MIN(Year) as begin, Max(Year) as ends
65 FROM data;
NOTE: Table WORK.YEARS2 created, with 1 rows and 2 columns.
66 QUIT;
NOTE: PROCEDURE SQL used:
real time 15.99 seconds
-----Original Message-----
From: Peter Crawford [mailto:peter.crawford@DB.COM]
Sent: Thursday, July 11, 2002 5:04 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: Q concerning SQL -> SAS
I have found simple stats can produce faster from proc summary
than a data step:
Certainly proc summary was much faster than proc sql.
For testing, the table your.data was created with
21 data your.data;
22 retain c1 - c20 '12345' ; * just padding ;
23 do _n_ = 1 to 2e6;
24 year = year( ranuni(1) * "&sysdate9"d ) ;
25 output;
26 end;
27 run;
NOTE: The data set YOUR.DATA has 2000000 observations and 21 variables.
NOTE: DATA statement used:
real time 13.38 seconds
cpu time 9.12 seconds
28 proc summary missing data=your.data;
29 var year ;
30 output out=stats min(year)= begin max(year)=end ;
31 run;
NOTE: There were 2000000 observations read from the data set YOUR.DATA.
NOTE: The data set WORK.STATS has 1 observations and 4 variables.
NOTE: PROCEDURE SUMMARY used:
real time 2.01 seconds
cpu time 1.90 seconds
32 PROC SQL;
33 CREATE TABLE YEARS AS
34 SELECT MIN(Year), Max(Year)
35 FROM (SELECT DISTINCT Year FROM your.data);
NOTE: Table WORK.YEARS created, with 1 rows and 2 columns.
36 QUIT;
NOTE: PROCEDURE SQL used:
real time 11.42 seconds
cpu time 11.38 seconds
37 DATA Year_DATA;
38 SET your.data(KEEP=Year) END = DONE;
39 RETAIN Begin 0 END 0;
40 Begin = Begin >< Year;
41 End = End <> Year;
42 IF DONE THEN OUTPUT;
43 RUN;
NOTE: There were 2000000 observations read from the data set YOUR.DATA.
NOTE: The data set WORK.YEAR_DATA has 1 observations and 3 variables.
NOTE: DATA statement used:
real time 2.22 seconds
cpu time 2.19 seconds
* I hoped simplifying the sql might speed it up;
53 PROC SQL;
54 CREATE TABLE YEARS2 AS
55 SELECT MIN(Year) as begin, Max(Year) as ends
56 FROM your.data;
NOTE: Table WORK.YEARS2 created, with 1 rows and 2 columns.
57 QUIT;
NOTE: PROCEDURE SQL used:
real time 3.79 seconds
cpu time 3.77 seconds
But although the 3.79 secs is very much better than 11.42 secs,
it is still noticeably slower than 2.01
Datum: 11.07.2002 17:56
An: SAS-L@LISTSERV.UGA.EDU
Antwort an: Witness <bmeyer67@CALVIN.EDU>
Betreff: Re: Q concerning SQL -> SAS
Nachrichtentext:
Thanks. I put this into my test program (removing the old stuff), and
here's the timing I got, in case anyone is interested:
DATAstep PROC SQL
Real: 36.14 40.69
CPU: 4.79 10.10
Note: this is on my official data. SAS reports 2631466 observations for
the DATA step. (It doesn't mention for the SQL.)
And here's the code:
DATA Year_DATA;
SET DB.myTable(KEEP=Year) END = DONE;
RETAIN Begin 0 END 0;
Begin = Begin >< Year;
End = End <> Year;
IF DONE THEN OUTPUT;
RUN;
PROC SQL;
CREATE TABLE YEARS AS
SELECT MIN(Year) AS Begin, MAX(Year) AS End
FROM (SELECT DISTINCT Year FROM DB.myTable);
QUIT;
Thanks!
Benjamen R. Meyer
P.S. With the removal of the SELECT DISTINCT query, the SQL is about as
fast as the DATA step. Here's the timing:
Real: 36.28
CPU: 5.50
Just for those curious.
> Here is a data step solution:
>
> data years;
> set test end = done; /*substitute your data set name here*/
> retain minyear 0 maxyear 0;
> minyear = minyear >< year;
> maxyear = maxyear <> year;
> if done then output;
> run;
>
> And here is the equivelant SQL:
>
> CREATE TABLE YEARS AS
> SELECT MIN(Year), Max(Year)
>
> After generating some test data containing duplicate years
> from 1965 to
> 2002, I ran all three (including the select with the
> distinct) and got:
>
> With Select Distinct 09:58 Thursday,
> July 11, 2002
> 1
>
> Obs _TEMG001 _TEMG002
>
> 1 1965 2002
> Without Select Distinct 09:58 Thursday,
> July 11, 2002
> 2
>
> Obs _TEMG001 _TEMG002
>
> 1 1965 2002
> Data Step Version 09:58 Thursday,
> July 11, 2002
> 3
>
> Obs year minyear maxyear
>
> 1 2002 0 2002
>
> The tiny data set I used didn't produce meaningful timings.
> However, just
> getting rid of the sort is the big winner.
>
> Jonathan Goldberg
> Missouri Alcoholism Research Center
> Washington University School of Medicine
> 40 N. Kingshighway, Suite One
> St. Louis, MO 63108
> 314-286-2212
>
>
--
--
Diese E-Mail enthält vertrauliche und/oder rechtlich geschützte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese E-Mail irrtümlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Mail. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Mail ist nicht gestattet.
This e-mail may contain confidential and/or privileged information. If you are not the intended recipient (or have received this e-mail in error) please notify the sender immediately and destroy this e-mail. Any unauthorized copying, disclosure or distribution of the material in this e-mail is strictly forbidden.