LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (July 2002, week 2)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Fri, 12 Jul 2002 01:41:06 +0100
Reply-To:     Peter Crawford <peter.crawford@DB.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Peter Crawford <peter.crawford@DB.COM>
Subject:      Re: Q concerning SQL -> SAS
Comments: To: Sigurd Hermansen <HERMANS1@WESTAT.com>
Content-type: text/plain; charset=iso-8859-1

Hi Sig my earlier prejudice of summary v datastep was based on experience using sum(), max() and min() in a datastep. Not the min and max operators we have here. Since seeing Paul D's evidence that the operator is faster than the function I haven't made any comparisons. I expect this 2 second margin may have been absorbed in the 2e6 function calls. My earlier posting showed runtimes on a big-ish aix box. What I didn't post was some stats from my sas v8.1 on winNT, .... 625 DATA Year_DATA2; 626 do while( not done ); 627 SET your.data(KEEP=Year) END = DONE; 628 Begin = Begin >< Year; 629 End = End <> Year; 630 end; 631 output; 632 stop; 633 RUN;

NOTE: There were 2000000 observations read from the data set YOUR.DATA. NOTE: The data set WORK.YEAR_DATA2 has 1 observations and 3 variables. NOTE: DATA statement used: real time 13.21 seconds cpu time 3.60 seconds

634 PROC SQL; 635 CREATE TABLE YEARS2 AS 636 SELECT MIN(Year) as begin, Max(Year) as ends 637 FROM your.data; NOTE: Table WORK.YEARS2 created, with 1 rows and 2 columns.

638 QUIT; NOTE: PROCEDURE SQL used: real time 13.42 seconds cpu time 4.10 seconds

639 proc summary missing data=your.data; 640 var year ; 641 output out=stats min(year)= begin max(year)=end ; 642 run;

NOTE: There were 2000000 observations read from the data set YOUR.DATA. NOTE: The data set WORK.STATS has 1 observations and 4 variables. NOTE: PROCEDURE SUMMARY used: real time 13.43 seconds cpu time 3.04 seconds

But this is only showing very small time differences between summary, sql and datastep

Running things a few hours later, the times have increased a little 656 DATA Year_DATA3; * with functions not operators !; 657 do while( not done ); 658 SET your.data(KEEP=Year) END = DONE; 659 Begin = min( Begin, Year); 660 End = max( End, Year ); 661 end; 662 output; 663 stop; 664 RUN;

NOTE: There were 2000000 observations read from the data set YOUR.DATA. NOTE: The data set WORK.YEAR_DATA3 has 1 observations and 3 variables. NOTE: DATA statement used: real time 14.96 seconds cpu time 2.65 seconds

665 PROC SQL; 666 CREATE TABLE YEARS2 AS 667 SELECT MIN(Year) as begin, Max(Year) as ends 668 FROM your.data; NOTE: Table WORK.YEARS2 created, with 1 rows and 2 columns.

669 QUIT; NOTE: PROCEDURE SQL used: real time 14.79 seconds cpu time 4.81 seconds

679 proc summary missing data=your.data; 680 var year ; 681 output out=stats min(year)= begin max(year)=end ; 682 run;

NOTE: There were 2000000 observations read from the data set YOUR.DATA. NOTE: The data set WORK.STATS has 1 observations and 4 variables. NOTE: PROCEDURE SUMMARY used: real time 14.71 seconds cpu time 3.01 seconds

Still very little to choose a _best_ alternative, because I think runtime is swamped by I/O I introduced min/max functions to investigate that possible reasons for my earlier prejudice... It seems too minor to be the "probable cause" ! (other aspects of my earlier comparisons: the dataset had >200 numeric variables; the datastep was using arrays to reduce syntax where proc summary might be more efficient with internal management of arrays; ....... but I don't feel inclined to go back there again ! see later )

Here, the cpu time relationship hints at more significant differences ... but that is very likely to change in the v9 future of ...... "multi-threading for better elapse time" !

later Peter

Ok I couldn't leave it until I tried summarising a dataset with 100 numeric variables !! Proc summary remains my favourite because it seems fastest when handlng many numeric vars. Although proc sql is almost as fast, proc summary needs no preliminary preparation of "wallpaper" syntax for each variable/statistic My old prejudice seems to be explained by the slow data step which used arrays to avoid "wallpaper" syntax. But that results in much more processing. When the array processing is replaced with sql-generated wallpaper syntax in macro variables, the run time fell from 22/20 (elapse/cpu) down to 13/12. It is very much closer to the sql and summary times, but not as good as either

method elapse cpu time poor datastep 22.21 / 20.39 proc sql 11.34 / 7.28 proc summary 10.57 / 5.48 wallpaper datastep 12.73 / 12.06

1092 data your.dat; 1093 retain c1 - c10 '12345' ; 1094 array years year1-year100; ; 1095 do _n_ = 1 to 2e5; 1096 do over years ; 1097 years = 1960 + ranuni(1) * 43 ; 1098 end; 1099 output; 1100 end; 1101 run;

NOTE: The data set YOUR.DAT has 200000 observations and 110 variables. NOTE: DATA statement used: real time 25.68 seconds cpu time 19.87 seconds

1102 DATA Year_DAT; 1103 SET your.dat END = DONE; 1104 array years year: ; 1105 array mins(100) ; 1106 array maxs(100) ; 1107 do over years; 1108 mins(_I_) = years min mins(_I_) ; 1109 maxs(_I_) = years max maxs(_I_) ; 1110 end; 1111 IF DONE ; 1112 drop year: ; 1113 RUN;

NOTE: There were 200000 observations read from the data set YOUR.DAT. NOTE: The data set WORK.YEAR_DAT has 1 observations and 210 variables. NOTE: DATA statement used: real time 22.21 seconds cpu time 20.39 seconds

1114 PROC SQL noprint ; 1115 select 'min( ' !! trim(name) !! ' ) as min_' !! trim(name) 1116 , 'max( ' !! trim(name) !! ' ) as max_' !! trim(name) 1117 into :mins separated by ', ' 1118 , :maxs separated by ', ' 1119 from dictionary.columns 1120 where libname = 'YOUR' and memname = 'DAT' and type = 'num' 1121 order by varnum ; NOTE: The query as specified involves ordering by an item that doesn't appear in its SELECT clause. 1122 CREATE TABLE YEARSX AS 1123 SELECT &mins, &maxs, count(*) as _freq_ SYMBOLGEN: Macro variable MINS resolves to min( year1 ) as min_year1, min( year2 ) ...... many lines removed min( year100 ) as min_year100 SYMBOLGEN: Macro variable MAXS resolves to max( year1 ) as max_year1, max( year2 ) ...... many lines removed max( year100 ) as max_year100 1124 from your.dat 1125 ; NOTE: Table WORK.YEARSX created, with 1 rows and 201 columns.

1126 QUIT; NOTE: PROCEDURE SQL used: real time 11.34 seconds cpu time 7.28 seconds

1127 proc summary missing data=your.dat; 1128 var _numeric_ ; 1129 output out=stats min()= max()= /autoname ; 1130 run;

NOTE: There were 200000 observations read from the data set YOUR.DAT. NOTE: The data set WORK.STATS has 1 observations and 202 variables. NOTE: PROCEDURE SUMMARY used: real time 10.57 seconds cpu time 5.48 seconds

1131 PROC SQL noprint ; 1132 select 'min_' !! trim(name) !! ' = min_' !! trim(name) !! ' min ' !! 1132! trim(name) 1133 , 'max_' !! trim(name) !! ' = max_' !! trim(name) !! ' max ' !! 1133! trim(name) 1134 into :mins separated by '; ' 1135 , :maxs separated by '; ' 1136 from dictionary.columns 1137 where libname = 'YOUR' and memname = 'DAT' and type = 'num' 1138 order by varnum ; NOTE: The query as specified involves ordering by an item that doesn't appear in its SELECT clause. 1139 quit; NOTE: PROCEDURE SQL used: real time 0.04 seconds cpu time 0.04 seconds

1140 data wallpaper; 1141 do until( end_dat ); 1142 set your.dat end=end_dat ; 1143 &mins ; SYMBOLGEN: Macro variable MINS resolves to min_year1 = min_year1 min year1; ...... many lines removed min_year100 = min_year100 min year100 1144 &maxs; SYMBOLGEN: Macro variable MAXS resolves to max_year1 = max_year1 max year1; ...... many lines removed max_year100 = max_year100 max year100 1145 end; 1146 output; 1147 stop; 1148 run;

NOTE: There were 200000 observations read from the data set YOUR.DAT. NOTE: The data set WORK.WALLPAPER has 1 observations and 310 variables. NOTE: DATA statement used: real time 12.73 seconds cpu time 12.06 seconds

Datum: 11.07.2002 22:50 An: Peter Crawford/Zentrale/DeuBaExt@Zentrale SAS-L@LISTSERV.UGA.EDU

Betreff: RE: Q concerning SQL -> SAS Nachrichtentext:

In my tests under MS Windows Version 8.2 the better data step and SQL solutions post almost identical times, while the SUMMARY lags by a couple of seconds. The consistency of the better data step and SQL solutions also held true in other tests. I wonder if it takes less time to open a WORK library file for output ??? Sig

34 35 data data; 36 retain c1 - c20 '12345' ; * just padding ; 37 do _n_ = 1 to 2e6; 38 year = year( ranuni(1) * "&sysdate9"d ) ; 39 output; 40 end; 41 run;

NOTE: The data set WORK.DATA has 2000000 observations and 21 variables. NOTE: DATA statement used: real time 2:27.26

42 43 proc summary missing data=data; 44 var year ; 45 output out=stats min(year)= begin max(year)=end ; 46 run;

NOTE: There were 2000000 observations read from the data set WORK.DATA. NOTE: The data set WORK.STATS has 1 observations and 4 variables. NOTE: PROCEDURE SUMMARY used: real time 17.78 seconds

47 PROC SQL; 48 CREATE TABLE YEARS AS 49 SELECT MIN(Year), Max(Year) 50 FROM (SELECT DISTINCT Year FROM data); NOTE: Table WORK.YEARS created, with 1 rows and 2 columns.

51 QUIT; NOTE: PROCEDURE SQL used: real time 31.85 seconds

52 DATA Year_DATA; 53 SET data(KEEP=Year) END = DONE; 54 RETAIN Begin 0 END 0; 55 Begin = Begin >< Year; 56 End = End <> Year; NOTE: The "<>" operator is interpreted as "MAX". 57 IF DONE THEN OUTPUT; 58 RUN;

NOTE: There were 2000000 observations read from the data set WORK.DATA. NOTE: The data set WORK.YEAR_DATA has 1 observations and 3 variables. NOTE: DATA statement used: real time 15.98 seconds

59 60 * I hoped simplifying the sql might speed it up; 61 62 PROC SQL; 63 CREATE TABLE YEARS2 AS 64 SELECT MIN(Year) as begin, Max(Year) as ends 65 FROM data; NOTE: Table WORK.YEARS2 created, with 1 rows and 2 columns.

66 QUIT; NOTE: PROCEDURE SQL used: real time 15.99 seconds

-----Original Message----- From: Peter Crawford [mailto:peter.crawford@DB.COM] Sent: Thursday, July 11, 2002 5:04 PM To: SAS-L@LISTSERV.UGA.EDU Subject: Re: Q concerning SQL -> SAS

I have found simple stats can produce faster from proc summary than a data step: Certainly proc summary was much faster than proc sql. For testing, the table your.data was created with 21 data your.data; 22 retain c1 - c20 '12345' ; * just padding ; 23 do _n_ = 1 to 2e6; 24 year = year( ranuni(1) * "&sysdate9"d ) ; 25 output; 26 end; 27 run;

NOTE: The data set YOUR.DATA has 2000000 observations and 21 variables. NOTE: DATA statement used: real time 13.38 seconds cpu time 9.12 seconds

28 proc summary missing data=your.data; 29 var year ; 30 output out=stats min(year)= begin max(year)=end ; 31 run;

NOTE: There were 2000000 observations read from the data set YOUR.DATA. NOTE: The data set WORK.STATS has 1 observations and 4 variables. NOTE: PROCEDURE SUMMARY used: real time 2.01 seconds cpu time 1.90 seconds

32 PROC SQL; 33 CREATE TABLE YEARS AS 34 SELECT MIN(Year), Max(Year) 35 FROM (SELECT DISTINCT Year FROM your.data); NOTE: Table WORK.YEARS created, with 1 rows and 2 columns.

36 QUIT; NOTE: PROCEDURE SQL used: real time 11.42 seconds cpu time 11.38 seconds

37 DATA Year_DATA; 38 SET your.data(KEEP=Year) END = DONE; 39 RETAIN Begin 0 END 0; 40 Begin = Begin >< Year; 41 End = End <> Year; 42 IF DONE THEN OUTPUT; 43 RUN;

NOTE: There were 2000000 observations read from the data set YOUR.DATA. NOTE: The data set WORK.YEAR_DATA has 1 observations and 3 variables. NOTE: DATA statement used: real time 2.22 seconds cpu time 2.19 seconds

* I hoped simplifying the sql might speed it up;

53 PROC SQL; 54 CREATE TABLE YEARS2 AS 55 SELECT MIN(Year) as begin, Max(Year) as ends 56 FROM your.data; NOTE: Table WORK.YEARS2 created, with 1 rows and 2 columns.

57 QUIT; NOTE: PROCEDURE SQL used: real time 3.79 seconds cpu time 3.77 seconds

But although the 3.79 secs is very much better than 11.42 secs, it is still noticeably slower than 2.01

Datum: 11.07.2002 17:56 An: SAS-L@LISTSERV.UGA.EDU

Antwort an: Witness <bmeyer67@CALVIN.EDU>

Betreff: Re: Q concerning SQL -> SAS Nachrichtentext:

Thanks. I put this into my test program (removing the old stuff), and here's the timing I got, in case anyone is interested:

DATAstep PROC SQL Real: 36.14 40.69 CPU: 4.79 10.10

Note: this is on my official data. SAS reports 2631466 observations for the DATA step. (It doesn't mention for the SQL.) And here's the code:

DATA Year_DATA; SET DB.myTable(KEEP=Year) END = DONE; RETAIN Begin 0 END 0; Begin = Begin >< Year; End = End <> Year; IF DONE THEN OUTPUT; RUN; PROC SQL; CREATE TABLE YEARS AS SELECT MIN(Year) AS Begin, MAX(Year) AS End FROM (SELECT DISTINCT Year FROM DB.myTable); QUIT;

Thanks!

Benjamen R. Meyer

P.S. With the removal of the SELECT DISTINCT query, the SQL is about as fast as the DATA step. Here's the timing: Real: 36.28 CPU: 5.50 Just for those curious.

> Here is a data step solution: > > data years; > set test end = done; /*substitute your data set name here*/ > retain minyear 0 maxyear 0; > minyear = minyear >< year; > maxyear = maxyear <> year; > if done then output; > run; > > And here is the equivelant SQL: > > CREATE TABLE YEARS AS > SELECT MIN(Year), Max(Year) > > After generating some test data containing duplicate years > from 1965 to > 2002, I ran all three (including the select with the > distinct) and got: > > With Select Distinct 09:58 Thursday, > July 11, 2002 > 1 > > Obs _TEMG001 _TEMG002 > > 1 1965 2002 > Without Select Distinct 09:58 Thursday, > July 11, 2002 > 2 > > Obs _TEMG001 _TEMG002 > > 1 1965 2002 > Data Step Version 09:58 Thursday, > July 11, 2002 > 3 > > Obs year minyear maxyear > > 1 2002 0 2002 > > The tiny data set I used didn't produce meaningful timings. > However, just > getting rid of the sort is the big winner. > > Jonathan Goldberg > Missouri Alcoholism Research Center > Washington University School of Medicine > 40 N. Kingshighway, Suite One > St. Louis, MO 63108 > 314-286-2212 > >

--

--

Diese E-Mail enthält vertrauliche und/oder rechtlich geschützte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese E-Mail irrtümlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Mail. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Mail ist nicht gestattet.

This e-mail may contain confidential and/or privileged information. If you are not the intended recipient (or have received this e-mail in error) please notify the sender immediately and destroy this e-mail. Any unauthorized copying, disclosure or distribution of the material in this e-mail is strictly forbidden.


Back to: Top of message | Previous page | Main SAS-L page