LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (April 2005, week 4)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Wed, 27 Apr 2005 14:47:52 -0700
Reply-To:     paul.dorfman@FCSO.COM
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         "Dorfman, Paul" <paul.dorfman@FCSO.COM>
Organization: http://groups.google.com
Subject:      Re: Compound indexes and proc sql?
Comments: To: sas-l@uga.edu
In-Reply-To:  <slrnd6soo0.fhv.rs@mwilson.umlcoop.net>
Content-Type: text/plain; charset="iso-8859-1"

Rusty,

There is nothing ugly about your code - it is a sensible SAS-centric way of asking the software exactly what you need based on the specific knowledge about data organization (i.e. the presence of the index). Au contraire, when you sumbit an SQL query, you also submit to the mercy of the SQL optimizer. Luckily, in the scenario you have decribed, it is likely to choose the fast way, albeit it will be still a bit slower than the Data step. Let us give it a test shot:

28 data driver (keep = k1 k2) indexed (index = (kk = (k1 k2)) keep = k1 k2 data) ; 29 do _n_ = 1 to 1e6 ;

30 k1 ++ ranuni (1) ;

31 k2 ++ ranuni (1) ;

32 data ++ 1 ;

33 if ranuni (1) < 0.01 then output driver ;

34 output indexed ;

35 end ;

36 run ;

NOTE: The data set USER.DRIVER has 10030 observations and 2 variables.

NOTE: The data set USER.INDEXED has 1000000 observations and 3 variables. INFO: Multiple concurrent threads will be used to create the index.

NOTE: Composite index kk has been defined.

NOTE: DATA statement used (Total process time):

real time 3.31 seconds

cpu time 4.52 seconds

37

38 data _null_ ;

39 set driver ;

40 set indexed key = kk ;

41 if _iorc_ = 0 then output ;

42 else _error_ = 0 ;

43 run ;

NOTE: There were 10030 observations read from the data set USER.DRIVER.

NOTE: DATA statement used (Total process time):

real time 0.17 seconds

cpu time 0.17 seconds

44

45 option msglevel = i ;

46

47 proc sql _method ;

48 create table _null_ as

49 select x.k1, x.k1, x.data

50 from driver d, indexed x

51 where x.k1 = d.k1

52 and x.k2 = d.k2

53 ;

INFO: Index kk of SQL table USER.INDEXED (alias = X) selected for SQL WHERE clause (join) optimization.

NOTE: SQL execution methods chosen are:

sqxcrta

sqxjndx

sqxsrc( USER.DRIVER(alias = D) )

sqxsrc( USER.INDEXED(alias = X) )

54 quit ; NOTE: PROCEDURE SQL used (Total process time): real time 0.36 seconds cpu time 0.22 seconds

The SQXJNDX (and the preceding message in the log) indicate that the optimizer has elected to use the index. By varying the cutoff in the expression ranuni (1) < 0.01, I found that the optimized switched to SQXSORT access method between 0.13 and 0.14. Since almost every possible detail of this sort can now be found in the monumental (and the first of a kind) paper on the SAS SQL optimizer presented at SUGI 30 by Russel Lavery. Here I will only note that the workings and logic of the SAS optimizer differ quite radically from those in DB2 and Oracle - as it should for at least two reasons. On one hand, at sequential read, SAS is faster than either to the point of disbelief. For example, in one of my real-life production scenarios, about 10m rows are read from an UDB table (under AIX), and it takes 27:16.38 real time. SAS sorts the result (i.e. reads, sorts it, and writes) in 1:47.98. Such a disparity may explain why the SAS optimizer seem to lean heavily towards sequential processing. On the other hand, I suspect that since consumers of SAS SQL are mostly "users" rather than, say, highly SQL-versed DBAs, a decision cold have been made to optimize joins rather than subqueries. In both DB2 and Oracle, a subquery on the scenario above are much faster than an equivalent join. For example, in UDB, the subquery below executed about 10 times faster than the join shown above:

proc sql _method ; create table _null_ as select x.k1, x.k1, data from indexed x where exists (select 1 from driver d where x.k1 = d.k1 and x.k2 = d.k2) ; quit ;

Among DBAs, it is considered a staple test on efficiency proficiency. However, the SAS optimizer would not even think of optimizing stuff like this. At least, I do not have enough patience to wait until this ends, while the join runs almost instantaneously in the scant 3 seconds. Interestingly, I have created two test tables just like above in DB2 and ran the subquery above against them on the same server. It took about 7 seconds to finish, i.e. twice as long as the SAS index join.

Kind regards, --------------------------- Paul M. Dorfman Jacksonville, FL ---------------------------


Back to: Top of message | Previous page | Main SAS-L page