Date: Wed, 27 Apr 2005 14:47:52 -0700
Reply-To: paul.dorfman@FCSO.COM
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "Dorfman, Paul" <paul.dorfman@FCSO.COM>
Organization: http://groups.google.com
Subject: Re: Compound indexes and proc sql?
In-Reply-To: <slrnd6soo0.fhv.rs@mwilson.umlcoop.net>
Content-Type: text/plain; charset="iso-8859-1"
Rusty,
There is nothing ugly about your code - it is a sensible SAS-centric
way of asking the software exactly what you need based on the specific
knowledge about data organization (i.e. the presence of the index). Au
contraire, when you sumbit an SQL query, you also submit to the mercy
of the SQL optimizer. Luckily, in the scenario you have decribed, it is
likely to choose the fast way, albeit it will be still a bit slower
than the Data step. Let us give it a test shot:
28 data driver (keep = k1 k2) indexed (index = (kk = (k1 k2)) keep =
k1 k2 data) ;
29 do _n_ = 1 to 1e6 ;
30 k1 ++ ranuni (1) ;
31 k2 ++ ranuni (1) ;
32 data ++ 1 ;
33 if ranuni (1) < 0.01 then output driver ;
34 output indexed ;
35 end ;
36 run ;
NOTE: The data set USER.DRIVER has 10030 observations and 2 variables.
NOTE: The data set USER.INDEXED has 1000000 observations and 3
variables.
INFO: Multiple concurrent threads will be used to create the index.
NOTE: Composite index kk has been defined.
NOTE: DATA statement used (Total process time):
real time 3.31 seconds
cpu time 4.52 seconds
37
38 data _null_ ;
39 set driver ;
40 set indexed key = kk ;
41 if _iorc_ = 0 then output ;
42 else _error_ = 0 ;
43 run ;
NOTE: There were 10030 observations read from the data set USER.DRIVER.
NOTE: DATA statement used (Total process time):
real time 0.17 seconds
cpu time 0.17 seconds
44
45 option msglevel = i ;
46
47 proc sql _method ;
48 create table _null_ as
49 select x.k1, x.k1, x.data
50 from driver d, indexed x
51 where x.k1 = d.k1
52 and x.k2 = d.k2
53 ;
INFO: Index kk of SQL table USER.INDEXED (alias = X) selected for SQL
WHERE clause (join) optimization.
NOTE: SQL execution methods chosen are:
sqxcrta
sqxjndx
sqxsrc( USER.DRIVER(alias = D) )
sqxsrc( USER.INDEXED(alias = X) )
54 quit ;
NOTE: PROCEDURE SQL used (Total process time):
real time 0.36 seconds
cpu time 0.22 seconds
The SQXJNDX (and the preceding message in the log) indicate that the
optimizer has elected to use the index. By varying the cutoff in the
expression ranuni (1) < 0.01, I found that the optimized switched to
SQXSORT access method between 0.13 and 0.14. Since almost every
possible detail of this sort can now be found in the monumental (and
the first of a kind) paper on the SAS SQL optimizer presented at SUGI
30 by Russel Lavery. Here I will only note that the workings and logic
of the SAS optimizer differ quite radically from those in DB2 and
Oracle - as it should for at least two reasons. On one hand, at
sequential read, SAS is faster than either to the point of disbelief.
For example, in one of my real-life production scenarios, about 10m
rows are read from an UDB table (under AIX), and it takes 27:16.38 real
time. SAS sorts the result (i.e. reads, sorts it, and writes) in
1:47.98. Such a disparity may explain why the SAS optimizer seem to
lean heavily towards sequential processing. On the other hand, I
suspect that since consumers of SAS SQL are mostly "users" rather than,
say, highly SQL-versed DBAs, a decision cold have been made to optimize
joins rather than subqueries. In both DB2 and Oracle, a subquery on the
scenario above are much faster than an equivalent join. For example, in
UDB, the subquery below executed about 10 times faster than the join
shown above:
proc sql _method ;
create table _null_ as
select x.k1, x.k1, data
from indexed x
where exists (select 1 from driver d where x.k1 = d.k1 and x.k2 =
d.k2)
;
quit ;
Among DBAs, it is considered a staple test on efficiency proficiency.
However, the SAS optimizer would not even think of optimizing stuff
like this. At least, I do not have enough patience to wait until this
ends, while the join runs almost instantaneously in the scant 3
seconds. Interestingly, I have created two test tables just like above
in DB2 and ran the subquery above against them on the same server. It
took about 7 seconds to finish, i.e. twice as long as the SAS index
join.
Kind regards,
---------------------------
Paul M. Dorfman
Jacksonville, FL
---------------------------