Date: Mon, 22 Jun 1998 17:00:54 -0400
Reply-To: WHITLOI1 <WHITLOI1@WESTAT.COM>
Sender: "SAS(r) Discussion" <SAS-L@UGA.CC.UGA.EDU>
From: WHITLOI1 <WHITLOI1@WESTAT.COM>
Subject: Re[2]: Moving average in SQL -- Optimization for large datas
Content-Type: text/plain; charset=US-ASCII
Subject: Re: Moving average in SQL -- Optimization for large datasets
Summary: It is worth increasing the complexity of the problem to
replace a cartesian join with an equi-join.
Respondent: Ian Whitlock <whitloi1@westat.com>
Karsten M. Self <kmself@IX.NETCOM.COM> desired to improve the
performance of some SQL code to calculate statistics for all near by
records and thus produce a "moving average" dependent on the distance
from a given record rather than a fixed number of records. His
original solution was too slow.
> /* Moving stats in SQL */
> proc sql _method;
> create table mstat as
> select
> left.b,
> count(*) as n,
> mean( right.x ) as mx,
> std( right.x ) as sx,
> mean( right.y ) as my,
> std( right.y ) as sy
> from foo as left, foo as right
> where right.b - left.b le 10 and right.b gt left.b
> group by left.b
> order by left.b
> ;
> quit;
The problem with the above code that it involves subsetting a
Cartesian join which grows to a tremendous size with a moderately
sized data set FOO. For example, 20,000 * 20,000 is 400 million and
that is beginning to get large while 20,000 is quite modest in size.
A reflexive join is essential for this problem, since the given point
of interest must be determined by a record in one data set and the
group of nearby records must come from an independent copy of this
set. The answer is not to remove the join, but to turn it into a more
palatable equi-join. The technique was originally given on SAS-L by
Howard Schreier in 1993 in answer to a question about how to compare
function values F(X) and F(Y) when X is near Y. The method is well
worth learning, since the question arises in a number of places quite
naturally and the solution is far from intuitively obvious, because it
introduces another set to the join.
Here is the code.
data offset ;
do offset = 0 , 10 ; output ; end ;
run ;
> /* Moving stats in SQL */
> proc sql _method;
> create table mstat as
> select
> left.b,
> count(*) as n,
> mean( right.x ) as mx,
> std( right.x ) as sx,
> mean( right.y ) as my,
> std( right.y ) as sy
> from foo as left,
offset ,
> foo as right
where round ( left.b ) + offset = round ( right.b )
> and right.b - left.b le 10 and right.b gt left.b
> group by left.b
> order by left.b
> ;
> quit;
The extra WHERE condition
round ( left.b ) + offset = round ( right.b )
placed first does not change the result, since it includes every pair
satisfied by the second condition, but it introduces another method of
solution for the SQL compiler (round then sort and merge) instead of
(check all pairs). For 20,000 records with fewer statistics the faster
solution took around 6 seconds, while the slower took over 20 minutes!
It might help to see how this happened. The set of records defined by
the second condition is rather complex in comparison to the simple
first condition that the rounds be equal or next to each other. The
set detrmined by rounding is less discriminating, but it can be
quickly constructed and easily subset to the desired result. Hence it
is far more efficient. Paul Kent reviewed the technique in a paper
that first appeared at SESUG 1994.
Incidentally the order of the WHERE clause parts is essential in 6.12.
This shows that it is still important to order conditions to help the
SQL compiler while explaining "The beauty of SQL is that it doesn't
matter."
Karsten's "Untested code (from memory, sorry)" improvement may be good
enough for some purposes, but it cannot be right, since it merely
groups records about 0, 10, 20, etc. and calculates statistics for
each group with the identifying center of the group shifted by 5.
> /* SQL data aggregating algorithm */
> proc sql _method;
> create table mstat as
> select
> min(b) as b,
> /* 2nd argument to ROUND is smoothing block. Shift result
> by
> * 0.5 of this value to select midpoint
> */
> round( b, 10 ) +5 as midpoint,
> mean(b) as mb,
> std(b) as sb,
> count(*) as n,
> mean( x ) as mx,
> std( x ) as sx,
> mean( y ) as my,
> std( y ) as sy
> from foo
> group by calculated midpoint,
> order by calculated b
> ;
> quit;
Instead of
> round( b, 10 ) +5 as midpoint,
Karsten probably had in mind
round( b + 5 , 10 ) - 5 as midpoint,
which would actually group about 5, 10, 15 etc. instead of providing
these labels as misleading group identifiers.
Ian Whitlock