```
Date:         Mon, 22 Jun 1998 17:00:54 -0400
Reply-To:     WHITLOI1
Sender:       "SAS(r) Discussion"
From:         WHITLOI1
Subject:      Re[2]: Moving average in SQL -- Optimization for large datasets
Comments:     To: "Karsten M. Self"
Content-Type: text/plain; charset=US-ASCII

Summary:    It is worth increasing the complexity of the problem to replace
            a Cartesian join with an equi-join.
Respondent: Ian Whitlock

Karsten M. Self wanted to improve the performance of some SQL code that
calculates statistics over all nearby records, producing a "moving average"
that depends on the distance from a given record rather than on a fixed
number of records.  His original solution was too slow.

> /* Moving stats in SQL */
> proc sql _method;
>    create table mstat as
>    select
>       left.b,
>       count(*)        as n,
>       mean( right.x ) as mx,
>       std( right.x )  as sx,
>       mean( right.y ) as my,
>       std( right.y )  as sy
>    from foo as left, foo as right
>    where right.b - left.b le 10 and right.b gt left.b
>    group by left.b
>    order by left.b
>    ;
> quit;

The problem with this code is that it subsets a Cartesian join, which grows
to a tremendous size even for a moderately sized data set FOO.  For example,
20,000 * 20,000 is 400 million pairs, and that is beginning to get large,
while 20,000 records is quite modest.

A reflexive join is essential for this problem, since the given point of
interest must be determined by a record in one data set while the group of
nearby records must come from an independent copy of that set.  The answer
is not to remove the join, but to turn it into a more palatable equi-join.
The technique was originally given on SAS-L by Howard Schreier in 1993, in
answer to a question about how to compare function values F(X) and F(Y)
when X is near Y.  The method is well worth learning, since the question
arises quite naturally in a number of places, and the solution is far from
intuitively obvious because it introduces another set into the join.
Here is the code.

   data offset ;
      do offset = 0 , 10 ;
         output ;
      end ;
   run ;

> /* Moving stats in SQL */
> proc sql _method;
>    create table mstat as
>    select
>       left.b,
>       count(*)        as n,
>       mean( right.x ) as mx,
>       std( right.x )  as sx,
>       mean( right.y ) as my,
>       std( right.y )  as sy
>    from foo as left, offset, foo as right
>    where round ( left.b , 10 ) + offset = round ( right.b , 10 )
>      and right.b - left.b le 10 and right.b gt left.b
>    group by left.b
>    order by left.b
>    ;
> quit;

The extra WHERE condition

   round ( left.b , 10 ) + offset = round ( right.b , 10 )

placed first does not change the result, since it includes every pair
satisfied by the second condition, but it offers the SQL compiler another
method of solution (round, then sort and merge) instead of (check all
pairs).  For 20,000 records, with fewer statistics, the faster solution
took around 6 seconds while the slower took over 20 minutes!

It might help to see how this happened.  The set of record pairs defined by
the second condition is rather complex compared with the simple first
condition that the values, rounded to the nearest 10, be equal or adjacent.
The set determined by rounding is less discriminating, but it can be
constructed quickly and then easily subset to the desired result.  Hence it
is far more efficient.  Paul Kent reviewed the technique in a paper that
first appeared at SESUG 1994.

Incidentally, the order of the WHERE clause parts is essential in 6.12.
This shows that it is still important to order conditions to help the SQL
compiler, whatever is said about "The beauty of SQL is that it doesn't
matter."

Karsten's "untested code (from memory, sorry)" improvement may be good
enough for some purposes, but it cannot be right, since it merely groups
records about 0, 10, 20, etc. and calculates statistics for each group,
with the identifying center of each group shifted by 5.

> /* SQL data aggregating algorithm */
> proc sql _method;
>    create table mstat as
>    select
>       min(b) as b,
>       /* 2nd argument to ROUND is smoothing block.  Shift result by
>        * 0.5 of this value to select midpoint
>        */
>       round( b, 10 ) + 5 as midpoint,
>       mean(b)   as mb,
>       std(b)    as sb,
>       count(*)  as n,
>       mean( x ) as mx,
>       std( x )  as sx,
>       mean( y ) as my,
>       std( y )  as sy
>    from foo
>    group by calculated midpoint
>    order by calculated b
>    ;
> quit;

Instead of

>       round( b, 10 ) + 5 as midpoint,

Karsten probably had in mind

        round( b + 5 , 10 ) - 5 as midpoint,

which would actually group the records about 5, 15, 25, etc. instead of
merely attaching those numbers as misleading labels to groups centered at
0, 10, 20, etc.

Ian Whitlock
```
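Ian's offset trick translates outside SAS as well.  Here is a minimal
Python sketch of the idea (not Ian's code; it assumes, as the offsets 0 and
10 require, that the ROUND in his WHERE clause rounds to the nearest
multiple of the window width, i.e. `round(b, 10)` in SAS terms).  It checks
that joining only buckets whose rounded keys differ by 0 or 10 recovers
exactly the pairs the original windowed WHERE clause keeps:

```python
import math
import random
from collections import defaultdict

WIDTH = 10  # window width from the post


def round_to(x, unit=WIDTH):
    # SAS-style round(x, unit): nearest multiple of unit
    return math.floor(x / unit + 0.5) * unit


def brute_force_pairs(values):
    # the subset Cartesian join: every (left, right) with 0 < right - left <= 10
    return {(l, r) for l in values for r in values if 0 < r - l <= WIDTH}


def bucketed_pairs(values):
    # the equi-join: only rows whose rounded keys differ by 0 or 10
    # (the OFFSET table) can possibly satisfy the window condition
    buckets = defaultdict(list)
    for v in values:
        buckets[round_to(v)].append(v)
    pairs = set()
    for key, lefts in buckets.items():
        for offset in (0, WIDTH):            # data offset; do offset = 0, 10;
            for l in lefts:
                for r in buckets.get(key + offset, ()):
                    if 0 < r - l <= WIDTH:   # the original WHERE condition
                        pairs.add((l, r))
    return pairs


random.seed(1998)
b = [random.uniform(0, 500) for _ in range(400)]
assert bucketed_pairs(b) == brute_force_pairs(b)
```

The rounding unit has to equal the window width: with unit 10, two values
at most 10 apart can have rounded keys differing only by 0 or 10, which is
exactly what the two rows of the OFFSET table cover.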

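The midpoint slip in Karsten's code is easy to see numerically.  A small
sketch (Python standing in for the two SAS expressions, with a SAS-style
round-to-nearest helper; the function names are mine, for illustration)
compares them on a value near a group boundary:

```python
import math


def round_to(x, unit=10):
    # SAS-style round(x, unit): nearest multiple of unit
    return math.floor(x / unit + 0.5) * unit


def midpoint_as_posted(b):
    # round(b, 10) + 5 : groups are centered at 0, 10, 20, ...
    # but labeled 5, 15, 25, ...
    return round_to(b, 10) + 5


def midpoint_intended(b):
    # round(b + 5, 10) - 5 : groups are genuinely centered at 5, 15, 25, ...
    return round_to(b + 5, 10) - 5


# b = 9.9 exposes the difference: it is only 4.9 away from midpoint 5,
# yet the posted formula files it under the label 15
assert midpoint_as_posted(9.9) == 15
assert midpoint_intended(9.9) == 5
```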