LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (March 2000, week 2)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Mon, 13 Mar 2000 22:27:49 GMT
Reply-To:     joe_celko@MY-DEJA.COM
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         joe_celko@MY-DEJA.COM
Organization: Deja.com - Before you buy.
Subject:      Re: Celko's median using SQL

>> I tried the similar query on my own table. It took several hours to run and return 0 rows. Any idea <<

Show me the code.

>> Also, is it easy to create a function similar to AVG, SUM, etc.? <<

Why write them? Those function are built into SQL. This is taken from SQL FOR SMARTIES, second edition. Maybe it will hlep you see your problem.

23.2.8 Celko's Third Median

Another approach made easier with SQL-92 involves looking at a picture of a line of sorted values and seeing where the median would fall. Every value in column weight of the table partitions the table into three sections, the values which are less than weight, equal to weight or greater than weight. We can get a profile of each value with a tabular subquery expression.

Now the question is how to define a median in terms of the partitions. Clearly, the definition of a median means that if (lesser = greater) then weight is the median.

Now look at figure 23.?? for the other situations. If there are more greater values than half the size of the table, then weight cannot be a median. Likewise, if there are more lesser values than half the size of the table, then weight cannot be a median. If (lesser + equal) = greater, then weight is a left hand median. Likewise, if (greater + equal) = lesser, then weight is a right hand median. However, if weight is the median, then both lesser and greater have to have tallies less than half the size of the table. That translates into the following SQL.

SELECT AVG(DISTINCT weight) FROM (SELECT P1.pno, P1.weight, SUM(CASE WHEN P2.weight < P1.weight THEN 1 ELSE 0 END), SUM(CASE WHEN P2.weight = P1.weight THEN 1 ELSE 0 END), SUM(CASE WHEN P2.weight > P1.weight THEN 1 ELSE 0 END) FROM Parts AS P1, Parts AS P2 GROUP BY P1.pno, P1.weight) AS Partitions (pno, weight, lesser, equal, greater) WHERE lesser = greater OR (lesser <= (SELECT COUNT(*) FROM Parts)/2.0 AND greater <= (SELECT COUNT(*) FROM Parts)/2.0);

The reason for not expanding the VIEW in the FROM clause into a tabular subquery expression is that the table can be used for other partitions of the table, such as quintiles.

It is also worth noting that you can use either AVG(DISTINCT i) or AVG (i) in the SELECT clause. The AVG(DISTINCT i) will return the usual median when there are two values. This happens when you have an even number of rows and a partition in the middle, such as (1, 2, 2, 3, 3, 3) which has (2, 3) in the middle, which gives us 2.5 for the median. The AVG(i) will return the weighted median instead. This happens when you also factor in the number of times the two values in the middle of a table with an even number of rows. The table with (1, 2, 2, 3, 3, 3) would return (2, 2, 3, 3, 3) in the middle, which gives us 2.6 for the weighted median. The weighted median is a more accurate description of the data.

I sent this first attempt to Richard Romley, who invented the method of first working with groups when designing a query. It made it quite a bit simpler, but let me take you thru the steps so you can see the reasoning.

Look at the WHERE clause. It could use some algebra and since it deals only with aggregate functions and scalar subqueries, you could move it into a HAVING clause. Moving things from the WHERE clause into the HAVING clause in a grouped query is important for performance, but it is not always possible.

But first let's do some algebra on the expression in the WHERE clause.

lesser <= (SELECT COUNT(*) FROM Parts)/2.0

Since we already have lesser, equal, and greater for every row in the derived table Partitions, and since the sum of lesser, equal, and greater must always be exactly equal to the total number of rows in the Parts table, we can replace the scalar subquery with this expression:

lesser <= (lesser + equal + greater)/2.0

But this is the same as:

2.0 * lesser <= lesser + equal + greater

which becomes:

2.0 * lesser - lesser <= equal + greater

which becomes:

lesser <= equal + greater

So the query becomes:

SELECT AVG(DISTINCT weight) FROM (SELECT P1.pno, P1.weight, SUM(CASE when P2.weight < P1.weight THEN 1 ELSE 0 END), SUM(CASE when P2.weight = P1.weight THEN 1 ELSE 0 END), SUM(CASE when P2.weight > P1.weight THEN 1 ELSE 0 END) FROM Parts AS P1, Parts AS P2 GROUP BY P1.pno, P1.weight) AS Partitions (pno, weight, lesser, equal, greater) WHERE lesser = greater OR (lesser <= equal + greater AND greater <= equal + lesser);

Still looking at the WHERE clause, we can rewrite it with DeMorgan's law.

WHERE lesser = greater OR (equal >= lesser - greater AND equal >= greater - lesser)

But this is the same as:

WHERE lesser = greater OR equal >= ABS(lesser - greater)

But if the first condition was true (lesser = greater), the second must necessarily also be true (i.e. equal >= 0), so the first clause is redundant and can be eliminated completely.

WHERE equal >= ABS(lesser - greater)

So much for algebra. Instead of a WHERE clause operating on the columns of the derived table, why not perform the same test as a HAVING clause on the inner query which derives Partitions? This eliminates all but one column from the derived table, it will run lots faster, and it simplifies the query down to this:

SELECT AVG(DISTINCT weight) FROM (SELECT P1.weight FROM Parts AS P1, Parts AS P2 GROUP BY P1.pno, P1.weight HAVING SUM(CASE WHEN P2.weight = P1.weight THEN 1 ELSE 0 END) >= ABS(SUM(CASE WHEN P2.weight < P1.weight THEN 1 WHEN P2.weight > P1.weight THEN -1 ELSE 0 END))) AS Partitions;

If you prefer to use functions instead of a CASE expression, then use this version of the query:

SELECT AVG(DISTINCT weight) FROM (SELECT P1.weight FROM Parts AS P1, Parts AS P2 GROUP BY P1.pno, P1.weight HAVING SUM(ABS(1 - SIGN(P1.weight - P2.weight)) >= ABS(SUM(SIGN (P1.weight - P2.weight))) AS Partitions;

--CELKO--

Sent via Deja.com http://www.deja.com/ Before you buy.


Back to: Top of message | Previous page | Main SAS-L page