Date: Mon, 13 Mar 2000 22:27:49 GMT
Reply-To: joe_celko@MY-DEJA.COM
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: joe_celko@MY-DEJA.COM
Organization: Deja.com - Before you buy.
Subject: Re: Celko's median using SQL
>> I tried the similar query on my own table. It took several hours to
run and return 0 rows. Any idea <<
Show me the code.
>> Also, is it easy to create a function similar to AVG, SUM, etc.? <<
Why write them? Those function are built into SQL. This is taken from
SQL FOR SMARTIES, second edition. Maybe it will hlep you see your
problem.
23.2.8 Celko's Third Median
Another approach made easier with SQL-92 involves looking at a picture
of a line of sorted values and seeing where the median would fall.
Every value in column weight of the table partitions the table into
three sections, the values which are less than weight, equal to weight
or greater than weight. We can get a profile of each value with a
tabular subquery expression.
Now the question is how to define a median in terms of the partitions.
Clearly, the definition of a median means that if (lesser = greater)
then weight is the median.
Now look at figure 23.?? for the other situations. If there are more
greater values than half the size of the table, then weight cannot be a
median. Likewise, if there are more lesser values than half the size
of the table, then weight cannot be a median. If (lesser + equal) =
greater, then weight is a left hand median. Likewise, if (greater +
equal) = lesser, then weight is a right hand median. However, if
weight is the median, then both lesser and greater have to have tallies
less than half the size of the table. That translates into the
following SQL.
SELECT AVG(DISTINCT weight)
FROM (SELECT P1.pno, P1.weight,
SUM(CASE WHEN P2.weight < P1.weight
THEN 1 ELSE 0 END),
SUM(CASE WHEN P2.weight = P1.weight
THEN 1 ELSE 0 END),
SUM(CASE WHEN P2.weight > P1.weight
THEN 1 ELSE 0 END)
FROM Parts AS P1, Parts AS P2
GROUP BY P1.pno, P1.weight)
AS Partitions (pno, weight, lesser, equal, greater)
WHERE lesser = greater
OR (lesser <= (SELECT COUNT(*) FROM Parts)/2.0
AND greater <= (SELECT COUNT(*) FROM Parts)/2.0);
The reason for not expanding the VIEW in the FROM clause into a tabular
subquery expression is that the table can be used for other partitions
of the table, such as quintiles.
It is also worth noting that you can use either AVG(DISTINCT i) or AVG
(i) in the SELECT clause. The AVG(DISTINCT i) will return the usual
median when there are two values. This happens when you have an even
number of rows and a partition in the middle, such as (1, 2, 2, 3, 3,
3) which has (2, 3) in the middle, which gives us 2.5 for the median.
The AVG(i) will return the weighted median instead. This happens when
you also factor in the number of times the two values in the middle of
a table with an even number of rows. The table with (1, 2, 2, 3, 3, 3)
would return (2, 2, 3, 3, 3) in the middle, which gives us 2.6 for the
weighted median. The weighted median is a more accurate description of
the data.
I sent this first attempt to Richard Romley, who invented the method of
first working with groups when designing a query. It made it quite a
bit simpler, but let me take you thru the steps so you can see the
reasoning.
Look at the WHERE clause. It could use some algebra and since it deals
only with aggregate functions and scalar subqueries, you could move it
into a HAVING clause. Moving things from the WHERE clause into the
HAVING clause in a grouped query is important for performance, but it
is not always possible.
But first let's do some algebra on the expression in the WHERE clause.
lesser <= (SELECT COUNT(*) FROM Parts)/2.0
Since we already have lesser, equal, and greater for every row in the
derived table Partitions, and since the sum of lesser, equal, and
greater must always be exactly equal to the total number of rows in the
Parts table, we can replace the scalar subquery with this expression:
lesser <= (lesser + equal + greater)/2.0
But this is the same as:
2.0 * lesser <= lesser + equal + greater
which becomes:
2.0 * lesser - lesser <= equal + greater
which becomes:
lesser <= equal + greater
So the query becomes:
SELECT AVG(DISTINCT weight)
FROM (SELECT P1.pno, P1.weight,
SUM(CASE when P2.weight < P1.weight
THEN 1 ELSE 0 END),
SUM(CASE when P2.weight = P1.weight
THEN 1 ELSE 0 END),
SUM(CASE when P2.weight > P1.weight
THEN 1 ELSE 0 END)
FROM Parts AS P1, Parts AS P2
GROUP BY P1.pno, P1.weight)
AS Partitions (pno, weight, lesser, equal, greater)
WHERE lesser = greater
OR (lesser <= equal + greater
AND greater <= equal + lesser);
Still looking at the WHERE clause, we can rewrite it with DeMorgan's
law.
WHERE lesser = greater
OR (equal >= lesser - greater
AND equal >= greater - lesser)
But this is the same as:
WHERE lesser = greater
OR equal >= ABS(lesser - greater)
But if the first condition was true (lesser = greater), the second must
necessarily also be true (i.e. equal >= 0), so the first clause is
redundant and can be eliminated completely.
WHERE equal >= ABS(lesser - greater)
So much for algebra. Instead of a WHERE clause operating on the
columns of the derived table, why not perform the same test as a HAVING
clause on the inner query which derives Partitions? This eliminates
all but one column from the derived table, it will run lots faster, and
it simplifies the query down to this:
SELECT AVG(DISTINCT weight)
FROM (SELECT P1.weight
FROM Parts AS P1, Parts AS P2
GROUP BY P1.pno, P1.weight
HAVING SUM(CASE WHEN P2.weight = P1.weight
THEN 1 ELSE 0 END)
>= ABS(SUM(CASE WHEN P2.weight < P1.weight THEN 1
WHEN P2.weight > P1.weight THEN -1
ELSE 0 END)))
AS Partitions;
If you prefer to use functions instead of a CASE expression, then use
this version of the query:
SELECT AVG(DISTINCT weight)
FROM (SELECT P1.weight
FROM Parts AS P1, Parts AS P2
GROUP BY P1.pno, P1.weight
HAVING SUM(ABS(1 - SIGN(P1.weight - P2.weight))
>= ABS(SUM(SIGN (P1.weight - P2.weight)))
AS Partitions;
--CELKO--
Sent via Deja.com http://www.deja.com/
Before you buy.