LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (May 2007, week 2)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:         Sun, 13 May 2007 22:22:48 -0700
Reply-To:     David L Cassell <davidlcassell@MSN.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         David L Cassell <davidlcassell@MSN.COM>
Subject:      Re: how to calculate the median by groups
In-Reply-To:  <200705131824.l4DAkdgw013722@mailgw.cc.uga.edu>
Content-Type: text/plain; format=flowed

Howard expertly essayed: > >On Mon, 7 May 2007 22:26:07 -0700, David L Cassell <davidlcassell@MSN.COM> >wrote: > > >Huang.JS@PRINCIPAL.COM replied: > >[snip] > > > > >This works. > > > >BUT it does not scale well, and it is about the least efficient way to > >compute medians. Okay, second-least efficient: there's a way to do > >it in SQL that looks a lot less efficient. > > > >SAS has methods of computing medians (and other quantiles) that are > >state-of-the-art, and can even handle *huge* data sets. I recommend > >using the SAS procs designed for these types of processes. > > > >I'm still waiting to find out why Austina wants medians, and why she >wants > >medians in SQL... > > > >HTCT, > >David > >-- > >David L. Cassell > >mathematical statistician > >Design Pathways > >3115 NW Norwood Pl. > >Corvallis OR 97330 > >I agree with David. > >But perhaps medians in SQL are are worth exploring out of curiosity. Celko >has 8 different ways (SQL for Smarties, 2nd ed., pp. 353-365). I think it's >safe to say that they are all convoluted and all inefficient. > >So I set out to develop my own convoluted and inefficient PROC SQL code to >compute medians. > >Test data: > > data test; > input groupID $ measure; > cards; > a 1.1 > a 2.1 > a 3.1 > a 4.1 > b 1.1 > b 2.1 > b 3.1 > c 101 > d . > d . > ; > >The first task is to get the half-populations: > > proc sql; > > create table halfpoints as > select groupid, floor(count(measure) / 2) as halfpoint > from test > group by groupid; > >Result: > > groupID halfpoint > ------------------- > a 2 > b 1 > c 0 > d 0 > >Next, get the cumulative frequencies, in both directions: > > create view first as > select measures.groupid > , measures.measure > , sum(measures.measure > test.measure) as lohi > , sum(measures.measure < test.measure) as hilo > from (select distinct groupid, measure from test) as measures > join > test > on measures.groupid=test.groupid > group by measures.groupid, measures.measure; > >Result: > > groupID measure lohi hilo > -------------------------------------- > a 1.1 0 3 > a 2.1 1 2 > a 3.1 2 1 > a 4.1 3 0 > b 1.1 0 2 > b 2.1 1 1 > b 3.1 2 0 > c 101 0 0 > d . 0 0 > >Now derive what might be called "candidate medians": > > create view second as > select groupid > , case when lohi <= (select halfpoint > from halfpoints > where groupid=first.groupid) > then measure > else . > end as lohi, > case when hilo <= (select halfpoint > from halfpoints > where groupid=first.groupid) > then measure > else . > end as hilo > from first; > >Result: > > groupID lohi hilo > ---------------------------- > a 1.1 . > a 2.1 2.1 > a 3.1 3.1 > a . 4.1 > b 1.1 . > b 2.1 2.1 > b . 3.1 > c 101 101 > d . . > >Finally, pick the extreme values and average them: > > create table medians as > select groupid, mean(max(lohi),min(hilo) ) as Median > from second > group by groupid; > >Result: > > groupID Median > ------------------ > a 2.6 > b 2.1 > c 101 > d . > >Or, get the same answers by rolling all of the views into one big >statement: > > create table medians as > select groupid > , mean( max(case when lohi <= (select halfpoint > from halfpoints > where groupid=outer.groupid) > then measure > else . > end) > , min(case when hilo <= (select halfpoint > from halfpoints > where groupid=outer.groupid) > then measure > else . > end) > ) as median > from (select measures.groupid > , measures.measure > , sum(measures.measure > test.measure) as lohi > , sum(measures.measure < test.measure) as hilo > from (select distinct groupid, measure from test) as measures > join > test > on measures.groupid=test.groupid > group by measures.groupid, measures.measure > ) as outer > group by groupid > ; > > quit;

Yowsers!

Can you do a weighted median too? I spent some time last summer trying to do that in PROC SQL, without enough satisfactory results to use it in production code.

Your HALFPOINT variable would have to be

> proc sql; > > create table halfpoints as > select groupid, sum(SamplingWeight)/2 as halfpoint > from test > group by groupid;

And you would still have to check whether the halfpoint falls right in between two distinct values, evne if that is now a much smaller likelihood.

David -- David L. Cassell mathematical statistician Design Pathways 3115 NW Norwood Pl. Corvallis OR 97330

_________________________________________________________________ More photos, more messages, more storage—get 2GB with Windows Live Hotmail. http://imagine-windowslive.com/hotmail/?locale=en-us&ocid=TXT_TAGHM_migration_HM_mini_2G_0507


Back to: Top of message | Previous page | Main SAS-L page