|
Wensui,
Sorry, I thought I was clear. The error is in the *data* on the second
line, the year in *my* test data is 2049 which is wrong, as you suggested
spurious years can occur in databases. When I executed your code it
produced
company delay2002 delay2003
------------------------------
A 100.0% 66.67%
B 100.0% 25.00%
The percentage for company A year 2002 should have been 50% not 100%. When
I ran Toby's code on this data, I got
Obs company Year2002 Year2003 Year2049
1 A 100 66.6667 0
2 B 100 25.0000 .
This report is also wrong, but it has the distinct advantage that the
reader can see there is an error in the data for company A, and therefor
one has a chance to fix the data.
Programming for generality could not catch every error. For example, if
the year on the second line had been 2003 when it should have been 2002
there would have been nothing obvious in the report to show the error.
However, the original question was not about catching errors. I only
mentioned it because you claimed:
Well, the good side of my way is that you won't include the funky year
you don't want. It is not surprise to have year "2049" in many
databases.
I would much rather be surprised by the funky year in Toby's report than by
having a manager walk into my office and ask why I reported 100% delayed
flights for company A in year 2002 when the marketing department estimated
that 50% of the flights were on time in that year.
I agree that your code is not in error - on the correct data it produces
the correct report. In fact, I did not make that claim. I only claimed
that it was not an advantage to suppress dirty data in reports. Since you
offered, handling a bad year in the data base, as a benefit of your code, I
thought it important to show that it was a questionable benefit by running
with a dirty year that you had suggested.
Toby and DATA _NULL_ made an argument for coding for generality, based on
the issue of maintainability. After reading your claim, I wanted to show
that another argument, based on catching some data base errors, could also
be made.
Ian Whitlock
=================
Date: Fri, 16 Jun 2006 15:06:20 -0400
Reply-To: Wensui Liu <liuwensui@GMAIL.COM>
Sender: "SAS(r) Discussion"
From: Wensui Liu <liuwensui@GMAIL.COM>
Subject: Re: Percentage of subgroups
Comments: To: iw1junk@comcast.net
In-Reply-To:
<061620061751.17657.4492EFB600020B51000044F9220588644205029A06CE9907@comcas
t.net>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
could you please point out my error in a more explicit way rather than
just saying 'it is wrong' or borrowing others' words?
I really appreciate it.
On 6/16/06, iw1junk@comcast.net <iw1junk@comcast.net> wrote:
>
>
> Several respondents provided answers to Maarten's question on how to
> calculate percentages. Wensui presented his code
>
>
> > Here is a more elegent solution in SQL:
> >
> > proc sql;
> > create table
> > wensui as
> > select
> > company,
> > sum(case when year = 2002 and delay ~= . then 1 else 0 end)/
> > sum(case when year = 2002 then 1 else 0 end) as delay2002 format
> > = percent8.2,
> > sum(case when year = 2003 and delay ~= . then 1 else 0 end)/
> > sum(case when year = 2003 then 1 else 0 end) as delay2003 format
> > = percent8.2
> > from
> > test
> > group by
> > company;
> > quit;
>
>
> and then defended it with the statement
>
>
> >Well, the good side of my way is that you won't include the funky year
> >you don't want. It is not surprise to have year "2049" in many
> >databases.
>
>
> To test his assertion I ran his code with
>
> data test;
> input flight company $ year delay;
> datalines;
> 1 A 2002 13
> 2 A 2049 .
>
> 3 A 2003 5
> 4 A 2003 12
> 5 A 2003 .
> 6 B 2002 17
> 7 B 2002 28
> 8 B 2003 32
> 9 B 2003 .
> 10 B 2003 .
> 11 B 2003 .
> ;
>
> and found it produced
>
> company delay2002 delay2003
> ------------------------------
> A 100.0% 66.67%
>
> B 100.0% 25.00%
>
>
> This result is wrong, but with no hint that there is an error in the
> data.
>
> Either Toby or Yu's code would have pointed out the error. This gives
> another reason, in addition to maintenance, why generality can be
important.
>
> Thanks to Doug for a simple TABULATE solution. I thought that was the
> way to go, but got miss lead by the other general approaches.
>
> Ian Whitlock
--
WenSui Liu
(http://spaces.msn.com/statcompute/blog)
Senior Decision Support Analyst
Health Policy and Clinical Effectiveness
Cincinnati Children Hospital Medical Center
|