LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (June 2006, week 3)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:   Fri, 16 Jun 2006 21:15:05 +0000
Reply-To:   iw1junk@COMCAST.NET
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   Ian Whitlock <iw1junk@COMCAST.NET>
Subject:   Re: Percentage of subgroups
Comments:   cc: Wensui Liu <liuwensui@GMAIL.COM>

Wensui,

Sorry, I thought I was clear. The error is in the *data* on the second line, the year in *my* test data is 2049 which is wrong, as you suggested spurious years can occur in databases. When I executed your code it produced

company delay2002 delay2003 ------------------------------ A 100.0% 66.67% B 100.0% 25.00%

The percentage for company A year 2002 should have been 50% not 100%. When I ran Toby's code on this data, I got

Obs company Year2002 Year2003 Year2049 1 A 100 66.6667 0 2 B 100 25.0000 .

This report is also wrong, but it has the distinct advantage that the reader can see there is an error in the data for company A, and therefor one has a chance to fix the data.

Programming for generality could not catch every error. For example, if the year on the second line had been 2003 when it should have been 2002 there would have been nothing obvious in the report to show the error. However, the original question was not about catching errors. I only mentioned it because you claimed:

Well, the good side of my way is that you won't include the funky year you don't want. It is not surprise to have year "2049" in many databases.

I would much rather be surprised by the funky year in Toby's report than by having a manager walk into my office and ask why I reported 100% delayed flights for company A in year 2002 when the marketing department estimated that 50% of the flights were on time in that year.

I agree that your code is not in error - on the correct data it produces the correct report. In fact, I did not make that claim. I only claimed that it was not an advantage to suppress dirty data in reports. Since you offered, handling a bad year in the data base, as a benefit of your code, I thought it important to show that it was a questionable benefit by running with a dirty year that you had suggested.

Toby and DATA _NULL_ made an argument for coding for generality, based on the issue of maintainability. After reading your claim, I wanted to show that another argument, based on catching some data base errors, could also be made.

Ian Whitlock ================= Date: Fri, 16 Jun 2006 15:06:20 -0400 Reply-To: Wensui Liu <liuwensui@GMAIL.COM> Sender: "SAS(r) Discussion" From: Wensui Liu <liuwensui@GMAIL.COM> Subject: Re: Percentage of subgroups Comments: To: iw1junk@comcast.net In-Reply-To: <061620061751.17657.4492EFB600020B51000044F9220588644205029A06CE9907@comcas t.net> Content-Type: text/plain; charset=ISO-8859-1; format=flowed could you please point out my error in a more explicit way rather than just saying 'it is wrong' or borrowing others' words? I really appreciate it. On 6/16/06, iw1junk@comcast.net <iw1junk@comcast.net> wrote: > > > Several respondents provided answers to Maarten's question on how to > calculate percentages. Wensui presented his code > > > > Here is a more elegent solution in SQL: > > > > proc sql; > > create table > > wensui as > > select > > company, > > sum(case when year = 2002 and delay ~= . then 1 else 0 end)/ > > sum(case when year = 2002 then 1 else 0 end) as delay2002 format > > = percent8.2, > > sum(case when year = 2003 and delay ~= . then 1 else 0 end)/ > > sum(case when year = 2003 then 1 else 0 end) as delay2003 format > > = percent8.2 > > from > > test > > group by > > company; > > quit; > > > and then defended it with the statement > > > >Well, the good side of my way is that you won't include the funky year > >you don't want. It is not surprise to have year "2049" in many > >databases. > > > To test his assertion I ran his code with > > data test; > input flight company $ year delay; > datalines; > 1 A 2002 13 > 2 A 2049 . > > 3 A 2003 5 > 4 A 2003 12 > 5 A 2003 . > 6 B 2002 17 > 7 B 2002 28 > 8 B 2003 32 > 9 B 2003 . > 10 B 2003 . > 11 B 2003 . > ; > > and found it produced > > company delay2002 delay2003 > ------------------------------ > A 100.0% 66.67% > > B 100.0% 25.00% > > > This result is wrong, but with no hint that there is an error in the > data. > > Either Toby or Yu's code would have pointed out the error. This gives > another reason, in addition to maintenance, why generality can be important. > > Thanks to Doug for a simple TABULATE solution. I thought that was the > way to go, but got miss lead by the other general approaches. > > Ian Whitlock

-- WenSui Liu (http://spaces.msn.com/statcompute/blog) Senior Decision Support Analyst Health Policy and Clinical Effectiveness Cincinnati Children Hospital Medical Center


Back to: Top of message | Previous page | Main SAS-L page