LISTSERV at the University of Georgia
Date:   Wed, 4 May 2005 13:01:51 -0700
Reply-To:   "Carroll, Sarah (DHS-DCDC-IMM)" <SCarroll@DHS.CA.GOV>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   "Carroll, Sarah (DHS-DCDC-IMM)" <SCarroll@DHS.CA.GOV>
Subject:   Re: confidence intervals for percentages?
Comments:   To: cassell.david@EPAMAIL.EPA.GOV
Content-Type:   text/plain; charset="us-ascii"

Thank you, David, for that very thorough reply. You are an amazing and valuable resource on this list and I appreciate the time you spend answering questions and solving problems.

The dataset in question is the California Health Interview Survey (CHIS) public use file. Unfortunately it is a stratified file and they do not include the STRATA variable in the public file. Argh! So now I am trying to do it with SUDAAN. I've figured out how to generate some numbers, but now the numbers I'm getting don't match the numbers that the CHIS people themselves calculate.

I don't know if SUDAAN questions are kosher here, but on the off chance they are... Here is the code:

proc crosstab conf_lim=95 data=F65plus filetype=sas design=jackknife;
  weight rakedw0;
  jackwgts rakedw1--rakedw80 / adjjack=1;
  tables flushot*racedof;
  class flushot racedof;
  output / filename=work.table1 filetype=SAS tablecell=default replace;
run;

As an example of the discrepancies, the AskCHIS website (which allows users to query the database through a nice web interface) gives this for Latinos who got a flu shot:

71.7% (67.3 - 76.0), weighted N 405,000

and SUDAAN gives the corresponding result as:

71.7% (66.7 - 76.1), weighted N 405,329

They do round to the nearest thousand on the website, but I wouldn't think there'd be rounding errors in the confidence intervals.

Any ideas why the discrepancy? I have also written to the CHIS people but so far, no answers.

Sarah Carroll, PhD
Research Coordinator
DHS - Immunization Branch, MS 7313
2151 Berkeley Way, Room 723E, Berkeley CA 94704
tel: 510.540.2484
fax: 510.883.6015
email: scarroll@dhs.ca.gov

-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of David L. Cassell
Sent: Wednesday, May 04, 2005 10:42 AM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: confidence intervals for percentages?

"Carroll, Sarah (DHS-DCDC-IMM)" <SCarroll@DHS.CA.GOV> replied:
> Hi Toby,
> thanks for your reply, but I can't figure out how the exact statement
> helps. I don't have 2x2 tables and I'm not calculating an odds ratio or
> anything.... Here's the code I am working on:
>
> proc freq data=everybody;
> weight rakedw0; by agegroup; /* agegroup has 2 levels */
> tables flushot*racedof; /* flushot is 1 for yes, 2 for
> no, racedof has 7 levels */
> run;
>
> I have fooled around with the exact statement, and tried this most
> recently:
> exact or /alpha=.05;
>
> Thanks for any advice.

Toby was thoughtful enough to give me a heads-up on this question, so I thought I'd contribute my $0.02.

Of course, I have a *lot* of comments, some of which aren't all that helpful. So let's get started.

[1] You have survey data, so you really need to be using PROC SURVEYFREQ instead of PROC FREQ. The design effects have important consequences here, so you can't ignore that you have a probability design.

[2] You have raked weights. (I read the first post.) That tells me that you or someone you trust has taken your data and developed new weights using raking so that your weighted data should better mimic some target population. You have to caveat this in your analysis and any papers or presentations, because your sample frame may NOT be matching up with the real intended target population.

Let me put this in a trivial but concrete form. If your real target population is all voters in California, and you sample the adults who can afford to have memberships at La Rancho Wealtho Country Club in the most expensive part of Marin County, you may still have people under thirty and black people in your sample. You may be able to rake the sample so that your raked weights look right for proportion of people who are black, and proportion of people who are under thirty. But when your question is "Should we abolish the capital gains tax?" your answers will NEVER be transferable to the real target population no matter how you fiddle the weights.

So you have to 'fess up and admit that you have done raking and you are making an assumption that your sample is equivalent to a probabilistic sample from the target population. One thing you might try is running the analysis with the CORRECT weights for your sample, and then with the raked weights, and seeing if there are important differences. If not, then at least the raking isn't totally distorting the content of the sample data. You still can't really correct for the unknown bias due to your target population not aligning with the sample frame, but raking is about as close as you can get to that goal.
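
A rough sketch of that two-run comparison, under assumptions: BASEWT is a hypothetical name for the original (pre-raking) design weight, which may or may not exist in your file under some other name, and the strata and clustering are left out here just as in the example in [4].

```sas
/* Sketch only: BASEWT is a hypothetical variable name for the   */
/* unraked design weight; substitute whatever your file calls it. */

/* Run 1: original design weights */
proc surveyfreq data=everybody;
  weight basewt;
  tables flushot*racedof / cl;
run;

/* Run 2: raked weights */
proc surveyfreq data=everybody;
  weight rakedw0;
  tables flushot*racedof / cl;
run;
```

If the weighted percents and their confidence limits from the two runs are close, that is at least some reassurance that the raking is not reshaping the results.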

Have I bored you into a coma yet? Okay, let's keep going...

[3] You may have complex design features which you had to ignore in order to use PROC FREQ. Do you have a stratified sample? Do you have clustering? Do you have a multi-stage sample? Do you know the actual sampling rate for your sample, or the size of the sample frame used to build the sample? You need these pieces of information in order to use PROC SURVEYFREQ correctly.

[4] If you have survived to this point, then PROC SURVEYFREQ will solve your problem for you. Here's your example, with a couple tweaks:

proc SURVEYfreq data=everybody TOTAL=54378;
  weight rakedw0;
  by agegroup;
  tables flushot*racedof / CL ALPHA=.05;
run;

I changed FREQ to SURVEYFREQ. I assumed you could find that your real sample frame was 54,378 people, from which the sample was built. Find the correct sampling rate or frame size, and use that instead of my number.

I assumed there were no strata and no clustering, which is something that you'll have to determine. If you have either, then you need to add STRATA and/or CLUSTER statements.

I set the confidence intervals at 95% with ALPHA=.05. CL gives confidence intervals for percents. If you decide you'd rather have confidence intervals for weighted frequencies, then you should switch to the keyword CLWT instead.
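
For the stratified or clustered case, a minimal sketch of what those additional statements might look like. STRATUM and PSU are hypothetical variable names, not anything known to be in this file. TOTAL= is left off here on purpose: with strata, the frame sizes generally have to be supplied per stratum rather than as one overall number, and without TOTAL= or RATE= no finite population correction is applied.

```sas
/* Sketch only: STRATUM and PSU are hypothetical variable names. */

/* BY-group processing expects the data sorted by the BY variable */
proc sort data=everybody;
  by agegroup;
run;

proc surveyfreq data=everybody;
  strata stratum;   /* hypothetical stratum identifier            */
  cluster psu;      /* hypothetical cluster (PSU) identifier      */
  weight rakedw0;
  by agegroup;
  tables flushot*racedof / cl alpha=.05;
run;
```

Drop whichever of the STRATA or CLUSTER statements does not apply to the design.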

HTH,
David
--
David Cassell, CSC
Cassell.David@epa.gov
Senior computing specialist
mathematical statistician

