Thank you, David, for that very thorough reply. You are an amazing and
valuable resource on this list and I appreciate the time you spend
answering questions and solving problems.
The dataset in question is the California Health Interview Survey (CHIS)
public use file. Unfortunately it is a stratified file and they do not
include the STRATA variable in the public file. Argh! So now I am
trying to do it with SUDAAN. I've figured out how to generate some
numbers, but now the numbers I'm getting don't match the numbers that
the CHIS people themselves calculate.
I don't know if SUDAAN questions are kosher here, but on the off chance
they are...
Here is the code:
proc crosstab conf_lim= 95 data=F65plus filetype=sas design=jackknife;
weight rakedw0;
jackwgts rakedw1--rakedw80/adjjack= 1 ;
tables flushot*racedof;
class flushot racedof;
output /filename=work.table1 filetype=SAS tablecell=default replace;
run;
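For readers without SUDAAN, a comparable replicate-weight analysis can be sketched in SAS itself. This is only a sketch: it assumes a SAS release whose PROC SURVEYFREQ accepts VARMETHOD=JACKKNIFE and a REPWEIGHTS statement, and it assumes that JKCOEFS=1 plays the same role as ADJJACK=1 above. Both assumptions should be checked against the CHIS documentation before trusting the output.

```sas
/* Sketch only: assumes PROC SURVEYFREQ supports VARMETHOD=JACKKNIFE
   and the REPWEIGHTS statement in the SAS release being used */
proc surveyfreq data=F65plus varmethod=jackknife;
   weight rakedw0;                            /* full-sample raked weight */
   repweights rakedw1-rakedw80 / jkcoefs=1;   /* 80 replicate weights; coefficient 1 assumed to mirror ADJJACK=1 */
   tables flushot*racedof / cl alpha=.05;     /* 95% confidence limits for percents */
run;
```

With replicate weights supplied this way, no STRATA or CLUSTER statement is needed, which is the point of distributing replicate weights in a public use file.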
As an example of the discrepancies, the AskCHIS website (which allows
users to query the database through a nice web interface) gives this
info for Latinos who got a flu shot:
                       (weighted N)
71.7% (67.3 - 76.0)    405,000
and SUDAAN gives this for the same group:
71.7% (66.7 - 76.1)    405,329
They do round to the nearest thousand on the website, but I wouldn't
think there'd be rounding errors in the confidence intervals.
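One check that needs nothing beyond the four limits quoted above is how far each limit sits from the point estimate. A minimal data step (the dataset and variable names are made up for illustration) does the subtraction:

```sas
/* Compare the two intervals quoted above: distance from the estimate
   to each limit, and the total width. All values are in percent. */
data ci_check;
   input source $ est lower upper;
   below = est - lower;     /* estimate minus lower limit */
   above = upper - est;     /* upper limit minus estimate */
   width = upper - lower;   /* total interval width */
   datalines;
AskCHIS 71.7 67.3 76.0
SUDAAN  71.7 66.7 76.1
;
run;

proc print data=ci_check noobs;
run;
```

The AskCHIS limits sit about 4.4 below and 4.3 above the estimate, so that interval is symmetric to within rounding. The SUDAAN limits sit about 5.0 below and 4.4 above, and the interval is wider overall (9.4 versus 8.7). An asymmetric interval is what one would expect if the limits were computed on a transformed scale and back-transformed, and the extra width hints that the standard error or degrees of freedom may differ as well. Both are guesses to check against the SUDAAN and CHIS documentation, not confirmed explanations.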
Any ideas why the discrepancy? I have also written to the CHIS people
but so far, no answers.
Sarah Carroll, PhD
Research Coordinator
DHS - Immunization Branch, MS 7313
2151 Berkeley Way, Room 723E, Berkeley CA 94704
tel: 510.540.2484 fax: 510.883.6015
email: scarroll@dhs.ca.gov
-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
David L. Cassell
Sent: Wednesday, May 04, 2005 10:42 AM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: confidence intervals for percentages?
"Carroll, Sarah (DHS-DCDC-IMM)" <SCarroll@DHS.CA.GOV> replied:
> Hi Toby,
> thanks for your reply, but I can't figure out how the exact statement
> helps. I don't have 2x2 tables and I'm not calculating an odds ratio
> or anything.... Here's the code I am working on:
>
> proc freq data=everybody;
> weight rakedw0; by agegroup; /* agegroup has 2 levels */
> tables flushot*racedof; /* flushot is 1 for yes, 2 for
> no, racedof has 7 levels */
> run;
>
> I have fooled around with the exact statement, and tried this most
> recently:
> exact or /alpha=.05;
>
> Thanks for any advice.
Toby was thoughtful enough to give me a heads-up on this question, so I
thought I'd contribute my $0.02.
Of course, I have a *lot* of comments, some of which aren't all that
helpful. So let's get started.
[1] You have survey data, so you really need to be using PROC
SURVEYFREQ instead of PROC FREQ. The design effects have important
consequences here, so you can't ignore that you have a probability
design.
[2] You have raked weights. (I read the first post.) That tells me
that you or someone you trust has taken your data and developed new
weights using raking so that your weighted data should better mimic some
target population. You have to caveat this in your analysis and any
papers or presentations, because your sample frame may NOT be matching
up with the real intended target population.
Let me put this in a trivial but concrete form. If your real
target population is all voters in California, and you sample the adults
who can afford to have memberships at La Rancho Wealtho Country Club in
the most expensive part of Marin County, you may still have people under
thirty and black people in your sample. You may be able to rake the
sample so that your raked weights look right for proportion of people
who are black, and proportion of people who are under thirty. But when
your question is "Should we abolish the capital gains tax?" your answers
will NEVER be transferable to the real target population no matter how
you fiddle the weights. So you have to 'fess up and admit that you have
done raking and you are making an assumption that your sample is
equivalent to a probabilistic sample from the target population.
One thing you might try is running the analysis with the CORRECT
weights for your sample (the original design weights, before raking),
and then with the raked weights, and seeing if
there are important differences. If not, then at least the raking isn't
totally distorting the content of the sample data. You still can't
really correct for the unknown bias due to your target population not
aligning with the sample frame, but raking is about as close as you can
get to that goal.
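A minimal sketch of that comparison, assuming the dataset also carries the pre-raking design weight. The name basewt is hypothetical; rakedw0 is the raked weight from the original post.

```sas
/* Sketch only: basewt is a hypothetical name for the design weight
   before raking; substitute whatever the data file actually provides */
proc surveyfreq data=everybody;
   weight basewt;
   tables flushot*racedof / cl alpha=.05;
   title 'Design weights (before raking)';
run;

proc surveyfreq data=everybody;
   weight rakedw0;
   tables flushot*racedof / cl alpha=.05;
   title 'Raked weights';
run;
title;   /* clear the title so it does not carry over to later output */
```

If the percentages in the two tables differ by more than their confidence limits would suggest, the raking is doing a lot of work and the results deserve extra caution.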
Have I bored you into a coma yet? Okay, let's keep going...
[3] You may have complex design features which you had to ignore in
order to use PROC FREQ. Do you have a stratified sample? Do you have
clustering? Do you have a multi-stage sample? Do you know the actual
sampling rate for your sample, or the size of the sample frame used to
build the sample? You need these pieces of information in order to use
PROC SURVEYFREQ correctly.
[4] If you have survived to this point, then PROC SURVEYFREQ will solve
your problem for you. Here's your example, with a couple tweaks:
proc SURVEYfreq data=everybody TOTAL=54378 ;  /* frame size, written without a comma */
weight rakedw0;
by agegroup;                                  /* data must already be sorted by agegroup */
tables flushot*racedof / CL ALPHA=.05 ;
run;
I changed FREQ to SURVEYFREQ.
I assumed you could find that your real sample frame was 54,378
people, from which the sample was built. Find the correct sampling
rate or frame size, and use that instead of my number.
I assumed there were no strata and no clustering, which is something
that you'll have to determine. If you have either, then you need
to add additional statements.
I set the confidence intervals at 95%.
CL gives confidence intervals for percents. If you decide you'd
rather have confidence intervals for weighted frequencies, then
you should switch to the keyword CLWT instead.
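If the design does turn out to have strata or clusters, the additional statements would look roughly like this. The variable names stratum_id and psu_id are placeholders, not names from any real file, and TOTAL= is left off here, which means no finite population correction is applied.

```sas
/* Sketch only: stratum_id and psu_id are hypothetical variable names */
proc surveyfreq data=everybody;
   strata stratum_id;     /* stratification variable from the sample design */
   cluster psu_id;        /* primary sampling unit (first-stage cluster) */
   weight rakedw0;
   by agegroup;           /* data must already be sorted by agegroup */
   tables flushot*racedof / cl alpha=.05;
run;
```

To bring back the finite population correction in a stratified design, TOTAL= or RATE= would need stratum-level values, typically supplied as a dataset with one row per stratum.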
HTH,
David
--
David Cassell, CSC
Cassell.David@epa.gov
Senior computing specialist
mathematical statistician