Date: Wed, 22 Aug 2007 22:51:10 -0700
Reply-To: David L Cassell <davidlcassell@MSN.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: David L Cassell <davidlcassell@MSN.COM>
Subject: Re: test non-paired dependent samples
Content-Type: text/plain; format=flowed
shenjie@STAT.UCLA.EDU wrote back:
>David L Cassell wrote:
>>>I'm looking for an appropriate statistical test for the difference in
>>>dependent but not paired samples. I've got an internet survey data
>>>and want to compare it with national survey data. I used proc
>>>surveymeans to get summary statistics (n, mean, and standard error)
>>>of the national survey data. I want to compare variables like age
>>>(continuous), education (categorical) in our survey data, A, versus
>>>those in the national survey data, B. However, A is a subset of the
>>>larger population B. They are not independent but not paired like
>>>before vs after in the same population. So it's like we are comparing
>>>sample A vs. (population B - sample A). Can anyone recommend a
>>>suitable statistical test for this? Thank you.
>>First off, I don't think these are 'dependent' samples. If your
>>national database is so large that you are sure Sample A is a subset
>>of it, then that national database ought to be close to a census of
>>the target population. In that case, you have a standard,
>>against which you want to compare a sample.
>>Second, internet survey data are usually.. umm.. how can I put this...
>>Sorry, my politeness quotient ran out. :-)
>>Internet survey data are *not* a sample survey. You cannot define the
>>real target population (which is not *your* target population but some
>>unknowable subset of it plus possibly some additional people who are
>>not part of your real population but cannot be separated). You cannot
>>evaluate multiple hits by a single user. You cannot assess sampling
>>weights because internet surveys are self-selecting, with balancing
>>components that make the self-selection likelihoods change over the
>>course of the sample period. If you don't believe this, then look at
>>the results of the All-Star game online balloting for Major League
>>Baseball.
>>I don't see the point of comparing here. Are you trying to show that
>>the internet numbers do not match the national numbers? If so, then
>>you are first going to have to gen up some artificial sampling weights
>>for the internet survey and pretend that they are valid. Then you can
>>treat the 'census' as a fixed constant and use sample survey analysis
>>to see if you hit the target.
>>Sorry to be so discouraging,
>>David L. Cassell
>Thank you for your reply and help. Let me be more specific. I am looking
>forward to your input if you have a bit more patience left.
>My sample is a clinical group. Participants are recruited through
>Polimetrix (www.polimetrix.com). They need to have certain demographic
>characteristics (gender, age, ethnicity/race, and education). I want to
>see if their health condition (categorized from 1 to 5, with 1 for poor and
>5 for excellent), stratified by White, Black, Spanish, and others, is
>comparable to or worse than national survey data, for example,
>MEPS, BRFSS. I think my sample should be a subset of national survey
>data. I want the target population of my study to have characteristic
>distributions close to the US census data, so I created sampling weights
>according to, say, the joint distribution of gender, age, and race.
>I think I need to do a two-way anova here to see if there is an overall
>health condition difference between our sample vs. census, stratified by
>different races. If there is, then use multtest to do comparisons by
>race. But here with national survey data, I don't know how to combine
>these two data sets at the individual level. Or maybe I can get around that by
>just using their summary statistics to do the comparison?
First, there's no way to tell if your web survey data are a subset of
BRFSS respondents or not, so you'll never be able to assess the
independence/dependence of the data points.
Second, a survey like MEPS or BRFSS has a careful sample design with
sampling weights that you can use to get point estimates. The data
from the Polimetrix survey do *not*, and cannot. There is no way
to assess true sampling weights without a defined target population,
a defined sample survey design, etc... So you're stuck. You cannot
take the web survey data and get population estimates because you
cannot compute legitimate sampling weights.
You can fudge and pretend that you have equal sampling weights
for the web survey sample, but there is no way to correct any
biases that this introduces. No, you can't fix it by raking, or by
bootstrapping, or anything. Really. Because you cannot control
the non-response rates of uncontrolled, unknowable subsets of your
real target population without completely re-doing that web survey
in a way that gives you an effective sampling design. So you can't
even tell if you have all of your target population properly sampled,
or if you accidentally got portions not covered by the BRFSS.
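
To make the raking point concrete, here is a minimal sketch in plain
Python (not SAS) of iterative proportional fitting on a two-way table.
The function name, the counts, and the target margins are all
hypothetical, made up for illustration. It shows that raking only forces
the weighted sample to agree with whatever margins you feed it. Nothing
in the procedure knows about who chose not to respond, so it cannot
repair self-selection on anything you did not measure.

```python
def rake(cell_counts, row_targets, col_targets, iterations=50):
    """Iterative proportional fitting on a 2-way table of sample counts.

    Returns a table of fitted cell totals whose row and column sums
    match the supplied target margins (assuming all cells are positive).
    """
    rows = len(cell_counts)
    cols = len(cell_counts[0])
    fitted = [[float(v) for v in row] for row in cell_counts]
    for _ in range(iterations):
        # Scale each row so its total matches the row target.
        for i in range(rows):
            total = sum(fitted[i])
            if total > 0:
                factor = row_targets[i] / total
                fitted[i] = [v * factor for v in fitted[i]]
        # Scale each column so its total matches the column target.
        for j in range(cols):
            total = sum(fitted[i][j] for i in range(rows))
            if total > 0:
                factor = col_targets[j] / total
                for i in range(rows):
                    fitted[i][j] *= factor
    return fitted


# Hypothetical web-sample counts: gender (rows) by age group (columns).
sample_counts = [[80, 20], [60, 40]]
# Hypothetical population shares for the same margins (each sums to 1).
row_targets = [0.49, 0.51]
col_targets = [0.55, 0.45]

fitted = rake(sample_counts, row_targets, col_targets)

# Per-respondent weight in each cell: fitted total / observed count.
weights = [
    [fitted[i][j] / sample_counts[i][j] for j in range(2)]
    for i in range(2)
]
```

The weights depend only on the observed counts and the chosen margins.
Two web samples with identical gender-by-age counts but very different
health profiles would receive exactly the same weights.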
I recommend that you try some simple descriptive stats and graphical
data analysis without attempting to do the theoretical analysis parts.
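
As one example of what "simple descriptive stats" might look like, here
is a small Python sketch (again, not SAS) that puts the share of each
health rating in the web sample next to a published national share. The
ratings and the national shares below are invented placeholders, not
real MEPS or BRFSS figures, and the helper name is hypothetical. It
reports differences only; it makes no claim about significance.

```python
from collections import Counter


def category_shares(ratings, categories):
    """Return the unweighted share of each category in a list of ratings."""
    counts = Counter(ratings)
    n = len(ratings)
    return {c: counts.get(c, 0) / n for c in categories}


categories = [1, 2, 3, 4, 5]  # 1 = poor ... 5 = excellent

# Hypothetical web-survey ratings for one race/ethnicity group.
web_ratings = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5]

# Placeholder national shares, treated as fixed published numbers.
national_shares = {1: 0.05, 2: 0.10, 3: 0.30, 4: 0.35, 5: 0.20}

web_shares = category_shares(web_ratings, categories)

# Side-by-side table of shares and their raw difference.
for c in categories:
    diff = web_shares[c] - national_shares[c]
    print(f"rating {c}: web {web_shares[c]:.2f}  "
          f"national {national_shares[c]:.2f}  diff {diff:+.2f}")
```

Repeating this for each group and plotting the two sets of shares as
side-by-side bars gives a picture of where the web sample differs,
without pretending the web sample supports population inference.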
Sorry to be such a downer,
David L. Cassell
3115 NW Norwood Pl.
Corvallis OR 97330