Date: Sun, 11 Feb 2007 22:25:57 -0800
Reply-To: David L Cassell <davidlcassell@MSN.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: David L Cassell <davidlcassell@MSN.COM>
Subject: Re: Comparing variable distribution between different groups
Content-Type: text/plain; format=flowed
hema_dave15@YAHOO.COM wrote back:
>hema <hema_dave15@YAHOO.COM> wrote:
> Hi all,
>How do i compare a variable distribution using proc freq with the CMH
>option between separate datasets...I don't know which dataset to use.
>one dataset is training dataset from the universe. other one is universe.
>So how do i compare the same variable in two different datasets
>Basically i want to check whether my sample is representative of other
>sample using certain variables one at a time.
> So I was told that the CMH option in PROC FREQ will do that for categorical
> variables. I want to do this for continuous variables also.
> Can anyone suggest what can be done to achieve this?
> Thanks in advance
Okay, I think you are doing the wrong thing here.
Unless you started out with a really lousy sampling plan, there is
not much point to this! If you used a lousy sampling plan, then you
will find lots of differences from your intended (target) population,
and you will not be able to correct for *all* of them. So you
would be better served by starting over with a proper sample.
If you start out with a really good sampling plan, but you try this
with, say, 100 separate variables, then guess what? At alpha=0.05,
assuming complete independence (which you wouldn't really get)
you would expect that about 5 variables would flag as different.
So are they really different? No, you're just seeing random variation
and the natural consequences of error rates. What if you get 3
variables significant? Or 8? How can we tell what the right cutoff
would be, when the variables are not going to be truly independent,
so that the 'assuming independence' number is not that helpful?
Answer: you're stuck without a lot more math and stats.
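
To put rough numbers on that arithmetic, here is a minimal sketch in Python
(standard library only). It assumes the idealized case described above: 100
tests, each at alpha=0.05, completely independent, with no real differences
anywhere. The names n_vars, alpha and binom_pmf are made up for illustration,
and because real variables are correlated, these figures are only a reference
point, not a cutoff to apply.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k 'significant' results out of n independent tests."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n_vars = 100   # number of variables tested (illustrative assumption)
alpha = 0.05   # per-test significance level

# Expected number of false flags when nothing truly differs.
expected_flags = n_vars * alpha

# Chance that at least one variable flags purely by chance.
p_at_least_one = 1 - (1 - alpha) ** n_vars

# Chance of seeing anywhere from 3 to 8 flags purely by chance.
p_3_to_8 = sum(binom_pmf(k, n_vars, alpha) for k in range(3, 9))

print(f"expected false flags: {expected_flags:.1f}")
print(f"P(at least one flag): {p_at_least_one:.4f}")
print(f"P(3 to 8 flags):      {p_3_to_8:.4f}")
```

Under these assumptions, getting 3 flags or 8 flags is entirely ordinary even
when the sample is fine, which is why a raw count of 'significant' variables
tells you so little. Once the variables are correlated, the binomial
calculation no longer applies at all.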
If you are trying to match *your* sample against someone else's
sample, then you have a lot more problems. But univariate approaches
are probably not the right way to go. In some cases, people pretend
that the sample would be fine if they just jiggled the weights a lot,
and they use what is called 'raking'. I don't like that, except in
So it would really help if you would explain in *detail* what you
are trying to do, and why, and what you mean by "representative
of other sample" here.
David L. Cassell
3115 NW Norwood Pl.
Corvallis OR 97330