| Date: | Fri, 5 Nov 2004 13:59:40 -0800 |
| Reply-To: | cassell.david@EPAMAIL.EPA.GOV |
| Sender: | "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU> |
| From: | "David L. Cassell" <cassell.david@EPAMAIL.EPA.GOV> |
| Subject: | Re: Weighted Standard Deviation Question |
| In-Reply-To: | <6b0fa07.0411041211.7adc135c@posting.google.com> |
| Content-type: | text/plain; charset=UTF-8 |
John <jstanmeyer@GMAIL.COM> wrote back:
> Whoops... my example wasn't the one I originally had problems with.
> Change the weight variable (Qty) from all 1's to all 100's, and the
> resulting weighted Standard Deviation is different. From some internet
> research I gather that I must first "normalize" the weights, but I
> have found conflicting instructions to normalize the weights such that
> the sum of the weights = 1, or to normalize the weights such that the
> sum of the weights = the number of data points. The latter would seem
> to give the same results as a non-weighted standard deviation if the
> weights are all the same, whether 1 or 100. Does the latter
> normalization approach make sense?
No, in many applications you should NOT 'normalize' the weights so
they add up to one. If you have a probability distribution, that
would make sense. If you have sample data, the weights have a real
physical meaning that you would trash by re-scaling them.
Where do your data come from? Why do you have weights of 100? What
does that '100' represent?
Let's make up an example. We'll take a population of prices, rounded
off to the nearest 100 (for some reason we'll make up as we go along :-)
and we have maybe 400 prices in the real population. We'd like to
know what the real population mean and standard deviation are, so we
take a sample of size 4. (A dumb choice, but we're cheap.) So here's
an important point we must maintain: THE WEIGHTS HAVE A REAL PHYSICAL
MEANING! In this case, each weight represents 100 real prices in the
population. We want an estimate of the population behavior, not the
sample behavior.
Now we run PROC MEANS as in your first case and get:
options nocenter nodate nonumber;
data test;
input Qty Price; cards;
100 100.00
100 200.00
100 300.00
100 400.00
;
run;
proc means data=test mean std;
var price;
run;
The MEANS Procedure
Analysis Variable : Price
       Mean         Std Dev
----------------------------
250.0000000     129.0994449
----------------------------
Okay, that is the sample behavior. The mean of the sample
does not have the same distribution as the mean of the
population. We need to fix that. And weights help us do that.
proc means data=test mean std;
var price;
weight qty;
run;
The MEANS Procedure
Analysis Variable : Price
       Mean         Std Dev
----------------------------
250.0000000         1290.99
----------------------------
Now the standard deviation is different... because it is the
standard deviation of something different! And the underlying
assumption is that we have a sample of size 4 from an infinite
population, so we really don't know much about that population.
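The two PROC MEANS results above can be checked by hand. Here is a minimal Python sketch, assuming PROC MEANS is using its default divisor of n - 1 (VARDEF=DF), where n is the number of observations and not the sum of the weights. The function name weighted_mean_std is made up for illustration:

```python
import math

def weighted_mean_std(values, weights):
    """Weighted mean and std dev with divisor n - 1 (n = number of observations)."""
    n = len(values)
    total_weight = sum(weights)
    mean = sum(w * x for w, x in zip(weights, values)) / total_weight
    # Weighted sum of squared deviations divided by n - 1. The divisor ignores
    # the weights, so rescaling the weights rescales the variance.
    variance = sum(w * (x - mean) ** 2 for w, x in zip(weights, values)) / (n - 1)
    return mean, math.sqrt(variance)

prices = [100.0, 200.0, 300.0, 400.0]

mean_1, std_1 = weighted_mean_std(prices, [1, 1, 1, 1])
mean_100, std_100 = weighted_mean_std(prices, [100, 100, 100, 100])

print(f"weights of 1:   mean={mean_1:.4f}  std={std_1:.7f}")
print(f"weights of 100: mean={mean_100:.4f}  std={std_100:.2f}")
print(f"ratio of std devs: {std_100 / std_1:.1f}")
```

Under that assumption, multiplying every weight by 100 leaves the mean at 250 but multiplies the variance by 100 and the standard deviation by 10, which is where 1290.99 comes from.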
But it still is not correct. We took a sample from a finite
population and failed to use the appropriate methodologies to
do the estimation. We need to use survey sampling approaches
here. So let's see what happens when we use the right PROC,
and information about the true population size:
proc surveymeans data=test mean std total=400;
var price;
weight qty;
run;
The SURVEYMEANS Procedure
Data Summary
Number of Observations      4
Sum of Weights            400
Statistics
                            Std Error
Variable          Mean        of Mean       Std Dev
-----------------------------------------------------
Price       250.000000      64.226163         25690
-----------------------------------------------------
Now we have a more accurate estimator, because we are using
correct assumptions and we provide knowledge about the size of
the population so we can add in a finite population correction
factor.
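The PROC SURVEYMEANS numbers can be reproduced the same way. This is a rough Python sketch, not the procedure's actual internals. It assumes a simple random sample drawn without replacement, equal weights, and that the "Std Dev" column is the standard deviation of the estimated population total. The variable names are mine:

```python
import math

prices = [100.0, 200.0, 300.0, 400.0]
population_size = 400            # the TOTAL=400 value given to PROC SURVEYMEANS
n = len(prices)

# Weights are all equal here, so the weighted mean is the simple mean.
mean = sum(prices) / n
sample_var = sum((x - mean) ** 2 for x in prices) / (n - 1)

# Finite population correction: 1 minus the sampling fraction n/N.
fpc = 1 - n / population_size

se_mean_infinite = math.sqrt(sample_var / n)        # no correction
se_mean_finite = math.sqrt(fpc * sample_var / n)    # with correction
sd_total = population_size * se_mean_finite         # std dev of estimated total

print(f"mean                 = {mean:.6f}")
print(f"std error, no fpc    = {se_mean_infinite:.6f}")
print(f"std error, with fpc  = {se_mean_finite:.6f}")
print(f"std dev of the total = {sd_total:.0f}")
```

With only 4 of 400 prices sampled, the correction factor is 0.99, so it trims the standard error only slightly. It would matter much more if the sample were a large share of the population.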
Bottom line: treat those weights with the respect they deserve.
Don't 'normalize' unless you understand why you should do it.
HTH,
David
--
David Cassell, CSC
Cassell.David@epa.gov
Senior computing specialist
mathematical statistician