Date: Fri, 5 Nov 2004 13:59:40 -0800
From: "David L. Cassell" <cassell.david@EPAMAIL.EPA.GOV>
List: SAS(r) Discussion (SAS-L)
Subject: Re: Weighted Standard Deviation Question
To: John
In-Reply-To: <6b0fa07.0411041211.7adc135c@posting.google.com>
Content-Type: text/plain; charset=UTF-8

John <jstanmeyer@GMAIL.COM> wrote back:

> Whoops... my example wasn't the one I originally had problems with.
> Change the weight variable (Qty) from all 1's to all 100's, and the
> resulting weighted Standard Deviation is different. From some internet
> research I gather that I must first "normalize" the weights, but I
> have found conflicting instructions: to normalize the weights such that
> the sum of the weights = 1, or to normalize the weights such that the
> sum of the weights = the number of data points. The latter would seem
> to give the same results as a non-weighted standard deviation if the
> weights are all the same, whether 1 or 100. Does the latter
> normalization approach make sense?

No, in many applications you should NOT 'normalize' the weights so they add up to one. If you have a probability distribution, that would make sense. If you have sample data, the weights have a real physical meaning that you would trash by re-scaling them.
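To see concretely what each rescaling does to the numbers, here is a quick Python sketch (not SAS, and the `weighted_std` helper is purely illustrative) of the weighted standard deviation PROC MEANS computes with its default VARDEF=DF divisor, applied to the four prices below with the three weighting schemes John mentions:

```python
import math

def weighted_std(values, weights):
    """Weighted std dev the way PROC MEANS computes it (default VARDEF=DF)."""
    wmean = sum(w * x for w, x in zip(weights, values)) / sum(weights)
    ss = sum(w * (x - wmean) ** 2 for w, x in zip(weights, values))
    return math.sqrt(ss / (len(values) - 1))

prices = [100.0, 200.0, 300.0, 400.0]
n = len(prices)

print(round(weighted_std(prices, [100] * n), 2))   # raw weights of 100 -> 1290.99
print(round(weighted_std(prices, [1] * n), 7))     # weights sum to n   -> 129.0994449
print(round(weighted_std(prices, [0.25] * n), 4))  # weights sum to 1   -> 64.5497
```

So the two "normalizations" give two different answers, and neither matches the raw weights; which one (if any) is right depends entirely on what the weights mean.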

Where do your data come from? Why do you have weights of 100? What does that '100' represent?

Let's make up an example. We'll take a population of prices, rounded off to the nearest 100 (for some reason we'll make up as we go along :-) and we have maybe 400 prices in the real population. We'd like to know what the real population mean and standard deviation are, so we take a sample of size 4. (A dumb choice, but we're cheap.) So here's an important point we must maintain: THE WEIGHTS HAVE A REAL PHYSICAL MEANING! In this case, each weight represents 100 real prices in the population. We want an estimate of the population behavior, not the sample behavior.

Now we run PROC MEANS as in your first case and get:

options nocenter nodate nonumber;

data test;
   input Qty Price;
   cards;
100 100.00
100 200.00
100 300.00
100 400.00
;
run;

proc means data=test mean std;
   var price;
run;

The MEANS Procedure

Analysis Variable : Price

            Mean         Std Dev
    -----------------------------
     250.0000000     129.0994449
    -----------------------------

Okay, that is the sample behavior. The mean of the sample does not have the same distribution as the mean of the population. We need to fix that. And weights help us do that.
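As a quick sanity check on that 129.0994449 (plain Python, not part of the original post):

```python
import statistics

prices = [100.0, 200.0, 300.0, 400.0]
# statistics.stdev uses the n - 1 divisor, matching PROC MEANS' default VARDEF=DF
print(round(statistics.stdev(prices), 7))  # 129.0994449
```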

proc means data=test mean std;
   var price;
   weight qty;
run;

The MEANS Procedure

Analysis Variable : Price

            Mean         Std Dev
    -----------------------------
     250.0000000         1290.99
    -----------------------------

Now the standard deviation is different, because it is the standard deviation of something different! And the underlying assumption is that we have a sample of size 4 from an infinite population, so we really don't know much about that population.
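The factor of ten is no accident: with equal weights w and the default VARDEF=DF divisor, the weighted variance is exactly w times the unweighted variance, so the standard deviation scales by sqrt(w) = sqrt(100) = 10. A Python sketch of that arithmetic (illustrative only, not SAS):

```python
import math

prices = [100.0, 200.0, 300.0, 400.0]
w = 100  # every observation carries the same weight

mean = sum(prices) / len(prices)
# SAS default VARDEF=DF: divide the (weighted) sum of squares by n - 1
weighted_var = sum(w * (x - mean) ** 2 for x in prices) / (len(prices) - 1)
unweighted_var = sum((x - mean) ** 2 for x in prices) / (len(prices) - 1)

print(round(math.sqrt(weighted_var), 2))               # 1290.99
print(math.isclose(weighted_var, w * unweighted_var))  # True
```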

But it still is not correct. We took a sample from a finite population and failed to use the appropriate methodologies to do the estimation. We need to use survey sampling approaches here. So let's see what happens when we use the right PROC, and information about the true population size:

proc surveymeans data=test mean std total=400;
   var price;
   weight qty;
run;

The SURVEYMEANS Procedure

Data Summary

Number of Observations           4
Sum of Weights                 400

Statistics

                                  Std Error
    Variable          Mean          of Mean       Std Dev
    ------------------------------------------------------
    Price       250.000000        64.226163         25690
    ------------------------------------------------------

Now we have a more accurate estimator, because we are using correct assumptions and we provide knowledge about the size of the population so we can add in a finite population correction factor.
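You can reproduce that standard error by hand. Under simple random sampling without replacement, the estimated standard error of the mean is sqrt((1 - n/N) * s^2 / n), where (1 - n/N) is the finite population correction. A Python check (illustrative, not SAS):

```python
import math

s = 129.0994449   # unweighted sample standard deviation from PROC MEANS
n, N = 4, 400     # sample size and the TOTAL= population size

fpc = 1 - n / N                      # finite population correction
se_mean = math.sqrt(fpc * s**2 / n)
print(round(se_mean, 6))             # 64.226163
```

Note how the fpc shrinks the standard error relative to the infinite-population value s/sqrt(n) = 64.5497, because sampling 4 of 400 tells us slightly more than sampling 4 of infinitely many.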

Bottom line: treat those weights with the respect they deserve. Don't 'normalize' unless you understand why you should do it.

HTH,
David
--
David Cassell, CSC
Cassell.David@epa.gov
Senior computing specialist
mathematical statistician
