Date: Thu, 3 Feb 2000 20:04:07 -0300
Sender: "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
From: "Hector E. Maletta" <hmaletta@OVERNET.COM.AR>
Subject: Re: Weighting cases...
SPSS weights cases as a matter of course. In every SPSS data file there
is a hidden variable called $WEIGHT, whose default value is 1 for all
cases. For every statistical procedure, SPSS first multiplies sample
values by the value of $WEIGHT, then proceeds with the rest of the
computation.
The default value of $WEIGHT can be altered by designating any variable as
a weighting variable. To do this, you should have an appropriate
weighting variable in your file (on which more below). Once you have it,
you weight by that variable by going to the DATA - WEIGHT CASES menu
option, or (if you prefer) by issuing the following command in a syntax
window:
WEIGHT BY X.
(where X stands for your weighting variable).
The weighting remains in force until you replace X by some other
variable, or you return to the default weighting by means of the command
WEIGHT OFF (or the appropriate choice in the same menu option).
Now to the weights.
Ordinarily, such weights are the reciprocals of sampling ratios. If your
sample for the ith category is n(i) and the size of the corresponding
population is N(i), the sampling ratio is n(i)/N(i) and the weight is
X(i) = N(i)/n(i).
If you adopt this X as your weighting variable, all the observed values
will be multiplied by X. Thus, any frequency table or crosstabulation,
for instance, would yield a total of N=sum(N(i)) instead of n=sum(n(i)).
This also applies to other procedures such as regression: each case
counts for X(i) cases. This approach implies that the sample is
representative, i.e., that results from the sample can be extended to
the rest of the population within each subpopulation considered (where
"subpopulation" means here "population items with the same sampling
probability"). If, on the contrary, non response were an effect of some
special characteristic of non respondents, then the sample of
respondents would not be representative of non respondents. In other
words, this applies only to random sampling. I do not know whether this
is your case.
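As a rough illustration of the arithmetic (in Python rather than SPSS
syntax, purely to check the numbers), the sketch below computes the
expansion weight X(i) = N(i)/n(i) for each grade, using the enrolment and
return figures from the quoted message. Treating the six grades as the
"categories" is an assumption made for this example, and the variable
names are invented for it.

```python
# Enrolment N(i) and returned surveys n(i) per grade, from the quoted message.
# Assumption: the six grades play the role of the sampling "categories".
population = {7: 395, 8: 346, 9: 390, 10: 583, 11: 527, 12: 471}
sample = {7: 346, 8: 284, 9: 324, 10: 417, 11: 381, 12: 316}

# Expansion weight X(i) = N(i) / n(i), the reciprocal of the sampling ratio.
weights = {g: population[g] / sample[g] for g in population}

# Each of the n(i) cases counts as X(i) cases, so the weighted count is N(i).
weighted_counts = {g: sample[g] * weights[g] for g in sample}

n_total = sum(sample.values())
N_total = sum(population.values())
weighted_total = sum(weighted_counts.values())

for g in sorted(weights):
    print(f"grade {g:>2}: X = {weights[g]:.4f}")
print(f"unweighted n = {n_total}, weighted total = {weighted_total:.1f}, "
      f"N = {N_total}")
```

With these figures the weights run from about 1.14 (7th grade) to about
1.49 (12th grade), and the weighted total equals N = 2712 rather than
n = 2068.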
If your stratifying variable (in your case, the so-called "categories")
captures the main sources of variation in your variables, you need not
bother yourself with "weighting for each variable".
A final word on statistical significance. SPSS computes statistical
significance based on the WEIGHTED number of cases. If you apply a
weighting variable such as X, the weighted number of cases is expanded
from n to N. Consequently, SPSS is fooled into believing that your
sample is larger than it actually is, and thus yields an overestimate of
the true significance (or an underestimate of the sampling error).
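To get a feel for the size of this effect, here is a small Python sketch
(not SPSS syntax). It assumes the textbook simple-random-sampling formula
for the standard error of a proportion and uses the totals from the
quoted message (n = 2068 returned surveys, N = 2712 students). The
function name and the proportion of 0.5 are illustrative choices only.

```python
import math

n_total = 2068  # actual number of returned surveys
N_total = 2712  # weighted number of cases under expansion weights


def se_of_proportion(p, count):
    """Standard error of a proportion under simple random sampling."""
    return math.sqrt(p * (1.0 - p) / count)


p = 0.5  # illustrative proportion, chosen arbitrarily
se_actual = se_of_proportion(p, n_total)    # based on the real sample size
se_inflated = se_of_proportion(p, N_total)  # based on the weighted count

# Ratio of the understated error to the one the sample size supports.
ratio = se_inflated / se_actual
print(f"SE with n: {se_actual:.5f}")
print(f"SE with N: {se_inflated:.5f}")
print(f"ratio: {ratio:.3f}")
```

The ratio is sqrt(n/N), about 0.87 here, so under these assumptions the
sampling error would be understated by roughly 13 percent. With lower
response or sampling rates the distortion grows accordingly.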
SPSS cannot offer a thorough solution to this problem, since all
its procedures assume the data come from a simple random sampling process.
For complex samples involving variable sampling ratios, the proper
software is WesVar Complex Samples (also distributed by SPSS).
However, there is an approximate solution at hand. You may preserve the
different RELATIVE weight of your various "categories" or subsamples,
while avoiding the unwelcome expansion (or ABSOLUTE weighting) that
converts n into N. This is achieved by using a new weighting variable W
= X * n/N. This new variable yields a total frequency count of n (not
N), but preserves the differential weighting in relative terms. If the
sampling model only involves stratification (as seems to be your
case) this is a good enough solution. If the sampling model also
involves clustering (i.e. a selection of subpopulations as a first step
before selecting cases within each selected subpopulation), the above
solution may overestimate the true significance.
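The rescaling W = X * n/N can be checked with a short Python sketch (not
SPSS syntax), again using the grade-level figures from the quoted message
and assuming the grades are the strata. The variable names are invented
for the example.

```python
# Enrolment N(i) and returned surveys n(i) per grade, from the quoted message.
# Assumption: the six grades are the strata.
population = {7: 395, 8: 346, 9: 390, 10: 583, 11: 527, 12: 471}
sample = {7: 346, 8: 284, 9: 324, 10: 417, 11: 381, 12: 316}

n_total = sum(sample.values())      # n
N_total = sum(population.values())  # N

# Expansion weights X(i) = N(i) / n(i).
expansion = {g: population[g] / sample[g] for g in population}

# Relative weights W(i) = X(i) * n / N.
relative = {g: expansion[g] * n_total / N_total for g in expansion}

# The weighted case count now sums to n instead of N.
weighted_total = sum(sample[g] * relative[g] for g in sample)

# The ratio between any two strata's weights is unchanged by the rescaling.
ratio_expansion = expansion[12] / expansion[7]
ratio_relative = relative[12] / relative[7]

for g in sorted(relative):
    print(f"grade {g:>2}: X = {expansion[g]:.4f}, W = {relative[g]:.4f}")
print(f"weighted total = {weighted_total:.1f} (n = {n_total})")
```

Here W ranges from roughly 0.87 (7th grade) to roughly 1.14 (12th grade).
The weighted total is 2068, and the 12th-to-7th-grade weight ratio is the
same under X and under W.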
Universidad del Salvador
Buenos Aires, Argentina
Aron Johnson wrote:
> I have never done this before and need a hand if you get a chance.
> I've got data from two schools, a HS and JrHS, divided into 4 categories, 2 each for each school. (just call em cat1HS, cat2HS, cat1JrHS, cat2JrHS).
> There are 40 variables measuring attitudes and opinions.
> The problem is that the number of returned surveys does not match the number of students in each grade. Here is the breakdown:
> 7th grade: 395 students, 346 surveys (88% return)
> 8th grade: 346 students, 284 surveys (82% return)
> 9th grade: 390 students, 324 surveys (83% return)
> 10th grade: 583 students, 417 surveys (72% return)
> 11th grade: 527 students, 381 surveys (72% return)
> 12th grade: 471 students, 316 surveys (67% return)
> As you can see, both the populations and samples are not equal, and therefore I believe that I need to weight the cases in order to be able to do any direct comparisons b/w the groups. Unfortunately I'm not certain how it works. If I weight cases it asks me for a single frequency variable but I want to weight the cases for every variable, don't I? Especially if I'm comparing means or performing correlations. I can't really compare the means of two groups if I can only choose one variable to weight, that would mean I can only weight, for example, the 9th graders, but not the 10th graders.
> Maybe I'm way off here. Could someone please give me a hand with this?
> Thank you very much.
> Aron Johnson