Date: Mon, 11 Apr 2005 15:09:18 -0400
Reply-To: Richard Ristow <wrristow@mindspring.com>
Sender: "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
From: Richard Ristow <wrristow@mindspring.com>
Subject: Re: Quadratic fit
In-Reply-To: <Prayer.1.0.13.0504111715420.5321@hermes-1.csi.cam.ac.uk>
Content-Type: text/plain; charset="us-ascii"; format=flowed
At 12:15 PM 4/11/2005, Massimiliano Molinari wrote:
>I'm trying to solve a regression problem, in which I have 26
>independent variables and 1 dependent variable. My aim is to reduce
>the number of independent variables. In SPSS, using analyse -
>regression - linear I can select, for instance, a forward selection
>method and obtain a ranking of the variables. However, in my case a
>quadratic model would be more efficient, which contains not only the
>linear terms, but also the squared terms and all the double products:
>y = a0 + a1*X(1) +....+a26*X(26) +
> + a27*X(1)^2 + ... + a52*X(26)^2 +
> + a53*X(1)*X(2) + ... a77*X(1)*X(26) +
> + a78*X(2)*X(3) + ... a101*X(2)*X(26) +
> ... + a377*X(25)*X(26).
>I was wondering if there is a way to implement such a model in SPSS.
Ouch. Hector Maletta has told you how, correctly. But this model can
get you in a LOT of trouble.
The quadratic model has, let's see,
n*(n+1)/2 = 26*27/2 = 351 quadratic terms
plus................. 26 linear terms
-----------------------------------------
......................377 terms.
At 5 cases per variable (and many would advise at least twice that
many), that's 1,885 cases, minimum. I'd say that's a BARE minimum, with
a model this complex; I'd be very doubtful with fewer than about 10
times that, say 20,000 cases.
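
To check the arithmetic above, here is a minimal sketch in Python (used only for illustration; it is not SPSS syntax, and the names X1..X26 are hypothetical stand-ins for your variables). It lists every square and cross-product and applies the 5-cases-per-term rule of thumb:

```python
from itertools import combinations_with_replacement

n_vars = 26
# Hypothetical variable names, standing in for the real predictors.
names = [f"X{i}" for i in range(1, n_vars + 1)]

# Second-order terms: every unordered pair (a, b) with a <= b,
# which covers the 26 squares and the 325 cross-products.
second_order = [f"{a}*{b}" for a, b in combinations_with_replacement(names, 2)]

n_terms = len(names) + len(second_order)   # linear + quadratic terms
cases_per_term = 5                         # rule of thumb quoted above
min_cases = cases_per_term * n_terms

print(len(second_order), n_terms, min_cases)
```

The same list of term names could serve as a checklist for the COMPUTE statements you would need to build the second-order variables.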
>Most importantly, I'm not sure whether this procedure is statistically
>[valid]
You were asking whether to include second-order terms (squares and
cross-products) in the forward selection. I'll go a little farther:
Forward selection may be unwise altogether. The method has fallen into
disfavor, because it can be a "fishing expedition": in effect, do a lot
of statistical tests and report the significant results. Remember: about
one test in 20 is significant at the .05 level on totally random data. A
forward-selected model may show a good F-statistic when the model
means nothing, because the F-test misses the 'ghost' degrees of freedom
from the variables that were tested and rejected. A model
forward-selected from 26 variables is suspect; one forward-selected
from 377 variables is very doubtful.
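
A rough way to see the size of the problem, sketched in Python: if each candidate variable were tested once at the .05 level on pure noise, and the tests were independent, the chance of at least one "significant" result would be 1 - 0.95^k. Independence is an assumption made only for this illustration; the tests in a real forward selection are not independent, so treat the numbers as an order-of-magnitude guide. The function name is hypothetical.

```python
def chance_of_false_positive(n_tests, alpha=0.05):
    """Probability that at least one of n_tests independent tests
    comes out significant at level alpha when no effect exists."""
    return 1 - (1 - alpha) ** n_tests

p_linear = chance_of_false_positive(26)      # pool of linear terms only
p_quadratic = chance_of_false_positive(377)  # full quadratic pool

print(round(p_linear, 3), round(p_quadratic, 6))
```

With 26 candidates the chance is already well over one half; with 377 it is a near certainty.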
On the other hand,
>should the variable selection be carried out using the linear model
>only?
I'm not sure what to say about this. Selecting from a smaller pool (26
variables instead of 377) may be less subject to the criticism above.
But if you expect the quadratic terms to make an important contribution
to the model, selecting based only on linear terms does lose something:
it's quite possible for quadratic terms to be contributors when the
corresponding linear terms are not.
Three other points:
. Forward selection or no, it's commonly considered that if any
second-order term (square or cross-product) is in the model, the linear
terms (the variable or variables going into the second-order term) must
also be in, even if they test non-significant.
. We don't know the subject matter of your study. But 26 variables is a
lot of variables. Unless your theoretical underpinning is quite strong,
it may be very hard to interpret the model you get; to tell what it
means.
. Finally, second-order terms can be very highly correlated with the
linear terms in the same variables. It doesn't take much: if all
observed values are positive, you'll see it. The correlation can easily
throw off a multiple regression model: two highly correlated
independent variables can test non-significant in the model, when the
two actually make a large contribution. There are ways around this,
from re-centering your data to computing orthogonal polynomials. Just
be sure you've allowed for this in any model with second-order terms;
especially, a very complex one like yours.
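
The last point can be illustrated with a small Python sketch using NumPy. The data are made up for the illustration (evenly spaced positive values from 1 to 10), not taken from any real study. It compares the correlation between a variable and its square before and after re-centering on the mean:

```python
import numpy as np

# Made-up predictor: 101 evenly spaced, strictly positive values.
x = np.linspace(1.0, 10.0, 101)

# Correlation between the raw variable and its square.
raw_corr = np.corrcoef(x, x ** 2)[0, 1]

# Re-center on the mean, then square the centered variable.
xc = x - x.mean()
centered_corr = np.corrcoef(xc, xc ** 2)[0, 1]

print(raw_corr, centered_corr)
```

Because these values are symmetric about their mean, the centered correlation comes out essentially zero; with skewed real data, centering would typically reduce the correlation without removing it entirely.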