LISTSERV at the University of Georgia
Date:         Mon, 11 Apr 2005 15:09:18 -0400
Reply-To:     Richard Ristow <wrristow@mindspring.com>
Sender:       "SPSSX(r) Discussion" <SPSSX-L@LISTSERV.UGA.EDU>
From:         Richard Ristow <wrristow@mindspring.com>
Subject:      Re: Quadratic fit
Comments: To: Massimiliano Molinari <mm488@cam.ac.uk>
In-Reply-To:  <Prayer.1.0.13.0504111715420.5321@hermes-1.csi.cam.ac.uk>
Content-Type: text/plain; charset="us-ascii"; format=flowed

At 12:15 PM 4/11/2005, Massimiliano Molinari wrote:

>I'm trying to solve a regression problem, in which I have 26
>independent variables and 1 dependent variable. My aim is to reduce
>the number of independent variables. In SPSS, using analyse -
>regression - linear I can select, for instance, a forward selection
>method and obtain a ranking of the variables. However, in my case a
>quadratic model would be more efficient, which contains not only the
>linear terms, but also the squared terms and all the double products:
>y = a0 + a1*X(1) + ... + a26*X(26) +
>   + a27*X(1)^2 + ... + a52*X(26)^2 +
>   + a53*X(1)*X(2) + ... + a77*X(1)*X(26) +
>   + a78*X(2)*X(3) + ... + a101*X(2)*X(26) +
>   ... + a377*X(25)*X(26).
>I was wondering if there is a way to implement such a model in SPSS.

Ouch. Hector Maletta has told you how, correctly. But this model can get you in a LOT of trouble.

The quadratic model has, let's see,

   n*(n+1)/2 = 26*27/2 = 351 quadratic terms
   plus                   26 linear terms
   -----------------------------------------
                         377 terms.

At 5 cases per variable (and many would advise at least twice that many), that's 1,885 cases, minimum. I'd say that's a BARE minimum with a model this complex; I'd be very doubtful with fewer than about 10 times that, say 20,000 cases.
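
The count above can be checked with a few lines of code. This is only a sketch, written in Python rather than SPSS syntax; the function name and the cases_per_term parameter are hypothetical, chosen for illustration, and the 5-cases-per-term rule is the rule of thumb quoted above, not a hard requirement.

```python
from math import comb

def quadratic_model_size(n_vars, cases_per_term=5):
    """Count the terms in a full second-order model (intercept excluded)."""
    linear = n_vars                    # X(1) ... X(n)
    squares = n_vars                   # X(1)^2 ... X(n)^2
    cross_products = comb(n_vars, 2)   # X(i)*X(j) for i < j
    second_order = squares + cross_products  # equals n*(n+1)/2
    total = linear + second_order
    return {
        "linear": linear,
        "squares": squares,
        "cross_products": cross_products,
        "second_order": second_order,
        "total": total,
        "min_cases": total * cases_per_term,
    }

size = quadratic_model_size(26)
for name, value in size.items():
    print(f"{name:>15}: {value}")
```

Passing cases_per_term=10 gives the more conservative figure that "twice that many" implies.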

>Most importantly, I'm not sure whether this procedure is statistically
>[valid]

You were asking whether to include second-order terms (squares and cross-products) in the forward selection. I'll go a little farther:

Forward selection may be unwise altogether. The method has fallen into disfavor, because it can be a "fishing expedition": in effect, you do a lot of statistical tests and report the significant results. Remember: on totally random data, about one test in 20 comes out significant at the .05 level. A forward-selected model may show a good F-statistic when the model means nothing, because the F-test misses the 'ghost' degrees of freedom from the variables that were tested and rejected. A model forward-selected from 26 variables is suspect; one forward-selected from 377 variables is very doubtful.
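
To see the "fishing expedition" in action, here is a small simulation sketch in Python (not SPSS). It mimics only the first step of forward selection: on pure noise, pick the candidate predictor most correlated with y and test it as if it were the only one tried. Everything here is an assumption made for illustration: 100 cases, normally distributed noise, 200 trials, the function names, and the critical value of about 0.197 for |r| at the two-sided .05 level with 98 degrees of freedom. If each test has a 5% false-positive rate, the chance that at least one of 26 independent tests "succeeds" is 1 - 0.95^26, about 0.74.

```python
import random

def pearson_r(x, y):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Assumed: approximate two-sided .05 critical value of |r| for 100 cases.
R_CRIT_N100 = 0.197

def share_with_spurious_hit(n_predictors, n_trials=200, n_cases=100, seed=1):
    """Fraction of all-noise data sets in which the best single predictor
    looks 'significant' when tested as though it were the only one tried."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        y = [rng.gauss(0, 1) for _ in range(n_cases)]
        best = 0.0
        for _ in range(n_predictors):
            x = [rng.gauss(0, 1) for _ in range(n_cases)]
            best = max(best, abs(pearson_r(x, y)))
        if best > R_CRIT_N100:
            hits += 1
    return hits / n_trials

share_26 = share_with_spurious_hit(26)
print("simulated share with a spurious 'significant' predictor:", share_26)
print("expected if the 26 tests are independent:", 1 - 0.95 ** 26)
```

Calling share_with_spurious_hit(377) does the same for the full quadratic pool; it takes longer to run, and the share should be very close to 1.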

On the other hand,

>should the variable selection be carried out using the linear model
>only?

I'm not sure what to say about this. Selecting from a smaller pool (26 variables instead of 377) may be less subject to the criticism above. But if you expect the quadratic terms to make an important contribution to the model, selecting based only on linear terms does lose something: it's quite possible for quadratic terms to be contributors when the corresponding linear terms are not.

Three other points:

. Forward selection or no, it's commonly considered that if any second-order term (square or cross-product) is in the model, the linear terms (the variable or variables going into the second-order term) must also be in, even if they test non-significant.

. We don't know the subject matter of your study. But 26 variables is a lot of variables. Unless your theoretical underpinning is quite strong, it may be very hard to interpret the model you get; to tell what it means.

. Finally, second-order terms can be very highly correlated with the linear terms in the same variables. It doesn't take much: if all observed values are positive, you'll see it. The correlation can easily throw off a multiple regression model: two highly correlated independent variables can test non-significant in the model, when the two actually make a large contribution. There are ways around this, from re-centering your data to computing orthogonal polynomials. Just be sure you've allowed for this in any model with second-order terms, especially in a very complex one like yours.
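
The last point, on the correlation between a variable and its own square, is easy to demonstrate. The sketch below is in Python (not SPSS) and uses made-up data: 1,000 values drawn uniformly between 1 and 10, so all are positive. The variable names are hypothetical. It compares the correlation of x with x^2 before and after re-centering x on its mean. Re-centering works this well only when the distribution is roughly symmetric; skewed data will keep some correlation.

```python
import random

def pearson_r(x, y):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)

# Made-up predictor whose observed values are all positive.
x = [rng.uniform(1.0, 10.0) for _ in range(1000)]

# Re-center on the sample mean before squaring.
mean_x = sum(x) / len(x)
x_centered = [v - mean_x for v in x]

r_raw = pearson_r(x, [v * v for v in x])
r_centered = pearson_r(x_centered, [v * v for v in x_centered])

print("corr(x, x^2), raw values:     ", round(r_raw, 3))
print("corr(x, x^2), centered values:", round(r_centered, 3))
```

The same idea applies to cross-products: center each variable first, then multiply.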

