Date: Thu, 26 Oct 2006 22:07:13 GMT
Reply-To: Paige Miller <pmiller5NOSPAM@ROCHESTER.RR.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Paige Miller <pmiller5NOSPAM@ROCHESTER.RR.COM>
Organization: Road Runner
Subject: Re: Stepwise Regression
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
On 10/26/2006 2:33 PM, Vadim Pliner wrote:
> Paige Miller wrote:
>> On 10/23/2006 6:00 PM, Vadim Pliner wrote:
>>> At the risk of being slaughtered on SAS-L, let me give you a scenario
>>> where I think stepwise regression could be used.
>>> 1. You are trying to predict something.
>>> 2. You have a lot of independent variables and the selection of
>>> variables presents a problem for you.
>> Partial Least Squares is a better solution
> a. What do you mean by "better"? Here is my definition of "better" in
> this context: if my objective is purely prediction and method X gives
> me closer fit to the actual values on validation data than method Y,
> then method X is better for me.
> b. I was not talking specifically about linear stepwise regression.
> AFAIK, Partial Least Squares is not applicable when the
> dependent variable is binary. I know there are alternatives to stepwise
> logistic regression for selecting variables that you might consider
> "better" as well, but see a. above.
I gave a reference and a summary of that reference that explains what
"better" meant. I said that Frank and Friedman showed that PLS (and
other methods) had much lower MSE on the predictions, and much lower
MSE on the regression coefficients, than stepwise and OLS-based methods.
There are variations of PLS that work on binary Y.
>>> 3. You have enough data points to split your data into a large enough
>>> training data set (where you build the model) and a large enough test
>>> or validation data set where you can select the best model.
>> Partial Least Squares still is a better solution
> See a. above again.
Okay, see my reference again.
>>> 4. You build a number of competing models, one of which is created with
>>> stepwise regression.
>> This does nothing to eliminate the drawbacks of stepwise. Lots of data
>> does not eliminate the drawbacks of stepwise. Having a large test data
>> set does not eliminate the drawbacks of stepwise. Creating additional
>> models does not eliminate the drawbacks of stepwise.
> I agree, but what lots of data gives you is an opportunity to test
> which of the competing methods predicts best in practice on your
> specific data rather than in theory.
None of which argues in favor of using stepwise.
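For what it's worth, the protocol being described -- fit each competing model on a training split, then keep whichever has the lowest MSE on a held-out validation split -- is easy to sketch. A toy Python illustration with fabricated data and two deliberately simple candidate models (neither is stepwise; the point is only the selection mechanics):

```python
import random

random.seed(1)

# Fabricated data: y depends linearly on x, plus noise.
xs = [i / 10.0 for i in range(100)]
data = [(x, 2.0 * x + 1.0 + random.gauss(0, 0.5)) for x in xs]
random.shuffle(data)
train, valid = data[:70], data[70:]

def fit_mean(pts):
    """Candidate A: predict the training mean, ignoring x."""
    m = sum(y for _, y in pts) / len(pts)
    return lambda x: m

def fit_linear(pts):
    """Candidate B: simple least-squares line (closed form)."""
    n = len(pts)
    sx = sum(x for x, _ in pts)
    sy = sum(y for _, y in pts)
    sxx = sum(x * x for x, _ in pts)
    sxy = sum(x * y for x, y in pts)
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return lambda x: intercept + slope * x

def mse(model, pts):
    return sum((model(x) - y) ** 2 for x, y in pts) / len(pts)

# Fit every candidate on the training split, score each on the
# validation split, and keep whichever predicts best there.
candidates = {"mean-only": fit_mean(train), "linear": fit_linear(train)}
scores = {name: mse(m, valid) for name, m in candidates.items()}
best = min(scores, key=scores.get)
print(best, scores[best])
```

The disagreement in this thread is not with the protocol itself but with whether stepwise deserves a seat among the candidates.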
>>> 5. If on the set-aside test data set stepwise regression gives you the
>>> best predictions, select this model.
>> So you are saying that if there are cases where, simply by random
>> chance, stepwise gives you better predictions, then this is a reason
>> to continue to use stepwise.
> This is not exactly what I was saying. Yes, a couple of times in my
> experience stepwise logistic regression outperformed competing methods
> (3 or 4 neural networks and a decision tree). I doubt it was "by random
> chance", because the sample sizes were too big to believe in chance. I
> didn't say this was a reason to continue to use stepwise, I just gave a
> scenario where you could justify the use of stepwise regression, and
> this was, as far as I remember, the OP's question.
I meant random chance across all possible data sets and choices of
models. I did not mean that the models were not significant or that
the coefficients were not significant.
If stepwise is the best method on a given data set, this is a random
occurrence. You can't look at the dataset and how the data was
collected and decide a priori that stepwise will be superior. That's
what I meant.
But, in opposition to your scenario where you try stepwise and see if
it predicts well, I prefer methods that have been shown to be good
predictors (i.e., low MSE of predictions and low MSE on coefficients)
in many of the situations I face. In other words, in many situations
you can decide a priori that PLS is a better method than OLS-based
methods.
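To make the PLS claim concrete, here is a minimal single-response PLS (PLS1, NIPALS-style) sketch in plain Python -- illustrative only, with fabricated collinear data; in SAS you would of course use PROC PLS rather than roll your own:

```python
import random

random.seed(3)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pls1_fit(X, y, ncomp):
    """Fit PLS1 by NIPALS: extract ncomp latent components."""
    n, p = len(X), len(X[0])
    xbar = [sum(r[j] for r in X) / n for j in range(p)]
    ybar = sum(y) / n
    E = [[r[j] - xbar[j] for j in range(p)] for r in X]   # centered X
    f = [yi - ybar for yi in y]                           # centered y
    W, P, Q = [], [], []
    for _ in range(ncomp):
        w = [dot([r[j] for r in E], f) for j in range(p)]  # X'y direction
        norm = sum(v * v for v in w) ** 0.5
        w = [v / norm for v in w]
        t = [dot(r, w) for r in E]                         # scores
        tt = dot(t, t)
        pvec = [dot([r[j] for r in E], t) / tt for j in range(p)]
        q = dot(f, t) / tt
        # Deflate X and y by the extracted component.
        E = [[E[i][j] - t[i] * pvec[j] for j in range(p)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        W.append(w); P.append(pvec); Q.append(q)
    return xbar, ybar, W, P, Q

def pls1_predict(model, row):
    xbar, ybar, W, P, Q = model
    e = [row[j] - xbar[j] for j in range(len(row))]
    yhat = ybar
    for w, pvec, q in zip(W, P, Q):
        t = dot(e, w)
        e = [e[j] - t * pvec[j] for j in range(len(e))]
        yhat += q * t
    return yhat

# Fabricated data: six highly correlated predictors, all noisy copies
# of one latent variable z, with y driven by z alone.
data = []
for _ in range(100):
    z = random.gauss(0, 1)
    data.append(([z + random.gauss(0, 0.1) for _ in range(6)],
                 2 * z + random.gauss(0, 0.2)))
train, valid = data[:70], data[70:]
model = pls1_fit([r for r, _ in train], [t for _, t in train], ncomp=1)
val_mse = sum((pls1_predict(model, r) - t) ** 2 for r, t in valid) / len(valid)
print(round(val_mse, 3))
```

With many correlated predictors, a single latent component recovers the signal that OLS would have to spread across six noisy, near-collinear coefficients.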
>>> Do I think it's realistic to expect stepwise regression can produce the
>>> best model? Yes, it can. Would you prefer a model that is theoretically
>>> sound or the one that gives you better predictions? I'd prefer the
>>> latter if prediction were my sole goal.
>> But you haven't shown that stepwise is a theoretically good way to get
>> better predictions (it is not), or that it is even a method that will
>> give you better predictions in a reasonable percentage of the cases.
>> Frank and Friedman (Technometrics, 1993) showed that in the
>> situations they studied, OLS-based methods (including stepwise) are the
>> worst thing to use when you have many variables -- worst meaning that
>> the MSE of the predictions and the MSE of the coefficients are very
>> large compared to the much smaller MSEs associated with Principal
>> Components Regression, Ridge Regression, and oh yes, Partial Least
>> Squares Regression.
> I was NOT saying that "stepwise is a theoretically good way to get
> better predictions." On the contrary, I said that if you had two
> methods, say, X and Y, and Y is theoretically better (this is not
> stepwise, I admit) but X gives better predictions on validation data, I
> would select method X.
And I wouldn't bother ever fitting a stepwise model, when there are so
many better methods available.
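For readers following along, the greedy forward half of what stepwise does can be sketched in a few lines (illustrative only; a real implementation such as PROC REG's SELECTION=STEPWISE also drops entered variables and uses F-test entry/stay criteria rather than the crude threshold below):

```python
import random

random.seed(2)

def solve(a, b):
    """Gaussian elimination with partial pivoting for a small system."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            fac = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= fac * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def ols_rss(X, y, cols):
    """Fit OLS on the given columns (plus intercept); return residual SS."""
    rows = [[1.0] + [xr[j] for j in cols] for xr in X]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * ri for b, ri in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))

# Fabricated data: y depends on x0 and x2 only; x1 and x3 are noise.
X = [[random.gauss(0, 1) for _ in range(4)] for _ in range(200)]
y = [3 * xr[0] - 2 * xr[2] + random.gauss(0, 0.5) for xr in X]

selected, remaining = [], [0, 1, 2, 3]
while remaining:
    best_j = min(remaining, key=lambda j: ols_rss(X, y, selected + [j]))
    gain = ols_rss(X, y, selected) - ols_rss(X, y, selected + [best_j])
    if gain < 5.0:   # crude entry threshold, standing in for an F-to-enter test
        break
    selected.append(best_j)
    remaining.remove(best_j)

print(selected)  # the greedy loop picks the strong predictors first
```

The loop's greediness is exactly the problem: each entry decision is conditional on the variables already chosen, which is why its coefficient and prediction MSEs behave so badly in the Frank and Friedman comparisons.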
It's nothing until I call it -- Bill Klem, NL Umpire
If you get the choice to sit it out or dance,
I hope you dance -- Lee Ann Womack