Date: Tue, 24 Oct 2006 09:21:35 -0400
Reply-To: Michael Ni <Michael.Ni@COGNIGENCORP.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Michael Ni <Michael.Ni@COGNIGENCORP.COM>
Subject: Re: Stepwise Regression
In-Reply-To: <BAY103-F1580144198FCF020115C20B0000@phx.gbl>
Content-Type: text/plain; charset=ISO-8859-1
David,
Thanks a lot for your detailed answers. I am just wondering: if
stepwise/backward/forward selection is not the right way to go, which
SAS procedures should I use to build a model for either prediction or
interpretation?
Thanks,
Michael
David L Cassell wrote:
> Michael.Ni@cognigencorp.com wrote back:
>
>>
>> David L Cassell wrote:
>>
>> > Michael.Ni@COGNIGENCORP.COM wrote:
>> >
>> >>
>> >> Hi,
>> >>
>> >> I know that stepwise regression has some shortcomings, though it is
>> >> still widely used in different industries. So my question is: when
>> >> would be a good time to use stepwise regression? In what scenario?
>> >>
>> >> Thanks,
>> >> Michael
>> >
>> >
>> > The best time to use stepwise selection is when felons have had your
>> > family kidnapped, and are forcing you to do this at gunpoint. Oh wait,
>> > that's a Harrison Ford movie.
>> >
>> > Really, if you want to use stepwise selection, you have to be aware
>> > that it does not do what people want it to (i.e. magically come up
>> > with a 'best' set of predictor variables), and you need to go back and
>> > check the regression diagnostics for *every* *single* intermediate
>> > stage to make sure that it did not go drastically off-track because of
>> > one or more of the following:
>> >
>> > outliers
>> > leverage points
>> > non-normality of residuals
>> > heteroskedasticity
>> > non-linearities
>> > data contamination
>> > mixtures of error distributions
>> > multi-collinearity
>> > autocorrelation
>> > suppressor variables
>> > . . . .
>> >
>> > It also will mess up if you have measurement errors in your
>> > regressors, because it will see the measurement error as random noise.
>> > But everything messes up on this, unless you specifically model the
>> > measurement error, and that is usually a PROC CALIS task that is
>> > outside the realm of routine regression processes.
>> >
>> > The basic issue is that people think that stepwise selection will
>> > find a 'best' model with the right number of regressors. But it
>> > will not. The formulas for stepwise selection do not actually
>> > work. There is no checking for anything that can go wrong.
>> > So, even if your data are *perfect*, 100% multivariate normal
>> > errors with no outliers and no problems anywhere, you *still*
>> > cannot depend on stepwise selection to get you where you
>> > want to go. Frustrating, eh?
>> >
>> > HTH,
>> > David
>> > --
>> > David L. Cassell
>> > mathematical statistician
>
>
>>
>> David,
>>
>> Thanks a lot for your detailed explanation. Do you mean that stepwise
>> selection ignores the distribution of the errors?
>
>
>
> Yes. As with other standard OLS stats, it pretends the residuals
> are normal, with set covariance and some other features. (For
> example, OLS - ordinary least squares - assumes the errors have
> equal variances, they are all independent, and they are identically
> distributed.)
>
>
>> Does it not check for
>> normality, outliers, etc.?
>
>
>
> No, it does not. You have to do that.
>
>
>> Is it the only thing wrong in theory?
>
>
>
> No. In fact, everything is wrong with it in theory, since the so-called
> F statistics are not actually distributed as F statistics due to the
> complications involved in the method. You start out taking the max of
> a bunch of things. But the max of p things is not distributed as the p
> things are (unless all the p things are independent and they all have
> one of the Extreme Value distributions). So even at step 1, things go
> haywire. Then, at step 2, you are making estimates that are conditional
> on the prior step results, and so you no longer have the same
> distributions you started with. So pretending that this is a meaningful
> way of deciding when to stop the process (instead of some kind of
> guideline that needs to be checked afterward) is just fatuous.
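
The step-1 problem described above can be seen in a small simulation. What
follows is only a minimal sketch under stated assumptions, not anything from
PROC REG: the response and all p candidate regressors are independent
standard normal noise, the function names (f_stat, simulate) are made up for
illustration, and the cutoff 4.04 is an approximate 5% critical value for an
F distribution with (1, 48) degrees of freedom. It compares how often one
pre-chosen regressor exceeds the cutoff with how often the largest of the p
F statistics does.

```python
import random

def f_stat(x, y):
    """F statistic (1 and n-2 df) for the slope in a simple regression of y on x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    r2 = sxy * sxy / (sxx * syy)
    return r2 * (n - 2) / (1.0 - r2)

def simulate(n=50, p=10, reps=2000, cutoff=4.04, seed=1):
    """Return (rejection rate for one fixed regressor, rejection rate for the max of p)."""
    rng = random.Random(seed)
    single_hits = 0
    max_hits = 0
    for _ in range(reps):
        # y is pure noise: no candidate regressor has any real effect
        y = [rng.gauss(0, 1) for _ in range(n)]
        fs = [f_stat([rng.gauss(0, 1) for _ in range(n)], y) for _ in range(p)]
        single_hits += fs[0] > cutoff    # regressor chosen before seeing the data
        max_hits += max(fs) > cutoff     # regressor chosen because it looked best
    return single_hits / reps, max_hits / reps

single_rate, max_rate = simulate()
print("fixed regressor exceeds cutoff:", single_rate)
print("best of p exceeds cutoff:      ", max_rate)
```

Under these assumptions the first rate should sit near the nominal 0.05,
while the second should be near 1 - 0.95**10, or roughly 0.40, even though
nothing in the data is related to anything else.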
>
>
>> When I do
>> backward/forward selection, I usually do the residual plots and
>> goodness-of-fit plots after I get several candidate models. I do not do
>> it in the intermediate steps. It seems that they are the same.
>
>
>
> Backward and forward selection don't work either, for the same reasons
> that stepwise selection is not valid. Remember that these methods are
> driven by the noise in your data as well as the signal, so you end up
> with R-squareds that are biased high, slopes that are biased (usually
> away from zero), t-tests that are biased (usually too high), p-values
> that are biased (usually low), etc. So your estimates are junk, and are
> so data-set specific that validation studies are almost always
> disappointing.
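
The optimism in R-squared and the disappointing validation results can be
sketched the same way. This is again a hedged illustration with hypothetical
names (fit_line, r_squared, selection_bias) and made-up settings: y and 20
candidate regressors are all independent noise, the regressor with the best
in-sample fit is kept, and the fitted line is then scored on a fresh sample
drawn the same way.

```python
import random

def fit_line(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def r_squared(x, y, intercept, slope):
    """1 - SSE/SST for a given line; can be negative on new data."""
    my = sum(y) / len(y)
    sse = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    sst = sum((b - my) ** 2 for b in y)
    return 1.0 - sse / sst

def selection_bias(n=50, p=20, reps=500, seed=2):
    """Average in-sample and validation R-squared after picking the best of p noise regressors."""
    rng = random.Random(seed)
    in_total = 0.0
    out_total = 0.0
    for _ in range(reps):
        y = [rng.gauss(0, 1) for _ in range(n)]
        best_r2, best_fit = -1.0, None
        for _ in range(p):
            x = [rng.gauss(0, 1) for _ in range(n)]
            b0, b1 = fit_line(x, y)
            r2 = r_squared(x, y, b0, b1)
            if r2 > best_r2:
                best_r2, best_fit = r2, (b0, b1)
        # Fresh validation sample: the chosen regressor is still unrelated to y
        x_new = [rng.gauss(0, 1) for _ in range(n)]
        y_new = [rng.gauss(0, 1) for _ in range(n)]
        in_total += best_r2
        out_total += r_squared(x_new, y_new, *best_fit)
    return in_total / reps, out_total / reps

mean_in, mean_out = selection_bias()
print("mean R-squared on the selection data: ", mean_in)
print("mean R-squared on the validation data:", mean_out)
```

With no signal at all, the selected model should still show a clearly
positive R-squared on the data used to pick it, and an R-squared at or below
zero on the validation sample.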
>
> You also need to check at each intermediate stage, because you can
> find the selection processes pushing you toward regressors with large
> outliers, and away from regressors with highly-correlated surrogates
> that are also in the model.
>
>
>> The only thing is
>> that stepwise just gets me one model. Could you please correct me and
>> explain a little more? Thanks so much!
>>
>> Best regards,
>> Michael
>
>
>
> You need to move toward a more expert-driven model-building
> approach. It is more reliable, and more trustworthy. It gives you
> model structures which are meaningful and interpretable. It also
> gives you buy-in from your experts and your bosses, and that is
> important in other ways.
>
> HTH,
> David
> --
> David L. Cassell
> mathematical statistician
> Design Pathways
> 3115 NW Norwood Pl.
> Corvallis OR 97330
>