LISTSERV at the University of Georgia
Date:         Tue, 24 Oct 2006 09:21:35 -0400
Reply-To:     Michael Ni <Michael.Ni@COGNIGENCORP.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Michael Ni <Michael.Ni@COGNIGENCORP.COM>
Subject:      Re: Stepwise Regression
Comments: To: David L Cassell <davidlcassell@MSN.COM>
In-Reply-To:  <BAY103-F1580144198FCF020115C20B0000@phx.gbl>
Content-Type: text/plain; charset=ISO-8859-1

David,

Thanks a lot for your detailed answers. I am just wondering: if stepwise/backward/forward selection is not the right way to go, which procedures should I use to build a model for either prediction or interpretation purposes?

Thanks, Michael

David L Cassell wrote:

> Michael.Ni@cognigencorp.com wrote back:
>
>> David L Cassell wrote:
>>
>> > Michael.Ni@COGNIGENCORP.COM wrote:
>> >
>> >> Hi,
>> >>
>> >> I know that stepwise regression has some shortcomings though it has been
>> >> still widely used in different industries. So my question is that when
>> >> would be a good time to use stepwise regression? In what scenario?
>> >>
>> >> Thanks,
>> >> Michael
>> >
>> > The best time to use stepwise selection is when felons have had your
>> > family kidnapped, and are forcing you to do this at gunpoint. Oh wait,
>> > that's a Harrison Ford movie.
>> >
>> > Really, if you want to use stepwise selection, you have to be aware that
>> > it does not do what people want it to (i.e. magically come up with a
>> > 'best' set of predictor variables), and you need to go back and check
>> > the regression diagnostics for *every* *single* intermediate stage to
>> > make sure that it did not go drastically off-track because of one or
>> > more of the following:
>> >
>> > outliers
>> > leverage points
>> > non-normality of residuals
>> > heteroskedasticity
>> > non-linearities
>> > data contamination
>> > mixtures of error distributions
>> > multi-collinearity
>> > autocorrelation
>> > suppressor variables
>> > . . . .
>> >
>> > It also will mess up if you have measurement errors in your regressors,
>> > because it will see the measurement error as random noise. But
>> > everything messes up on this, unless you specifically model the
>> > measurement error, and that is usually a PROC CALIS task that is
>> > outside the realm of routine regression processes.
>> >
>> > The basic issue is that people think that stepwise selection will
>> > find a 'best' model with the right number of regressors. But it
>> > will not. The formulas for stepwise selection do not actually
>> > work. There is no checking for anything that can go wrong.
>> > So, even if your data are *perfect*, 100% multivariate normal
>> > errors with no outliers and no problems anywhere, you *still*
>> > cannot depend on stepwise selection to get you where you
>> > want to go. Frustrating, eh?
>> >
>> > HTH,
>> > David
>> > --
>> > David L. Cassell
>> > mathematical statistician
>
>> David,
>>
>> Thanks a lot for your detailed explanation. You mean the stepwise
>> ignores the distribution of error?
>
> Yes. As with other standard OLS stats, it pretends the residuals
> are normal, with set covariance and some other features. (For
> example, OLS - ordinary least squares - assumes the errors have
> equal variances, they are all independent, and they are identically
> distributed.)
>
>> It did not check the
>> normality/outlier/etc.?
>
> No, it does not. You have to do that.
>
>> Is it the only thing wrong in theory?
>
> No. In fact, everything is wrong with it in theory, since the so-called
> F cutoffs are not actually distributed as F statistics due to the
> complications involved in the method. You start out taking the max of
> a bunch of things. But the max of p things is not distributed as the
> p things are (unless all the p things are independent and they all have
> one of the Extreme Value distributions). So even at step 1, things go
> haywire. Then, at step 2, you are making estimates that are conditional
> on the prior step results, and so you no longer have the same
> distributions you started with. So pretending that this is a meaningful
> way of deciding when to stop the process (instead of some kind of
> guideline that needs to be checked afterward) is just fatuous.
>
>> When I do
>> backward/forward selection, I usually do the residual plots/goodness of
>> fits plots after I get several candidate models. I do not do it in the
>> intermediate steps. It seems that they are the same.
>
> Backward and forward selection don't work either, for the same reasons
> that stepwise selection is not valid. Remember that these methods are
> driven by the noise in your data as well as the signal, so you end up
> with R-squareds that are biased high, slopes that are biased (usually
> away from zero), t-tests that are biased (usually too high), p-values
> that are biased (usually low), etc. So your estimates are junk, and are
> so data-set specific that validation studies are almost always
> disappointing.
>
> You also need to check at each intermediate stage, because you can
> find the selection processes pushing you toward regressors with large
> outliers, and away from regressors with highly-correlated surrogates
> that are also in the model.
>
>> The only thing is
>> that stepwise just get me one model. Could you please correct me and
>> explain a little more? Thanks so much!
>>
>> Best regards,
>> Michael
>
> You need to move toward a more expert-driven model-building
> approach. It is more reliable, and more trustworthy. It gives you
> model structures which are meaningful and interpretable. It also
> gives you buy-in from your experts and your bosses, and that is
> important in other ways.
>
> HTH,
> David
> --
> David L. Cassell
> mathematical statistician
> Design Pathways
> 3115 NW Norwood Pl.
> Corvallis OR 97330
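
Two points from David's reply quoted above can be illustrated with a small simulation: the maximum of p test statistics does not follow the single-test reference distribution, and the R-squared of a selected regressor is biased high. What follows is only a minimal sketch, written in Python rather than SAS and using just the standard library. The function name simulate, the sample size, the number of candidate regressors and the seed are illustrative assumptions, not anything taken from the thread. It covers only the first step of forward selection: y and all candidate regressors are pure independent noise, and the regressor most correlated with y is "selected" and tested as if it had been chosen in advance.

```python
import math
import random


def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate(n_obs=30, n_pred=10, n_sims=2000, t_crit=2.048, seed=2006):
    """Step 1 of forward selection on pure noise (hypothetical setup).

    t_crit is the two-sided 5% critical value of Student's t with
    n_obs - 2 = 28 degrees of freedom; change it if n_obs changes.
    """
    rng = random.Random(seed)
    df = n_obs - 2
    false_pos = 0
    sum_r2_prespecified = 0.0
    sum_r2_best = 0.0
    for _ in range(n_sims):
        # y is unrelated to every candidate regressor by construction.
        y = [rng.gauss(0, 1) for _ in range(n_obs)]
        r2 = []
        for _ in range(n_pred):
            x = [rng.gauss(0, 1) for _ in range(n_obs)]
            r2.append(pearson_r(x, y) ** 2)
        best = max(r2)  # the regressor a selection step would pick
        # |t| for a simple regression slope, written in terms of r^2.
        t_best = math.sqrt(best * df / (1.0 - best))
        if t_best > t_crit:
            false_pos += 1
        sum_r2_prespecified += r2[0]  # a regressor fixed in advance
        sum_r2_best += best
    return {
        "false_positive_rate": false_pos / n_sims,
        "mean_r2_prespecified": sum_r2_prespecified / n_sims,
        "mean_r2_best": sum_r2_best / n_sims,
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions the nominal 5% test applied to the selected regressor should reject far more often than 5% (with 10 independent noise regressors, close to 1 - 0.95^10, roughly 0.40), and the average R-squared of the selected regressor should be several times that of a regressor fixed in advance. Later steps of a real stepwise procedure add the conditioning problem David describes, which this sketch does not attempt to model.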

