Date: Tue, 21 Sep 2010 23:57:35 -0400
Reply-To: Jordan Hoolachan <jihool3670@GMAIL.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Jordan Hoolachan <jihool3670@GMAIL.COM>
Subject: Re: variable selection after multiple imputation
Content-Type: text/plain; charset=ISO-8859-1
Again, thank you for your thoughtful responses, Sigurd and Peter.
A quick response to Peter's comment about different models being
produced from different imputed data sets: When I said different
models I didn't mean completely different models. I just meant that,
while a core group of predictors were consistently included in every
model, a few predictors would be selected in some but not in others.
In and of itself, that is to be expected considering the variation
introduced by the imputation process (especially since I have a high
degree of missingness). It only becomes a problem when it comes time
to combine estimates across the imputed datasets and I don't have a
single model to use within MIANALYZE.
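For concreteness, the combining step I'm referring to looks something like
this sketch (y, the placeholder predictors x1-x5, and the data set name
impdata are all stand-ins; the 10 imputations are assumed stacked with the
usual _Imputation_ variable):

```sas
/* Fit the same model in each imputed data set; OUTEST= with COVOUT
   saves the estimates and covariance matrices for MIANALYZE */
proc reg data=impdata outest=regest covout noprint;
   model y = x1 x2 x3 x4 x5;
   by _Imputation_;
run;

/* Pool the 10 sets of estimates using Rubin's rules */
proc mianalyze data=regest;
   modeleffects Intercept x1 x2 x3 x4 x5;
run;
```

The MODELEFFECTS statement is what forces a single fixed list of
predictors, which is exactly what per-data-set selection fails to give me.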
With regard to the main issue of variable selection, I've taken some of
your suggestions into consideration. Based on the correlation matrix
produced by the CORR procedure, there is no issue of high
correlation between predictors: the highest correlation for any pair
of predictors was ~0.4, and the vast majority were <0.1.
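(For reference, this is the sort of check I ran; a sketch, with x1-x150
standing in for my actual predictor names and imp1 for one imputed data
set:)

```sas
/* Pairwise Pearson correlations among the candidate predictors;
   OUTP= saves the matrix so highly correlated pairs can be
   screened programmatically rather than by eye */
proc corr data=imp1 outp=corrmat noprint;
   var x1-x150;
run;
```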
Conveniently, SAS offers a 30-day trial version of its JMP software, so
I spent the day reading up on recursive partitioning and learning my
way around JMP. From what I've read in that short time, recursive
partitioning is often used as a standalone modeling tool. So Sigurd,
my question to you is why did you suggest using CART/JMP to narrow
down the list of predictors and then use that shortlist within the
GLMSelect procedure when I could use tree-based modeling by itself?
Is that merely because I had expressed interest in staying within
regression, given my lack of familiarity with data mining techniques? I
assume that to be the reason but I just want to make sure in case I'm
missing some nuance in the use of tree-based modeling.
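Just so I'm sure I understand the two-stage idea: after the JMP partition
step produces a shortlist, the second stage would be something like the
following sketch (y and the shortlist x1-x20 are hypothetical placeholders
for whatever survives the tree step):

```sas
/* Run selection only on the tree-derived shortlist, using AIC as the
   entry/stopping criterion rather than a p-value cutoff */
proc glmselect data=imp1;
   model y = x1-x20 / selection=stepwise(select=aic choose=aic);
run;
```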
On Mon, Sep 20, 2010 at 5:59 PM, Peter Flom wrote:
> Jordan Hoolachan wrote:
>> Thanks for the responses, Dale and Sigurd. The theme to both your
>> posts is that stepwise regression has many perils, a sentiment that is
>> echoed pretty extensively in the literature (though there is certainly
>> a literature supporting stepwise procedures as well).
>> The reasons I have chosen stepwise up until this point are multi-fold.
>> For one, I have a large (150+) number of predictors with little
>> theoretical guidance for choosing among them. I have sought to use
>> stepwise procedures to narrow that list down. I have been using an
>> alpha level of 0.157 which, when used within a stepwise procedure, has
>> been cited in the literature as being asymptotically equivalent to
>> using AIC. This sort of approach has been discussed in a collection
>> of SUGI papers by Shtatland et al.
> I don't see how that could be true; I think the relation between AIC and
> the p to enter would depend on the number of variables. But, in any case,
> using AIC as a cutoff isn't much better than using p. It tends to favor
> less complex models, but it has all the other problems of using stepwise -
> wrong p-values, wrong standard errors, wrong parameter estimates.
>> Another reason I have sought to use stepwise is for feasibility
>> purposes. As I mentioned, I have already imputed 10 data sets (a
>> number which is on the upper end of recommended imputations but
>> necessary for the degree of missingness that I'm dealing with). In
>> order to use MIANALYZE, one must already know what predictors to
>> include in the model statement. Unfortunately, model selection on
>> each of the individual imputed data sets often yields different
>> models, due to the natural variability in the data from imputation.
>> My goal in trying to combine a stepwise procedure with MIANALYZE was
>> to produce a single list of predictors after having taken all of the
>> available data into account. I could then repeat that procedure
>> multiple times, altering parameters/interaction terms/etc., which would
>> result in multiple models that could then be compared using validation.
>> Finally, I have been using stepwise because I am unfamiliar with CART,
>> data mining, and other similar techniques. More basic regression
>> techniques have been the focus of my classes thus far.
>> With that said, I am completely open to new approaches. I would like
>> to stay within base SAS (or STATA) as that is all I have available at
>> the moment.
> I don't know what's in STATA, but trees are not in Base SAS (I think they
> are in JMP).
> The fact that model selection on the different data sets yields different
> models should be a giant warning that something is wrong.
> If you want to stay within regression, then I suggest you first look at a
> correlation matrix of your 150 variables. If there are pairs that are
> highly correlated (say, over .9) then eliminate one. That should reduce
> some of the problem.
> You might also consider partial least squares regression (PROC PLS).