Date: Mon, 20 Sep 2010 17:38:54 -0400
Reply-To: Sigurd Hermansen <HERMANS1@WESTAT.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Sigurd Hermansen <HERMANS1@WESTAT.COM>
Subject: Re: variable selection after multiple imputation
In-Reply-To: <AANLkTik+vxLKWX+QphG2VSySp5ZyUpvCVEsfaQw1GZoD@mail.gmail.com>
Content-Type: text/plain; charset="us-ascii"
Jordan:
If indeed you do have 10 imputed datasets, you'll save a lot of time and effort by adding a replicate number {1..10} to each and combining them into one SAS dataset. David Cassell's now classic "Don't be Loopy..." paper explains how one should use BY statements to control processing of replicates. Also see his highly relevant papers on resampling methods.
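A minimal sketch of the stacking idea, under assumed names: imp1-imp10 stand for the ten imputed datasets, y for a 0/1 response, and x1 and x2 for predictors already chosen. None of these names come from the thread. The replicate number is called _Imputation_ because that is the BY variable PROC MIANALYZE looks for.

```sas
/* Assumed inputs: WORK.IMP1 ... WORK.IMP10, one dataset per imputation. */
data stacked;
    set imp1-imp10 indsname=source;   /* source holds e.g. WORK.IMP7 */
    /* keep only the digits of the member name to get the replicate number */
    _Imputation_ = input(compress(scan(source, 2, '.'), , 'kd'), 8.);
run;

/* One PROC call fits the same model to every replicate: no macro loop. */
ods output ParameterEstimates=parms;
proc logistic data=stacked;
    by _Imputation_;
    model y(event='1') = x1 x2;       /* y, x1, x2 are placeholders */
run;

/* Combine the ten sets of estimates with Rubin's rules. */
proc mianalyze parms=parms;
    modeleffects Intercept x1 x2;
run;
```

Because the SET statement reads the datasets in order, the stacked file is already sorted by _Imputation_ and needs no PROC SORT before the BY statement.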
Seems to me that you should be asking yourself "Do I have any reason to believe that variable X will contribute to a good prediction of Y?" If no, then don't include it as a predictor.
CART (or TreeNet or JMP recursive partitioning) tends to be robust in the face of highly correlated predictors and missing values of predictors. If you must eliminate variables, you'll find classification and regression methods easy to learn and use and, though not infallible, a quicker and better basis for variable selection than stepwise regression.
Whether you optimize AIC or any other one-dimensional measure of fit or predictive accuracy, you'll still have suboptimized one regression diagnosis over an alleged "truth sample". An ensemble of model selection methods seems more appropriate to me than any single method. The common subset of variables selected by both CART (before imputing missing values but after tree pruning) and GLMSELECT will likely be a parsimonious set of predictors. Adding a few promising predictors that have low correlation with that parsimonious set may improve the model's predictions (as judged by a holdout sample, cross-validation, or the c-statistic). All can be done in SAS or R.
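As a rough illustration of the GLMSELECT half of that idea, the sketch below runs selection separately within each imputed replicate, holds out 30% of each replicate for validation, and then tallies how often each predictor survives. The tally across imputations is an added illustration, not a method proposed in this thread. The dataset stacked, the BY variable _Imputation_, the response y, and the predictors x1-x5 are all assumed names. It also assumes continuous predictors (so each Parameter value is a variable name) and treats the 0/1 response as numeric, which makes this a screening step rather than a final logistic model.

```sas
/* Assumed input: STACKED holds all replicates, indexed by _Imputation_. */
ods output ParameterEstimates=selected;
proc glmselect data=stacked seed=20100920;
    by _Imputation_;
    partition fraction(validate=0.3);   /* holdout sample within each replicate */
    model y = x1-x5                     /* placeholder predictor list */
          / selection=stepwise(select=aic choose=validate);
run;

/* Count the replicates in which each predictor was kept. */
proc freq data=selected;
    where Parameter ne 'Intercept';
    tables Parameter / out=counts;
run;
```

Predictors chosen in most or all replicates are candidates for the parsimonious set; the final model can then be refit BY _Imputation_ and combined with MIANALYZE.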
S
-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Jordan Hoolachan
Sent: Monday, September 20, 2010 1:30 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: variable selection after multiple imputation
Thanks for the responses, Dale and Sigurd. The theme of both your
posts is that stepwise regression has many perils, a sentiment that is
echoed pretty extensively in the literature (though there is certainly
a literature supporting stepwise procedures as well).
The reasons I have chosen stepwise up until this point are multifold.
For one, I have a large (150+) number of predictors with little
theoretical guidance for choosing among them. I have sought to use
stepwise procedures to narrow that list down. I have been using an
alpha level of 0.157 which, when used within a stepwise procedure, has
been cited in the literature as being asymptotically equivalent to
using AIC. This sort of approach has been discussed in a collection
of SUGI papers by Shtatland et al.
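The arithmetic behind the 0.157 figure can be sketched as follows, assuming each candidate predictor adds one parameter and writing G for the likelihood-ratio chi-square of the added term (a symbol introduced here for illustration):

```latex
\mathrm{AIC}_k = -2\log L_k + 2k
\qquad\Longrightarrow\qquad
\mathrm{AIC}_{k+1} - \mathrm{AIC}_k
  = -\bigl(2\log L_{k+1} - 2\log L_k\bigr) + 2
  = -G + 2 .

\mathrm{AIC}_{k+1} < \mathrm{AIC}_k \iff G > 2,
\qquad
P\bigl(\chi^2_1 > 2\bigr) = 2\bigl(1 - \Phi(\sqrt{2})\bigr) \approx 0.157 .
```

So admitting a one-degree-of-freedom term whenever its p-value is below 0.157 matches admitting it whenever it lowers AIC. The equivalence is only asymptotic because stepwise LOGISTIC uses score tests for entry and Wald tests for removal rather than the likelihood-ratio statistic.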
Another reason I have sought to use stepwise is for feasibility
purposes. As I mentioned, I have already imputed 10 data sets (a
number which is on the upper end of recommended imputations but
necessary for the degree of missingness that I'm dealing with). In
order to use MIANALYZE, one must already know what predictors to
include in the model statement. Unfortunately, model selection on
each of the individual imputed data sets often yields different
models, due to the natural variability in the data from imputation.
My goal in trying to combine a stepwise procedure with MIANALYZE was
to produce a single list of predictors after having taken all of the
available data into account. I could then repeat that procedure
multiple times, altering parameters/interaction terms/etc, which would
result in multiple models that could then be compared using validation
techniques.
Finally, I have been using stepwise because I am unfamiliar with CART,
data mining, and other similar techniques. More basic regression
techniques have been the focus of my classes thus far.
With that said, I am completely open to new approaches. I would like
to stay within base SAS (or Stata) as that is all I have available at
the moment.
Thank you for your consideration,
Jordan
On Mon, Sep 20, 2010 at 1:03 AM, Dale McLerran <stringplayer_2@yahoo.com> wrote:
> Variable selection using stepwise regression is not well thought
> of even when one has complete data. Why one would think that
> variable selection employing stepwise regression methods would
> work well for the case where missing data are imputed is really
> beyond my comprehension.
>
> It would be possible to implement such a strategy in SAS. You
> are correct that one would need to employ a macro in order to
> effectively implement a one-step stepwise selection model,
> fit the selected model to each imputation set, and then
> combine imputation results using MIANALYZE. Subsequently,
> you would fit another stepwise selection model which potentially
> adds one variable to a set of already selected predictors,
> stopping when some criterion is met.
>
> A sketch of how this would proceed would be something like
> the following:
>
>
> %macro mi_stepwise(data=mydata, response=, preds=, stop=);
>
>   %let in=;                /* list of vars selected by stepwise    */
>   %let n_in=0;             /* # of vars selected by stepwise       */
>
>   %let candidates=&preds;  /* list of vars which could be selected */
>   %let n_candidates=0;     /* # of vars which could be selected    */
>   %do %while(%scan(&candidates,%eval(&n_candidates+1))^=%str());
>     %let n_candidates=%eval(&n_candidates+1);
>   %end;
>
>   /* Pseudocode: a macro %do cannot combine %to with %until, so the */
>   /* stop condition has to be tested inside the loop.               */
>   %do i=1 %to &n_candidates %until( <function of stop condition is met> );
>     ods output ModelBuildingSummary=ModelSelection;
>     proc logistic data=&data;
>       model &response = &in &candidates /
>             selection=forward
>             %if &n_in>0 %then include=&n_in;
>       ;  /* ends the MODEL statement; the semicolon above only ends %if */
>     run;
>
>     <Data step code to determine what variables were added to the>
>     <model, if any, by the stepwise model specified above. If a  >
>     <new variable is added, then update the set of predictors    >
>     <in both the IN set and the CANDIDATE set, and update the    >
>     <number of variables in the IN set. Don't update the number  >
>     <of variables in the CANDIDATE set, because that number is   >
>     <the basis for the macro do loop we are in.                  >
>
>     %if <new variable added to set IN> %then %do;
>       ods output parameterEstimates=Parms
>                  covb=CovParms;
>       proc logistic data=&data;
>         by imputation_set;
>         model &response = &in / covb;
>       run;
>
>       proc mianalyze ...
>     %end;
>   %end;
> %mend;
>
>
> This is a very rough outline. Quite a bit of work remains to
> flesh out all of the details. But hopefully it is enough to
> get you headed in the right direction.
>
> Dale
>
> 
> Dale McLerran
> Fred Hutchinson Cancer Research Center
> mailto: dmclerra@NO_SPAMfhcrc.org
> Ph: (206) 667-2926
> Fax: (206) 667-5977
> 
>
>
> --- On Sat, 9/18/10, Jordan Hoolachan <jihool3670@GMAIL.COM> wrote:
>
>> From: Jordan Hoolachan <jihool3670@GMAIL.COM>
>> Subject: variable selection after multiple imputation
>> To: SAS-L@LISTSERV.UGA.EDU
>> Date: Saturday, September 18, 2010, 8:51 PM
>> Dear all,
>>
>> I am in the process of building a logistic prediction model for a
>> large insurance-related data set that has a lot of missing data. To
>> handle the missing data issue, I have produced 10 imputed datasets
>> using IVEware. Now I must perform variable selection to improve the
>> generalizability of the model. In "How should variable selection be
>> performed with multiply imputed data?" written for Stata, Wood et al.
>> (2008) identify a model selection approach (the "RR approach") that
>> utilizes Rubin's Rules for estimating parameters and standard errors
>> across imputed data sets. Specifically, "each model selection step
>> involves fitting the model under the consideration of all data sets
>> and combining the estimates across imputed data sets." The only
>> information they provide in regards to actually doing this in Stata is
>> the following: "For the RR method, stepwise (a command in Stata
>> equivalent to specifying a SELECTION= option within a regression
>> procedure in SAS) was modified to use the Wald test statistics from
>> micombine (equivalent to MIANALYZE in SAS)."
>>
>> From that brief description, it seems like I need to code an iterative
>> procedure in which the results of each step of a stepwise LOGISTIC
>> procedure are fed into MIANALYZE, which then combines the estimates
>> across each of the 10 data sets before feeding the Wald test statistic
>> back to LOGISTIC so the next step in the stepwise procedure can be
>> performed.
>>
>> I am only an intermediate SAS user on my best days so I'm not really
>> sure where to start with this. Any advice on setting up a macro of
>> this sort? Is it even possible in SAS? If you have had experience
>> with analyzing imputed data in the past, do you suggest any other
>> approaches to variable selection?
>>
>> If interested, here is the web address to the Wood et al. paper:
>> http://onlinelibrary.wiley.com/doi/10.1002/sim.3177/abstract
>> Unfortunately, access to the full .pdf is only granted if you have a
>> subscription...I couldn't find a location that made the .pdf available
>> to everyone.
>>
>> Thank you for the consideration!
>> Jordan
>>
>
