While I agree with most of your suggestions to Steve, I have to disagree
with you on using CART as a variable selection tool. CART selects variables on
a local scale rather than a global one, and each child split depends heavily
on the parent split, which is itself unstable and very sensitive to the data.
Also, using GLMSELECT as a variable selection tool for logistic regression is
a heuristic without a sound theoretical grounding.
"ALL automated variable/predictor selection programs suffer from the same
well-known generic defects" is itself a questionable claim. How many programs
does "ALL" cover, and which are they? It is very dangerous to say "ALL" in a
statistical argument.
On Wed, Dec 16, 2009 at 12:21 PM, Sigurd Hermansen <HERMANS1@westat.com> wrote:
> I didn't have time to reply when you posted this message. Hope that this
> response will have some use to you.
> What you are proposing amounts to automated variable selection with the
> c-statistic as a criterion. While I wouldn't recommend that you rely
> exclusively on the c-statistic, I have demonstrated an efficient method for
> computing it for a set of predictions and observed binary outcomes. See Lex
> Jansen's excellent archives for a two-part paper on Evaluating Predictive
> As for techniques, I'd recommend that you look first at the JMP analog of
> classification trees (recursive partitioning? CART?). Variables selected for
> early "splits" will be good candidates for a predictive model. Correlations
> among predictors don't affect CART, and CART makes good use of proxies for
> missing values. CART does tend to commit very early to a hierarchy and may
> not find a better model. You may need to remove predictors in early splits
> and explore alternative models.
> The new GLMSELECT procedure in SAS implements stage-wise variable
> selection (LAR or LASSO) along with shrinkage methods. I wouldn't start with
> 600 predictors. Perhaps the early split variables in a classification tree
> and others that seem important a priori would give you a good start. Once
> you have a potential model or several potential models specified, try a
> logistic regression model, compute scores, and graph the ROC.
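[Editor's note: a minimal GLMSELECT sketch of the suggestion above, assuming a training set named `train`, predictors `x1`-`x600`, and a binary outcome `y` (all hypothetical names). Keep in mind that GLMSELECT fits a least-squares model, so applying it to a 0/1 outcome is only a screening heuristic, as noted elsewhere in this thread.]

```sas
/* Screening sketch only: GLMSELECT assumes a normal-error linear model,
   so a 0/1 outcome is handled heuristically here.                       */
proc glmselect data=train;
   /* LASSO path; choose=cv picks the step that minimizes the
      cross-validated prediction error                                   */
   model y = x1-x600 / selection=lasso(choose=cv stop=none);
run;
```

Variables surviving the LASSO path can then be refit properly with PROC LOGISTIC.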
> All automated variable/predictor selection programs suffer from the same
> well-known generic defects: estimation methods optimize within the observed
> sample (model optimism), confidence bounds on parameter estimates are
> misleading, and predictions are conditioned on outcomes presumed known with
> certainty. Cross-validation of models may help. Also, Efron has recently
> written extensively about using expected false discovery rates to evaluate
> models selected from sets of many predictors and many observations.
> -----Original Message-----
> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
> Sent: Wednesday, December 02, 2009 9:09 AM
> To: SAS-L@LISTSERV.UGA.EDU
> Subject: Screening factors for logistic regression
> I have developed 600+ potential predictors for use in a logistic
> regression model I'm working on. I want to screen each as efficiently as
> possible for predictive power (using the c-statistic). We have a brute-
> force method to generate the c-statistics (proc logistic on
> yvar=xvar_in_question, then numerically integrate the ROC curve to
> estimate), but there has to be a more straightforward (and efficient) way
> to perform this task, right?
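[Editor's note: a sketch of a less brute-force route for the question above, assuming a dataset `train` with predictors `x1`-`x600` and binary outcome `y` (hypothetical names). PROC LOGISTIC already reports the c-statistic in its "Association of Predicted Probabilities and Observed Responses" table, so there is no need to integrate the ROC curve by hand; ODS OUTPUT can capture it per predictor.]

```sas
%macro screen_c(nvars=600);
   %do i = 1 %to &nvars;
      ods select none;                 /* suppress printed output         */
      ods output Association=_assoc;   /* capture the association table   */
      proc logistic data=train descending;
         model y = x&i;
      run;
      ods select all;

      data _assoc;                     /* keep the c row, tag the variable */
         set _assoc;
         where Label2 = 'c';
         length variable $32;
         variable = "x&i";
         keep variable nValue2;
         rename nValue2 = c_statistic;
      run;

      proc append base=c_stats data=_assoc force;
      run;
   %end;
%mend screen_c;

%screen_c(nvars=600)   /* c_stats then holds one c-statistic per predictor */
```

Sorting `c_stats` by descending `c_statistic` gives the screening ranking in one pass.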
> Also, I want to identify variables/groups of variables that are collinear,
> so I can leave out all but the most sensible one(s) (per subject matter
> knowledge). I could use PROC CORR, but that will be overwhelmed trying to
> do 600*600 combinations. Again, isn't there a better way to attack this?
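[Editor's note: for the collinearity question above, PROC VARCLUS is one alternative to an all-pairs PROC CORR; a minimal sketch, again assuming predictors named `x1`-`x600`.]

```sas
/* VARCLUS groups correlated predictors into disjoint clusters; one
   sensible representative per cluster (e.g., the variable with the
   lowest 1-R**2 ratio, or the one preferred on subject-matter grounds)
   can then be kept and the rest dropped.                               */
proc varclus data=train maxeigen=0.7 short;
   var x1-x600;
run;
```

The `maxeigen=0.7` threshold is a common rule of thumb for when to stop splitting clusters, not a universal setting.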
> FYI - I have both SAS and JMP available. Only about 5% of the dataset can
> fit in JMP - but we'll be developing the regression there (using all
> target outcomes, and a few percent of the other records so there's a
> minimum of two non-target records per target one).
> Thanks for the guidance!
Blog : statcompute.spaces.live.com
Tough Times Never Last. But Tough People Do. - Robert Schuller