LISTSERV at the University of Georgia
Date:   Sun, 27 Dec 2009 21:10:48 -0500
Reply-To:   Wensui Liu <liuwensui@GMAIL.COM>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   Wensui Liu <liuwensui@GMAIL.COM>
Subject:   Re: Screening factors for logistic regression
Comments:   To: Sigurd Hermansen <HERMANS1@westat.com>
In-Reply-To:   <FE10F31634E7F34B87AA143D596085413E696747@EX-CMS01.westat.com>
Content-Type:   text/plain; charset=ISO-8859-1

Sigurd, while I agree with most of your suggestions to Steve, I have to disagree with you on using CART as a variable selection tool. CART selects variables on a local scale rather than a global one, and each child split depends heavily on its parent split, which is itself very unstable and very sensitive to the data structure.
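
To make the instability point concrete, here is a minimal sketch in Python (used only for illustration; it is neither SAS nor JMP, and not a full CART implementation). It looks only at the root split: on each bootstrap resample it picks the predictor with the lowest weighted Gini impurity and counts how often each predictor wins. The simulated data, the function names, and the choice of 200 resamples are all assumptions made for this example.

```python
import random


def gini(pos, n):
    """Gini impurity of a node holding `pos` events among `n` records."""
    if n == 0:
        return 0.0
    p = pos / n
    return 2.0 * p * (1.0 - p)


def best_split_score(xs, ys):
    """Lowest weighted Gini impurity over all thresholds on one predictor."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    total_pos = sum(y for _, y in pairs)
    best = gini(total_pos, n)  # impurity when no split is made
    left_pos = 0
    for i in range(1, n):
        left_pos += pairs[i - 1][1]
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot place a threshold between tied values
        right_pos = total_pos - left_pos
        score = (i * gini(left_pos, i) + (n - i) * gini(right_pos, n - i)) / n
        best = min(best, score)
    return best


def root_split_variable(columns, ys):
    """Index of the predictor a CART-style tree would split on first."""
    scores = [best_split_score(col, ys) for col in columns]
    return scores.index(min(scores))


def root_split_counts(columns, ys, n_boot=200, seed=0):
    """How often each predictor wins the root split across bootstrap resamples."""
    rng = random.Random(seed)
    n = len(ys)
    counts = [0] * len(columns)
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        boot_cols = [[col[i] for i in idx] for col in columns]
        boot_ys = [ys[i] for i in idx]
        counts[root_split_variable(boot_cols, boot_ys)] += 1
    return counts


def simulate(n=300, seed=1):
    """Hypothetical data: x2 is a noisy copy of x1, x3 is unrelated noise."""
    rng = random.Random(seed)
    x1 = [rng.gauss(0, 1) for _ in range(n)]
    x2 = [v + rng.gauss(0, 0.3) for v in x1]
    x3 = [rng.gauss(0, 1) for _ in range(n)]
    ys = [1 if a + b + rng.gauss(0, 1) > 0 else 0 for a, b in zip(x1, x2)]
    return [x1, x2, x3], ys


if __name__ == "__main__":
    cols, ys = simulate()
    print(root_split_counts(cols, ys))  # wins for x1, x2, x3
```

With two nearly interchangeable predictors, I would expect the first split to move back and forth between them from one resample to the next, and everything below the root then changes with it.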

Also, using GLMSELECT as a variable selection tool for logistic regression is very heuristic, without sound theoretical grounding.

"ALL automated variable/predictor selection programs suffer from the same well-known generic defects" is itself a false claim. How many programs does "ALL" cover, and which ones are they? It is very dangerous to say "ALL" in the statistical world.

On Wed, Dec 16, 2009 at 12:21 PM, Sigurd Hermansen <HERMANS1@westat.com> wrote:
>
> Steve:
> I didn't have time to reply when you posted this message. Hope that this response will have some use to you.
>
> What you are proposing amounts to automated variable selection with the c-statistic as a criterion. While I wouldn't recommend that you rely exclusively on the c-statistic, I have demonstrated an efficient method for computing it for a set of predictions and observed binary outcomes. See Lex Jansen's excellent archives for a two-part paper on Evaluating Predictive Models:
> http://www.lexjansen.com/cgi-bin/xsl_transform.php?x=sesug2008&s=sesug&c=sesug
>
> As for techniques, I'd recommend that you look first at the JMP analog of classification trees (recursive partitioning? CART?). Variables selected for early "splits" will be good candidates for a predictive model. Correlations among predictors don't affect CART, and CART makes good use of proxies for missing values. CART does tend to commit very early to a hierarchy and may not find a better model. You may need to remove predictors in early splits and explore alternative models.
>
> The new GLMSELECT procedure in SAS implements stage-wise variable selection (LAR or LASSO) and implements shrinkage methods. I wouldn't start with 600 predictors. Perhaps the early split variables in a classification tree and others that seem important a priori would give you a good start. Once you have a potential model or several potential models specified, try a logistic regression model and compute scores and graph the ROC.
>
> All automated variable/predictor selection programs suffer from the same well-known generic defects: estimation methods optimize within an observed sample (model optimism), misleading confidence bounds on parameter estimates, and predictions conditioned on outcomes presumed known with certainty. Cross-validation of models may help. Also, Efron has written extensively recently about using expected false discovery rates to evaluate models selected from sets of many predictors and many observations.
> S
>
>
> -----Original Message-----
> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Steven Raimi
> Sent: Wednesday, December 02, 2009 9:09 AM
> To: SAS-L@LISTSERV.UGA.EDU
> Subject: Screening factors for logistic regression
>
> I have developed 600+ potential predictors for use in a logistic regression model I'm working on. I want to screen each as efficiently as possible for predictive power (using the c-statistic). We have a brute-force method to generate the c-statistics (proc logistic on yvar=xvar_in_question, then numerically integrate the ROC curve to estimate), but there has to be a more straightforward (and efficient) way to perform this task, right?
>
> Also, I want to identify variables/groups of variables that are collinear, so I can leave out all but the most sensible one(s) (per subject matter knowledge). I could use PROC CORR, but that will be overwhelmed trying to do 600*600 combinations. Again, isn't there a better way to attack this?
>
> FYI - I have both SAS and JMP available. Only about 5% of the dataset can fit in JMP - but we'll be developing the regression there (using all target outcomes, and a few percent of the other records so there's a minimum of two non-target records per target one).
>
> Thanks for the guidance!
> Steve
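
On the original question of getting the c-statistic without numerically integrating a ROC curve: the c-statistic equals the Mann-Whitney (Wilcoxon rank-sum) statistic scaled by the number of event/non-event pairs, so one sort per predictor is enough. The sketch below is in Python for illustration only; it is not necessarily the method in the paper Sigurd cites, and the function names are hypothetical. It also assumes the usual case in which a one-predictor logistic fit simply orders records by that predictor, so the fitted model's c-statistic is taken as max(c, 1 - c) of the raw predictor.

```python
def c_statistic(scores, outcomes):
    """Concordance (area under the ROC curve) computed from midranks."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j to the end of the block of tied scores starting at i.
        while j + 1 < n and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        midrank = (i + j) / 2.0 + 1.0  # average 1-based rank of the block
        for k in range(i, j + 1):
            ranks[order[k]] = midrank
        i = j + 1
    n_pos = sum(1 for y in outcomes if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one event and one non-event")
    rank_sum_pos = sum(r for r, y in zip(ranks, outcomes) if y == 1)
    # Mann-Whitney U for the events, divided by the number of pairs.
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def screen(columns, outcomes):
    """Direction-free c-statistic for each candidate predictor (assumption:
    a one-predictor model ranks records by the predictor itself)."""
    result = []
    for col in columns:
        c = c_statistic(col, outcomes)
        result.append(max(c, 1.0 - c))
    return result


if __name__ == "__main__":
    y = [0, 0, 1, 1]
    print(c_statistic([0.1, 0.4, 0.35, 0.8], y))  # 3 of 4 pairs concordant: 0.75
```

Tied scores count as half-concordant here, which matches the usual definition of c. Each predictor costs one sort, so 600 predictors mean 600 sorts and no model fitting at all.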

--
==============================
WenSui Liu
Blog : statcompute.spaces.live.com
Tough Times Never Last. But Tough People Do. - Robert Schuller
==============================

