Date: Wed, 23 Sep 2009 09:33:38 -0400
From: Sigurd Hermansen
Sender: "SAS(r) Discussion"
Subject: Re: Bootstrap for shrinkage and optimism
To: Daniel

Daniel: Let me say from the beginning that I understand the difficulty of trying to implement a repetitive process that involves many complex steps. I've included an example of an evaluation of a similar predictive model in http://analytics.ncsu.edu/sesug/2008/MPSF-072.pdf . I'd definitely go with Cassell's method. The sage and helpful advice that you have received from the 'L's own Batman and from oloolo should help you tailor a solution to your requirements.

Several aspects of survey and observational data make predictive modeling especially difficult. Non-response bias weakens survey results. Correlation of predictors and missing values confound modeling of observational data. Data collected over any appreciable interval of time typically suffers from serial correlation of prediction errors and external influences. I've recently posted an example of a predictive model of the boosting persuasion that attempts to work around correlation and missing value issues: http://www.listserv.uga.edu/cgi-bin/wa?A2=ind0909B&L=sas-l&P=R23069&D=1&H=0&O=D&T=1 .

When I observe what the real statisticians at Westat are doing, I have to wonder about the limits of resampling from one sample. Perhaps another sample of selected variables would help restrict parameter estimates to a more likely range of values and prevent overfitting (model optimism). For this list, better statistical modeling methods seem to be a topic of continuing interest, much like another burning question: where in the world is David Cassell when we need him?

S

-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Daniel
Sent: Tuesday, September 22, 2009 10:16 AM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Bootstrap for shrinkage and optimism

Good morning All,

I am developing a predictive model (outcome binary) following the methodology outlined in "Clinical prediction models" by Steyerberg, or that in StatMed vol. 15 pp. 361-387 (Multivariable prognostic models: Issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors). I am using bootstrap to obtain measures of shrinkage and optimism to correct my regression coefficients and goodness of fit (GOF) measures (respectively) for overfitting. The steps include:

1. Obtain X bootstrap samples with replacement, of the same size as the original data
2. Use each sample to model the outcome using, in our case, a fixed set of covariates. Get GOF measures of interest
3. Score the original data with the model obtained in 2. Obtain GOF measures of interest on the model applied to the original data
... some additional steps irrelevant to my question

I've used David Cassell's advice to program, in very few lines, steps 1 and 2, by building a dataset with my X bootstrap samples with replacement, and then running PROC LOGISTIC with the "BY REPLICATE" statement.
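For readers following along, steps 1 and 2 as described here might look roughly like the sketch below. It is only an illustration under stated assumptions: the data set name orig, the outcome y, the covariates x1-x3, the seed, and the choice of 200 replicates are all hypothetical placeholders, not taken from the original post.

```sas
/* Step 1: draw the bootstrap samples in one pass.                    */
/* METHOD=URS samples with replacement; SAMPRATE=1 keeps each sample  */
/* the same size as the original; OUTHITS writes one row per hit;     */
/* REP= sets the number of samples (X). The output data set gets a    */
/* variable named Replicate identifying each sample.                  */
/* Assumed names: orig, boot, seed value, rep=200.                    */
proc surveyselect data=orig out=boot seed=12345
                  method=urs samprate=1 outhits rep=200;
run;

/* Step 2: fit the same fixed-covariate model to every sample.        */
/* OUTEST= stores one row of coefficients per replicate.              */
/* Assumed names: y, x1, x2, x3, est.                                 */
proc logistic data=boot outest=est noprint;
    by replicate;
    model y(event='1') = x1 x2 x3;
run;
```

The BY statement takes the place of a macro loop over samples, which is the point of the approach described above.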

To score the original data using each of my X models, I used the OUTEST= option in my PROC LOGISTIC run of step 2, and I then ran a second PROC LOGISTIC, this time with the INEST= option. But for this to work the way I want, I need to use a "BY REPLICATE" statement, and this means that I need to create a dataset with my original data repeated X times, each time with a new value of REPLICATE. This allows me to avoid the do loop. The negative aspect (though it might be mitigated by the efficiency of using the BY statement) is that I need to create this dataset, and depending on the value of X, it can get quite large. Can you think of other ways this could be done as efficiently as steps 1 and 2 (perhaps from your own experiences)?
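One possible alternative, offered only as a sketch and not as a method anyone in this thread proposed, is to apply each replicate's coefficients to the original data in a DATA step view, so the X stacked copies are generated on the fly rather than stored. The sketch assumes a data set est produced by OUTEST= in a BY REPLICATE run of PROC LOGISTIC, an original data set orig, and hypothetical variable names y and x1-x3.

```sas
/* Hypothetical names: est (OUTEST= output, one row per replicate),   */
/* orig (original data), covariates x1-x3.                            */
/* OUTEST= names each coefficient after its covariate, so the         */
/* coefficients are renamed to avoid clashing with the data values.   */
data scored / view=scored;
    set est(keep=replicate Intercept x1 x2 x3
            rename=(Intercept=b0 x1=b1 x2=b2 x3=b3));
    /* For this replicate's coefficients, read every original row     */
    /* by direct access and compute the predicted probability.        */
    do i = 1 to n;
        set orig point=i nobs=n;
        xb    = b0 + b1*x1 + b2*x2 + b3*x3;   /* linear predictor     */
        p_hat = 1 / (1 + exp(-xb));           /* logistic inverse link */
        output;
    end;
run;
```

Because scored is a view, the rows are produced only when a later step reads it, and that step can still use BY REPLICATE to compute GOF measures per model. Whether this is actually faster than the stacked-data approach would need to be checked on the real data.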

Thank you.

Daniel
