Date:  Tue, 22 Sep 2009 10:31:47 -0700 
Reply-To:  Daniel <daniel.biostatistics@GMAIL.COM> 
Sender:  "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU> 
From:  Daniel <daniel.biostatistics@GMAIL.COM> 
Organization:  http://groups.google.com 
Subject:  Re: Bootstrap for shrinkage and optimism 

Content-Type:  text/plain; charset=ISO-8859-1 
Thank you very much to Data _null_; and oloolo for their replies.
Data _null_, you are indeed correct: what I wanted was to replicate
the entire original dataset X times.
oloolo: PROC SCORE is a very interesting suggestion. The only issue is
that unlike PROC LOGISTIC, it won't automatically compute goodness of
fit measures, correct? I could definitely program them manually using
the output though.
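One way I might get those measures by hand is sketched below. It is only
a rough sketch under assumptions I have not verified: that OUT is the
PROC SCORE output data set from oloolo's example quoted below, that it
carries the Replicate variable over from the SCORE= data set, that
remiss0 is the observed outcome, and that remiss holds the linear
predictor (intercept included). The output names calib and assoc are
made up for illustration.

```sas
/* Assumed layout of OUT (from oloolo's PROC SCORE step):           */
/*   Replicate = BY variable taken from the SCORE= data set         */
/*   remiss0   = observed outcome (renamed on input)                */
/*   remiss    = linear predictor computed by PROC SCORE            */
proc sort data=out;
   by Replicate;
run;

/* Refit the observed outcome on the linear predictor, once per     */
/* replicate. The slope on REMISS is the calibration slope, and the */
/* c statistic in the Association table measures discrimination of  */
/* each bootstrap model on the original data (slope assumed > 0).   */
ods output ParameterEstimates=calib Association=assoc;
proc logistic data=out;
   by Replicate;
   model remiss0(event='1') = remiss;
run;
```

If this works as I expect, averaging the slopes in calib over the
replicates would give a shrinkage estimate, and the c statistics in
assoc could be compared with those from the bootstrap fits to estimate
optimism.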
And thank you to both for suggesting PROC SURVEYSELECT as a means of
constructing a dataset with X replicates of the original data.
Oloolo's last reply best fits the spirit of my original post, i.e. I
wanted to know if I really needed to create the dataset with X
replicates of my original data, or if I could somehow get by with one
replicate and apply each of the X models to it to get X scored data
sets (but without resorting to a do loop). If I use PROC LOGISTIC
along with the original data, it will tell me that the BY variable is
missing from that dataset. The SCORE procedure does allow that, as
indicated by the following message:
NOTE: No BY variables are present in the DATA= data set. Each BY group
of the SCORE= data set will be used to compute scores for the entire
DATA= data set.
which is indeed what I want, but I would have liked to be able to use
PROC LOGISTIC to get some of the statistics it computes automatically.
I am using the last example from this page
(http://support.sas.com/kb/25/019.html).
For now, if my understanding is correct, it appears that I have two
options:
1. Create X replicates of the original data (using PROC SURVEYSELECT as
outlined by Data _null_) and use PROC LOGISTIC to score these X
replicates with my X sets of regression coefficients, getting X GOF
measures in the process.
2. Use PROC SCORE with a BY statement on 1 replicate of the original
data, thereby scoring this dataset with my X sets of regression
coefficients, but without the GOF measures.
Is that correct?
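To make the first route concrete, here is a minimal sketch of what I
have in mind. It reuses the Remission data and the est data set (the
OUTEST= output) from oloolo's example quoted below; the names orig10,
fit_orig and assoc_orig are made up for illustration. I am assuming
that INEST= honors the BY variable and that MAXITER=0 makes PROC
LOGISTIC evaluate the supplied coefficients rather than refit them, so
please correct me if that is wrong.

```sas
/* Stack 10 full copies of the original data. With RATE=1 and the   */
/* default selection method, every unit is kept in every replicate. */
proc surveyselect data=Remission out=orig10 rate=1 rep=10 noprint;
run;

/* Apply each replicate's coefficients from EST to its own copy of  */
/* the original data. MAXITER=0 is intended to stop any refitting,  */
/* so the statistics are computed at the INEST= values.             */
ods output FitStatistics=fit_orig Association=assoc_orig;
proc logistic data=orig10 inest=est;
   by Replicate;
   model remiss(event='1') = cell smear infil li blast temp / maxiter=0;
run;
```

The REP= value has to match the number of replicates in est (10 in
oloolo's example) so that every BY group finds its own set of
coefficients.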
Thank you,
Daniel
On Sep 22, 12:20 pm, dynamicpa...@YAHOO.COM (oloolo) wrote:
> If it is only for the Scoring step, OP does not need to replicate the
> original data X times
> use:
> PROC SCORE, the REPLICATE variable in SCORE= dataset will do this job,
> for example:
>
> data Remission;
> input remiss cell smear infil li blast temp;
> label remiss='Complete Remission';
> datalines;
> 1 .8 .83 .66 1.9 1.1 .996
> 1 .9 .36 .32 1.4 .74 .992
> 0 .8 .88 .7 .8 .176 .982
> 0 1 .87 .87 .7 1.053 .986
> 1 .9 .75 .68 1.3 .519 .98
> 0 1 .65 .65 .6 .519 .982
> 1 .95 .97 .92 1 1.23 .992
> 0 .95 .87 .83 1.9 1.354 1.02
> 0 1 .45 .45 .8 .322 .999
> 0 .95 .36 .34 .5 0 1.038
> 0 .85 .39 .33 .7 .279 .988
> 0 .7 .76 .53 1.2 .146 .982
> 0 .8 .46 .37 .4 .38 1.006
> 0 .2 .39 .08 .8 .114 .99
> 0 1 .9 .9 1.1 1.037 .99
> 1 1 .84 .84 1.9 2.064 1.02
> 0 .65 .42 .27 .5 .114 1.014
> 0 1 .75 .75 1 1.322 1.004
> 0 .5 .44 .22 .6 .114 .99
> 1 1 .63 .63 1.1 1.072 .986
> 0 1 .33 .33 .4 .176 1.01
> 0 .9 .93 .84 .6 1.591 1.02
> 1 1 .58 .58 1 .531 1.002
> 0 .95 .32 .3 1.6 .886 .988
> 1 1 .6 .6 1.7 .964 .99
> 1 1 .69 .69 .9 .398 .986
> 0 1 .73 .73 .7 .398 .986
> ;
> run;
>
> proc sort data=Remission; by Remiss; run;
> ods select none;
> proc surveyselect data=Remission out=samp rate=1 method=urs outhits rep=10;
> strata remiss;
> run;
> ods select all;
>
> proc sort data=samp;by Replicate; run;
>
> proc logistic data=samp outest=est noprint;
> by Replicate;
> model remiss(event='1')=cell smear infil li blast temp;
> run;
>
> proc score data=Remission(rename=(remiss=remiss0)) out=out score=est
> type=parms;
> by replicate;
> var cell smear infil li blast temp;
> run;
>
>
>
> On Tue, 22 Sep 2009 10:43:29 -0500, Data _null_; <iebup...@GMAIL.COM> wrote:
> >No read it again
>
> >> this means that I need to have to create a dataset with my original
> >> data repeated X times, each time with a new value of REPLICATE
>
> >On 9/22/09, oloolo <dynamicpa...@yahoo.com> wrote:
> >> add one more option: OUTHITS
> >> otherwise multiple replicated records will be collapsed into one
> >> besides, for Bootstrap analysis, OP needs to sample WITH REPLACEMENT, not
> >> WITHOUT REPLACEMENT
>
> >> **********************;
> >> ods select none;
> >> proc surveyselect data=sashelp.class out=class100
> >> rate=1 method=urs rep=100 outhits;
> >> run;
> >> ods select all;
> >> **********************;
>
> >> On Tue, 22 Sep 2009 10:23:39 -0500, Data _null_; <iebup...@GMAIL.COM> wrote:
>
> >> >On 9/22/09, Daniel <daniel.biostatist...@gmail.com> wrote:
> >> >> this means that I need to have to create a dataset with my original
> >> >> data repeated X times, each time with a new value of REPLICATE
>
> >> >METHOD=URS does NOT produce the data that I think the OP is
> >> >requesting. If I understand correctly he wants to replicate the
> >> >original data set REP=n times.
>
> >> >Similar to this but with less work.
>
> >> >data class10;
> >> > set
> >> > sashelp.class(in=in1 )
> >> > sashelp.class(in=in2 )
> >> > sashelp.class(in=in3 )
> >> > sashelp.class(in=in4 )
> >> > sashelp.class(in=in5 )
> >> > sashelp.class(in=in6 )
> >> > sashelp.class(in=in7 )
> >> > sashelp.class(in=in8 )
> >> > sashelp.class(in=in9 )
> >> > sashelp.class(in=in10) open=defer;
> >> > replicate = index(cats(of in:),'1');
> >> > run;
>
> >> >Using URS does not produce that same result.
>
> >> >2048 proc surveyselect method=urs rate=1 rep=10 data=sashelp.class
> >> >out=class10;
> >> >2049 run;
>
> >> >NOTE: The data set WORK.CLASS10 has 124 observations and 7 variables.
>
> >> >On 9/22/09, oloolo <dynamicpa...@yahoo.com> wrote:
> >> >> in addition to what DATA _NULL_ said, be sure to use:
> >> >> method=urs
> >> >> to get a random sample WITH REPLACEMENT
> >> >> you can set other values for "rate=", say rate=0.7
>
> >> >> proc surveyselect data=yourdata out=sample
> >> >> rate=1 method=urs rep=100;
> >> >> run;
>
> >> >> On Tue, 22 Sep 2009 10:01:24 -0500, Data _null_; <iebup...@GMAIL.COM> wrote:
>
> >> >> >Consider a SURVEYSELECT with RATE=1. This is in one of Cassell's
> >> >> >papers but you may have missed it.
>
> >> >> >2042 proc surveyselect rate=1 rep=10 data=sashelp.class out=class10;
> >> >> >2043 run;
>
> >> >> >NOTE: Under the specified sampling rate, all units will be included
> >> >> >in the sample.
> >> >> >NOTE: The data set WORK.CLASS10 has 190 observations and 6 variables.
>
> >> >> >On 9/22/09, Daniel <daniel.biostatist...@gmail.com> wrote:
> >> >> >> Good morning All,
>
> >> >> >> I am developing a predictive model (outcome binary) following the
> >> >> >> methodology outlined in "Clinical prediction models" by
> >> >> >> Steyerberg, or that in StatMed vol. 15 pp. 361-387 (Multivariable
> >> >> >> prognostic models: Issues in developing models, evaluating
> >> >> >> assumptions and adequacy, and measuring and reducing errors). I am
> >> >> >> using bootstrap to obtain measures of shrinkage and optimism to
> >> >> >> correct my regression coefficients and goodness of fit (GOF)
> >> >> >> measures (respectively) for overfitting. The steps include:
>
> >> >> >> 1. Obtain X bootstrap samples with replacement, of the same size as
> >> >> >> the original data
> >> >> >> 2. Use each sample to model the outcome using, in our case, a fixed
> >> >> >> set of covariates. Get GOF measures of interest
> >> >> >> 3. Score the original data with the model obtained in 2. Obtain GOF
> >> >> >> measures of interest on the model applied to the original data
> >> >> >> ... some additional steps irrelevant to my question
>
> >> >> >> I've used David Cassell's advice to program, in very few lines,
> >> >> >> steps 1 and 2, by building a dataset with my X bootstrap samples
> >> >> >> with replacement, and then running PROC LOGISTIC with the
> >> >> >> "BY REPLICATE" statement.
>
> >> >> >> To score the original data using each of my X models, I used the
> >> >> >> OUTEST= option in my PROC LOGISTIC run of step 2, and I then run a
> >> >> >> second PROC LOGISTIC, this time with the INEST= option. But for
> >> >> >> this to work the way I want, I need to use a "BY REPLICATE"
> >> >> >> statement and
> >> >> >> this means that I need to have to create a dataset with my original
> >> >> >> data repeated X times, each time with a new value of REPLICATE.
> >> >> >> This allows me to avoid the do loop. The negative aspect (though it
> >> >> >> might be mitigated by the efficiency of using the BY statement) is
> >> >> >> that I need to create this dataset and depending on the value of X,
> >> >> >> it can get quite large. Can you think of other ways this could be
> >> >> >> done as efficiently as steps 1 and 2 (perhaps from your own
> >> >> >> experiences)?
>
> >> >> >> Thank you.
>
> >> >> >> Daniel
