| Date: | Sun, 13 Sep 2009 23:05:17 -0400 |
| Reply-To: | Sigurd Hermansen <HERMANS1@WESTAT.COM> |
| Sender: | "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU> |
| From: | Sigurd Hermansen <HERMANS1@WESTAT.COM> |
| Subject: | statistical discrimination methods |
| Content-Type: | text/plain; charset="us-ascii" |
A statistical discrimination problem that I struggle with often has several complications that makes it difficult to assess in a statistical model: namely, no verified training data; one highly discriminating predictor and many weaker ones; high levels of missing values of predictors, including the highly discriminating one; and, substantial correlation among predictors.
The simplified procedure below illustrates a method that predicts better under these conditions than logistic regression alone, CART/TreeNet, multiple imputation, Gaussian mixtures models, graph models, and other methods that I have tried. I would prefer a method that predicts as well but has a better interpretation or theoretic foundation. I'd appreciate comments and suggestions.
S
/*
Sample data (binary outcome and predictors) for testing. Actual data have a computed
outcome and continuous predictors.
*/
data matchScores;
do i=1 to 100000;
outcome=1;
x=1;
y=1;
z=1;
if ranuni(99031)>0.3 then do; outcome=0;
x=0;
y=0;
z=0;
end;
if ranuni(99037)<0.2 then do; outcome=0;
end;
if ranuni(98043)<0.5 then do; outcome=.;
end;
if ranuni(98037)<0.2 then do; x=1;
end;
if ranuni(99043)<0.2 then do; x=0;
end;
if ranuni(99017)<0.1 then do; x=.;
end;
if ranuni(97037)<0.1 then do; y=1;
end;
if ranuni(97043)<0.1 then do; y=0;
end;
if ranuni(97017)<0.2 then do; y=.;
end;
if ranuni(96037)<0.2 then do; z=1;
end;
if ranuni(96043)<0.1 then do; z=0;
end;
if ranuni(96017)<0.1 then do; z=.;
end;
if missing(outcome) and missing(x) and missing(y) and missing(z)
then do;
if ranuni(95017)<0.5 then x=1;
else x=0;
end;
/* Introduce high levels of correlation */
if ranuni(91011)<0.75 then y=x;
if ranuni(91013)<0.5 then y=z;
if ranuni(91017)<0.75 then z=x;
if ranuni(88017)<0.25 then do; z=x; y=x; end;
output;
end;
run;
/*
Following a quick stage-wise assessment of discriminatory values of scores derived from
outcomes as a function of predictor values ....
*/
proc glmselect data=matchScores;
model outcome= x y z
/selection=lar ;
run;
/*
for records with no xi missing, including outcome, estimate parameters of a logistic model
with main effects and interactions. The strong predictors used to assign initial outcome values
do not appear in the primary set of predictors in the model. A scoring process computes probabilities
for each observation. The ROC curve describes the predictive performance of the model. Records with
missing values of any one of the predictors initially have missing outcome values.
*/
proc logistic data=matchScores
outmodel=RMModel descending;
model outcome = x y z x*y x*z y*z
/firth ctable pprob = (0 to 1 by .01) outroc=roc ;
ods output ParameterEstimates=ParameterEstimates;
run;
proc logistic inmodel=RMModel;
score data=matchscores out=RMscores;
run;
symbol1 i=join v=none c=blue;
proc gplot data=roc;
plot _sensit_*_1mspec_=1/vaxis=0 to 1 by .1 ;
run;
/*
Recursive updating refines probability estimates. As initial values we assign computed
probabilities for records with strong predictors and non-missing predictor values. To others
we assign default match probabilities related to the overall expected "success" rate. We define
binary outcomes using values of strong predictor to assign one (success) or zero to observations
non-missing values. To observations with missing and essentially incomplete values of the
strong predictor, we assign missing values.
*/
%macro computeEstLR(__seed,
__sampleRate,
__proc,
__predictor,
__revEst,
__parms,
__freq=,
__outcome=postPr,
__priorEst=postPr,
__OOSOutcomes=%str(Prs),
__MinP=%str(0.1),
__CP=%str(0.8), /* Confidence in prior estimate. */
__CE=%str(0.2) /* Confidence in evidence and correlation adjustment. */
);
proc sql;
create table OOSOutcomeSamp as
select case when P_1 IS NULL then &__MinP
else P_1
end as P_1,
case when &__outcome IS NULL then 0.4
else &__outcome
end as &__outcome,
*
from &__OOSOutcomes
where ranuni(&__seed)< &__sampleRate
;
quit;
proc &__proc data=OOSOutcomeSamp %if &__proc=GENMOD % then %do; descending %end; ;
model &__outcome= &__predictor
%if &__proc=GENMOD %then %do; / link=logit dist=bin %end;
%if &__Proc=GLMSELECT %then %do; / selection=lar %end;
;
%if (&__freq^=) %then
%do; freq &__freq;
%end;
ods output ParameterEstimates=&__parms ;
run;
quit;
/* Initial estimates of likelihood exp(a + bx)/exp(a) (full model with [b=0] -> restricted model)
converted to probability estimates and assigned to outcome (missing where highly discriminating
predictor missing). A new estimate of the probability of success for observations with non-missing
predictors (and missing outcomes assigned a neutral value initially) becomes the probability of
success in the next round of the estimation process.
*/
proc sql;
create table Prs as
/* Capture estimated probability from alpha, beta parameter estimates. */
select ((select Estimate from &__parms where UPCASE(Parameter)="INTERCEPT")
+
(&__predictor * (select Estimate from &__parms where UPCASE(Parameter)=UPCASE("&__predictor")))
)
as Pr1,
/* Shrink estimates at extremes of range. */
calculated Pr1 /(max(calculated Pr1,(1.00001 - calculated Pr1))) as wPr1,
/* Add estimate of increase of probability due to observation i, reduced by predictor
correlation factor __CE, to prior estimate of probability, reduced by factor
representing confidence in prior estimate.
*/
case when NOT (&__predictor IS NULL)
then max(min((&__CP * &__priorEst)
+ (&__CE * calculated wPr1)
,1),0)
else &__priorEst
end as &__revEst,
*
from OOSOutcomeSamp
;
quit;
%mend computeEstLR;
%computeEstLR(32117,
%str(1),
GLMSELECT,
x,
postPr,
parms,
__outcome=P_1,
__priorEst=P_1,
__OOSOutcomes=RMScores
)
%computeEstLR(32119,
%str(1),
GLMSELECT,
y,
postPr,
parms,
__priorEst=P_1
)
%computeEstLR(32119,
%str(1),
GLMSELECT,
z,
postPr,
parms,
__priorEst=postPr
)
%computeEstLR(32117,
%str(1),
GLMSELECT,
x*y,
postPr,
parms,
__priorEst=postPr
)
%computeEstLR(33123,
%str(1),
GLMSELECT,
y*z,
postPr,
parms,
__priorEst=postPr
)
%computeEstLR(33137,
%str(1),
GLMSELECT,
x*z,
postPr,
parms,
__priorEst=postPr
)
%computeEstLR(33173,
%str(1),
GLMSELECT,
x*y*z,
postPr,
parms,
__priorEst=postPr
)
|