Date: Sun, 13 Sep 2009 23:05:17 -0400 Sigurd Hermansen "SAS(r) Discussion" Sigurd Hermansen statistical discrimination methods text/plain; charset="us-ascii"

A statistical discrimination problem that I struggle with often has several complications that makes it difficult to assess in a statistical model: namely, no verified training data; one highly discriminating predictor and many weaker ones; high levels of missing values of predictors, including the highly discriminating one; and, substantial correlation among predictors.

The simplified procedure below illustrates a method that predicts better under these conditions than logistic regression alone, CART/TreeNet, multiple imputation, Gaussian mixtures models, graph models, and other methods that I have tried. I would prefer a method that predicts as well but has a better interpretation or theoretic foundation. I'd appreciate comments and suggestions. S /* Sample data (binary outcome and predictors) for testing. Actual data have a computed outcome and continuous predictors. */ data matchScores; do i=1 to 100000; outcome=1; x=1; y=1; z=1; if ranuni(99031)>0.3 then do; outcome=0; x=0; y=0; z=0; end; if ranuni(99037)<0.2 then do; outcome=0; end; if ranuni(98043)<0.5 then do; outcome=.; end; if ranuni(98037)<0.2 then do; x=1; end; if ranuni(99043)<0.2 then do; x=0; end; if ranuni(99017)<0.1 then do; x=.; end; if ranuni(97037)<0.1 then do; y=1; end; if ranuni(97043)<0.1 then do; y=0; end; if ranuni(97017)<0.2 then do; y=.; end; if ranuni(96037)<0.2 then do; z=1; end; if ranuni(96043)<0.1 then do; z=0; end; if ranuni(96017)<0.1 then do; z=.; end; if missing(outcome) and missing(x) and missing(y) and missing(z) then do; if ranuni(95017)<0.5 then x=1; else x=0; end; /* Introduce high levels of correlation */ if ranuni(91011)<0.75 then y=x; if ranuni(91013)<0.5 then y=z; if ranuni(91017)<0.75 then z=x; if ranuni(88017)<0.25 then do; z=x; y=x; end;

output; end; run; /* Following a quick stage-wise assessment of discriminatory values of scores derived from outcomes as a function of predictor values .... */ proc glmselect data=matchScores; model outcome= x y z /selection=lar ; run;

/* for records with no xi missing, including outcome, estimate parameters of a logistic model with main effects and interactions. The strong predictors used to assign initial outcome values do not appear in the primary set of predictors in the model. A scoring process computes probabilities for each observation. The ROC curve describes the predictive performance of the model. Records with missing values of any one of the predictors initially have missing outcome values. */ proc logistic data=matchScores outmodel=RMModel descending; model outcome = x y z x*y x*z y*z /firth ctable pprob = (0 to 1 by .01) outroc=roc ; ods output ParameterEstimates=ParameterEstimates; run; proc logistic inmodel=RMModel; score data=matchscores out=RMscores; run; symbol1 i=join v=none c=blue; proc gplot data=roc; plot _sensit_*_1mspec_=1/vaxis=0 to 1 by .1 ; run; /* Recursive updating refines probability estimates. As initial values we assign computed probabilities for records with strong predictors and non-missing predictor values. To others we assign default match probabilities related to the overall expected "success" rate. We define binary outcomes using values of strong predictor to assign one (success) or zero to observations non-missing values. To observations with missing and essentially incomplete values of the strong predictor, we assign missing values. */

%macro computeEstLR(__seed, __sampleRate, __proc, __predictor, __revEst, __parms, __freq=, __outcome=postPr, __priorEst=postPr, __OOSOutcomes=%str(Prs), __MinP=%str(0.1), __CP=%str(0.8), /* Confidence in prior estimate. */ __CE=%str(0.2) /* Confidence in evidence and correlation adjustment. */ );

proc sql; create table OOSOutcomeSamp as select case when P_1 IS NULL then &__MinP else P_1 end as P_1, case when &__outcome IS NULL then 0.4 else &__outcome end as &__outcome, * from &__OOSOutcomes where ranuni(&__seed)< &__sampleRate ; quit; proc &__proc data=OOSOutcomeSamp %if &__proc=GENMOD % then %do; descending %end; ; model &__outcome= &__predictor %if &__proc=GENMOD %then %do; / link=logit dist=bin %end; %if &__Proc=GLMSELECT %then %do; / selection=lar %end; ; %if (&__freq^=) %then %do; freq &__freq; %end; ods output ParameterEstimates=&__parms ; run; quit;

/* Initial estimates of likelihood exp(a + bx)/exp(a) (full model with [b=0] -> restricted model) converted to probability estimates and assigned to outcome (missing where highly discriminating predictor missing). A new estimate of the probability of success for observations with non-missing predictors (and missing outcomes assigned a neutral value initially) becomes the probability of success in the next round of the estimation process. */ proc sql; create table Prs as /* Capture estimated probability from alpha, beta parameter estimates. */ select ((select Estimate from &__parms where UPCASE(Parameter)="INTERCEPT") + (&__predictor * (select Estimate from &__parms where UPCASE(Parameter)=UPCASE("&__predictor"))) ) as Pr1, /* Shrink estimates at extremes of range. */ calculated Pr1 /(max(calculated Pr1,(1.00001 - calculated Pr1))) as wPr1, /* Add estimate of increase of probability due to observation i, reduced by predictor correlation factor __CE, to prior estimate of probability, reduced by factor representing confidence in prior estimate. */ case when NOT (&__predictor IS NULL) then max(min((&__CP * &__priorEst) + (&__CE * calculated wPr1) ,1),0) else &__priorEst end as &__revEst, * from OOSOutcomeSamp ; quit;

%mend computeEstLR;

%computeEstLR(32117, %str(1), GLMSELECT, x, postPr, parms, __outcome=P_1, __priorEst=P_1, __OOSOutcomes=RMScores )

%computeEstLR(32119, %str(1), GLMSELECT, y, postPr, parms, __priorEst=P_1 ) %computeEstLR(32119, %str(1), GLMSELECT, z, postPr, parms, __priorEst=postPr ) %computeEstLR(32117, %str(1), GLMSELECT, x*y, postPr, parms, __priorEst=postPr ) %computeEstLR(33123, %str(1), GLMSELECT, y*z, postPr, parms, __priorEst=postPr ) %computeEstLR(33137, %str(1), GLMSELECT, x*z, postPr, parms, __priorEst=postPr ) %computeEstLR(33173, %str(1), GLMSELECT, x*y*z, postPr, parms, __priorEst=postPr )

Back to: Top of message | Previous page | Main SAS-L page