Date: Sat, 4 Dec 2004 12:21:19 -0800
Reply-To: Dale McLerran <stringplayer_2@YAHOO.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Dale McLerran <stringplayer_2@YAHOO.COM>
Subject: Re: USING GENMOD to impute MISSING values
In-Reply-To: <OFC3C0678D.902829C8-ON88256F5E.007EADDD-88256F5E.00819347@epamail.epa.gov>
Content-Type: text/plain; charset=us-ascii
--- "David L. Cassell" <cassell.david@EPAMAIL.EPA.GOV> wrote:
> "Elmaache, Hamani" <Hamani.Elmaache1@CCRA-ADRC.GC.CA> wrote:
> > I'm trying to impute missing values in a data set. I have 34 missing
> > values for the variable S3Q7, but no missing values for the variable
> > AGECAT1 (see the PROC FREQ results below).
> > I am trying to use PROC GENMOD to get predicted values for S3Q7 (or,
> > more precisely, the probabilities of its values 1, 2, 3, 4, 5).
> > I cannot get the following code to do what I want:
> >
> >
> > proc genmod DATA=recoded ;
> > class AGECAT1;
> > model s3q7 = AGECAT1 / dist=multinomial;
> > output out = miss pred = Predit;
> > run;
> >
> > The output data set MISS contains 4 times as many observations as
> > the input data. I do not understand why.
>
> First, you should not be doing single imputation. Period.
>
> Imputing the data gives you an extra set of values WHICH DO NOT HAVE
> THE SAME VARIANCE STRUCTURE AS THE ORIGINAL DATA! This is a very bad
> thing. Single imputation also leaves you with no way of separating
> out the effects of filling in 'holes' with the effects of fitting the
> intended model.
>
> If you want to impute those data, then first you must examine the
> data and decide whether the reason for the data going missing is
> such that it is reasonable to treat those records as even coming
> from the same population as the complete records. If not, then
> you need to be very wary of using the data you have to impute the
> data you're missing - after all, you just concluded that you don't
> have the right population to work from!
>
> If you conclude that imputation IS reasonable, then you should be
> using PROC MI to do multiple imputation, followed by PROC MIANALYZE
> to analyze your analyses and estimate the consequences of your
> imputation efforts. At this point, the structure of your missing
> data must be studied. Is it monotone missing? Can it be turned
> into monotone missing data using MCMC to impute a few holes? Look
> at the documentation for PROC MI and see how well it fits with what
> you have.
>
> HTH,
> David
> --
David provides such solid advice that it is difficult to find
opportunities to add to what he states or to disagree with any
portion of his advice. I will do a bit of both here.

What I would add is more of a question. It is not clear to
me that imputation is even necessary here, based on what has been
presented. If we have only two variables (age and s3q7, with
some values of s3q7 missing) and we want to look at their
bivariate relationship, then we can be in either of two
situations: 1) the values of age differ between those who have
missing s3q7 and those who have nonmissing s3q7, or 2) the
values of age do not differ (in any meaningful way) between
those same two groups. Now, if the values of age differ between
the two groups so that these two groups represent different
populations, then you probably do not have enough
information available to you to be able to impute the values of
s3q7. David implied as much in the statements that he made,
and on this point I wholeheartedly agree with David's comments.
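
One way to see which of the two situations applies is to cross-tabulate
a missingness indicator against the age categories. The following is
only a sketch: the data set name mydata, the variable name agecat, and
the indicator s3q7_missing are placeholder names, and a chi-square test
is just one reasonable way to compare the two age distributions.

```sas
/* Flag records with missing s3q7 (1 = missing, 0 = observed). */
data check;
  set mydata;                       /* placeholder data set name */
  s3q7_missing = missing(s3q7);
run;

/* Compare the age distribution between the two groups. */
proc freq data=check;
  tables s3q7_missing*agecat / chisq;
run;
```

With only 34 records missing s3q7, the test will have limited power, so
the row percentages themselves deserve as much attention as the p-value.
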
Now, if the age distribution is comparable for those who have
missing s3q7 and those having nonmissing s3q7 and the analysis
that you wish to perform is of the relationship between these two
variables only, then I don't believe that imputation will
provide any benefit whatsoever. Imputation of s3q7 from age
only contains the information which was present about the
relationship between s3q7 and age to begin with. The imputation
will not add information about the relationship between age and
s3q7. You will have additional observations, yes. But those
observations have the uncertainty that is in the observed data.
We don't know the truth for a specific observation, but only
that there is a probability distribution for the value of s3q7.
Thus, when you properly analyze the bivariate relationship between
age and s3q7 employing multiple imputation and the uncertainty
of the imputed values, you should end up with what you had among
those for whom the values of s3q7 were nonmissing. If you were
to look at the relationship between s3q7 and some other variable,
then the imputed data would yield some benefit. However, I get
the impression that you are really just looking at the bivariate
relationship between age and s3q7. Is that correct?

Now, if there is some reason to employ imputation, then I will
disagree with David on the imputation procedure to use. All of
the imputation procedures available with PROC MI are parametric
imputation procedures. In the present situation, I believe that
a nonparametric approach would be much better. I would be very
leery of any of the assumptions employed in the parametric
imputation methods. Fortunately, a nonparametric imputation
process would be quite easy to implement. For each of the three
age categories, you have an observed distribution of s3q7. That
is, from a simple crosstabulation, one can estimate the probability
of s3q7 being 1, 2, 3, 4, or 5 within each age category. In other
words, you have the table

                  AgeCat
  S3q7 |   1    |   2    |   3    |
  -----+--------+--------+--------+
    1  |  p1_1  |  p2_1  |  p3_1  |
    2  |  p1_2  |  p2_2  |  p3_2  |
    3  |  p1_3  |  p2_3  |  p3_3  |
    4  |  p1_4  |  p2_4  |  p3_4  |
    5  |  p1_5  |  p2_5  |  p3_5  |
  -----+--------+--------+--------+
  Sum  |  1.00  |  1.00  |  1.00  |

Note that pI_J is the probability that s3q7 is in level J given
that age is in level I. Now, given the probability distribution
for those with nonmissing s3q7, you can write a data step to
perform the imputation. Thus, we could write something like
the following:
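
  data imputed(rename=(response=s3q7));
    set mydata;  /* assumed to carry p1_1-p3_4 on every record */
    do imputation=1 to 10;  /* generate 10 imputation sets */
      if s3q7>.Z then response=s3q7;  /* observed value: keep it */
      else do;
        /* RANTBL takes a seed as its first argument (12345 is     */
        /* arbitrary). With four probabilities listed, level 5 is  */
        /* returned with the remaining probability, 1 - their sum. */
        if agecat=1 then response=rantbl(12345,p1_1,p1_2,p1_3,p1_4); else
        if agecat=2 then response=rantbl(12345,p2_1,p2_2,p2_3,p2_4); else
        if agecat=3 then response=rantbl(12345,p3_1,p3_2,p3_3,p3_4);
      end;
      output;
    end;
    drop s3q7;
  run;

  proc sort data=imputed;
    by imputation;
  run;
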
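
The data step above takes the probabilities as given. As a rough sketch
of where they might come from, the conditional distribution of s3q7
within each age category can be pulled from PROC FREQ and merged back
onto the data. The data set names freqs, probs, and wide are
hypothetical, the seed 12345 is arbitrary, and the code assumes that
all five levels of s3q7 occur somewhere in the observed data. Here the
probabilities arrive as p1-p5 on each record, so a single RANTBL call
replaces the three IF branches.

```sas
/* Row percentages of s3q7 within agecat, using observed s3q7 only. */
/* SPARSE keeps zero-count cells so that every level gets a value.  */
proc freq data=mydata(where=(s3q7>.Z)) noprint;
  tables agecat*s3q7 / outpct sparse out=freqs;
run;

/* Convert percentages to proportions. */
data probs;
  set freqs;
  prob = pct_row/100;
run;

/* One record per age category, with variables p1-p5. */
proc transpose data=probs out=wide(drop=_name_) prefix=p;
  by agecat;
  id s3q7;
  var prob;
run;

proc sort data=mydata;
  by agecat;
run;

/* Attach the probabilities and impute 10 times per record. */
data imputed(rename=(response=s3q7));
  merge mydata wide;
  by agecat;
  do imputation=1 to 10;
    if s3q7>.Z then response=s3q7;
    else response=rantbl(12345, p1, p2, p3, p4); /* 5 = remainder */
    output;
  end;
  drop s3q7 p1-p5;
run;

proc sort data=imputed;
  by imputation;
run;
```

Note that this draws from the estimated probabilities as if they were
known exactly; it does not propagate the sampling uncertainty in the
probabilities themselves.
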
Now you can run the appropriate analytic procedure employing the
dataset IMPUTED with BY variable processing, saving the necessary
parameters for final analysis with the procedure MIANALYZE.
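
As an illustration only, suppose the quantity of interest were the mean
of s3q7 (treated as a numeric score) within each age category. That
choice of analysis is an assumption made for the sake of the example,
as are the data set names imp2 and est. One detail that matters in
practice: PROC MIANALYZE identifies the imputations through a variable
named _Imputation_, so the variable imputation is renamed here.

```sas
/* Sort by age category and imputation; rename for MIANALYZE. */
proc sort data=imputed out=imp2(rename=(imputation=_Imputation_));
  by agecat imputation;
run;

/* Estimate and standard error within each imputation. */
proc means data=imp2 noprint;
  by agecat _Imputation_;
  var s3q7;
  output out=est mean=s3q7 stderr=se_s3q7;
run;

/* Combine the 10 sets of estimates within each age category. */
proc mianalyze data=est;
  by agecat;
  modeleffects s3q7;
  stderr se_s3q7;
run;
```

For a model-based analysis, the same pattern applies: run the modeling
procedure BY _Imputation_, save the parameter estimates, and pass them
to PROC MIANALYZE.
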
HTH,
Dale
=====
---------------------------------------
Dale McLerran
Fred Hutchinson Cancer Research Center
mailto: dmclerra@NO_SPAMfhcrc.org
Ph: (206) 667-2926
Fax: (206) 667-5977
---------------------------------------