LISTSERV at the University of Georgia
Date:         Sat, 4 Dec 2004 12:21:19 -0800
Reply-To:     Dale McLerran <stringplayer_2@YAHOO.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
Comments:     DomainKeys? See http://antispam.yahoo.com/domainkeys
From:         Dale McLerran <stringplayer_2@YAHOO.COM>
Subject:      Re: USING  GENMOD to impute MISSING values
In-Reply-To:  <OFC3C0678D.902829C8-ON88256F5E.007EADDD-88256F5E.00819347@epamail.epa.gov>
Content-Type: text/plain; charset=us-ascii

--- "David L. Cassell" <cassell.david@EPAMAIL.EPA.GOV> wrote:

> "Elmaache, Hamani" <Hamani.Elmaache1@CCRA-ADRC.GC.CA> wrote:
> > I'm trying to impute missing values in some data set; I have 34 missing
> > values for the variable S3Q7, but in variable AGECAT1 no missing values
> > ( here below, results of Proc freq ).
> > I try to use Proc GENMOD to get predicted values for S3Q7( or more
> > precisely, the probability of its values: 1, 2, 3, 4, 4, 5)
> > Using the following CODE, I can not cope with it:
> >
> > proc genmod DATA=recoded ;
> > class AGECAT1;
> > model s3q7 = AGECAT1 / dist=multinomial;
> > output out = miss pred = Predit;
> > run;
> >
> > In the output miss 4 times the data. I did not understand why.
>
> First, you should not be doing single imputation. Period.
>
> Imputing the data gives you an extra set of values WHICH DO NOT HAVE
> THE SAME VARIANCE STRUCTURE AS THE ORIGINAL DATA! This is a very bad
> thing. Single imputation also leaves you with no way of separating
> out the effects of filling in 'holes' with the effects of fitting the
> intended model.
>
> If you want to impute those data, then first you must examine the
> data and decide whether the reason for the data going missing is
> such that it is reasonable to treat those records as even coming
> from the same population as the complete records. If not, then
> you need to be very wary of using the data you have to impute the
> data you're missing - after all, you just concluded that you don't
> have the right population to work from!
>
> If you conclude that imputation IS reasonable, then you should be
> using PROC MI to do multiple imputation, followed by PROC MIANALYZE
> to analyze your analyses and estimate the consequences of your
> imputation efforts. At this point, the structure of your missing
> data must be studied. Is it monotone missing? Can it be turned
> into monotone missing data using MCMC to impute a few holes? Look
> at the documentation for PROC MI and see how well it fits with what
> you have.
>
> HTH,
> David
> --

David provides such solid advice that it is difficult to find opportunities to add to what he states or to disagree with any portion of his advice. I will do a bit of both here.

What I would add is more of a question. It is not clear to me that imputation is even necessary here, based on what has been presented. If we have only two variables (age and s3q7, with some values of s3q7 missing) and we want to look at their bivariate relationship, then we can be in either of two situations: 1) the values of age differ between those who have missing s3q7 and those who have nonmissing s3q7, or 2) the values of age do not differ (in any meaningful way) between those same two groups. Now, if the values of age differ between the two groups so that these two groups represent different populations, then you probably do not have enough information available to you to be able to impute the values of s3q7. David implied as much in the statements that he made, and on this point I wholeheartedly agree with David's comments.
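One way to see which of the two situations applies is to crosstabulate a missingness flag against age category. The following is only a minimal sketch: the dataset name mydata and the variable agecat follow the data step later in this message, and the names check and s3q7_missing are made up for illustration.

```sas
/* Flag the records where s3q7 is missing */
data check;
   set mydata;
   s3q7_missing = missing(s3q7);   /* 1 if s3q7 is missing, 0 otherwise */
run;

/* Row percentages give the age distribution within each group;
   CHISQ adds a test of association between missingness and age */
proc freq data=check;
   tables s3q7_missing*agecat / chisq nopercent nocol;
run;
```

With only 34 missing values some expected cell counts may be small, so the row percentages are probably more informative than the chi-square p-value.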

Now, if the age distribution is comparable for those who have missing s3q7 and those having nonmissing s3q7 and the analysis that you wish to perform is of the relationship between these two variables only, then I don't believe that imputation will provide any benefit whatsoever. Imputation of s3q7 from age only contains the information which was present about the relationship between s3q7 and age to begin with. The imputation will not add information about the relationship between age and s3q7. You will have additional observations, yes. But those observations have the uncertainty that is in the observed data. We don't know the truth for a specific observation, but only that there is a probability distribution for the value of s3q7. Thus, when you properly analyze the bivariate relationship between age and s3q7 employing multiple imputation and the uncertainty of the imputed values, you should end up with what you had among those for whom the values of s3q7 were nonmissing. If you were to look at the relationship between s3q7 and some other variable, then the imputed data would yield some benefit. However, I get the impression that you are really just looking at the bivariate relationship between age and s3q7. Is that correct?

Now, if there is some reason to employ imputation, then I will disagree with David on the imputation procedure to use. All of the imputation procedures available with PROC MI are parametric imputation procedures. In the present situation, I believe that a nonparametric approach would be much better. I would be very leery of any of the assumptions employed in the parametric imputation methods. Fortunately, a nonparametric imputation process would be quite easy to implement. For each of the three age categories, you have an observed distribution of s3q7. That is, from a simple crosstabulation, one knows the probability of s3q7 being in 1, 2, 3, 4, or 5 within each age category. You have the table

                AgeCat
 S3q7 |   1   |   2   |   3   |
      -------------------------
   1  | p1_1  | p2_1  | p3_1  |
      -------------------------
   2  | p1_2  | p2_2  | p3_2  |
      -------------------------
   3  | p1_3  | p2_3  | p3_3  |
      -------------------------
   4  | p1_4  | p2_4  | p3_4  |
      -------------------------
   5  | p1_5  | p2_5  | p3_5  |
      -------------------------
        1.00    1.00    1.00

Note that pI_J is the probability that s3q7 is in level J given that age is in level I. Now, given the probability distribution for those with nonmissing s3q7, you can write a data step to perform the imputation. Thus, we could write something like the following:

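data imputed(rename=(response=s3q7));
   /* mydata must carry p1_1-p3_4 from the table above on every row */
   set mydata;
   do imputation=1 to 10;   /* generate 10 imputation sets */
      if s3q7>.Z then response=s3q7;   /* observed value: keep it */
      else do;
         /* rantbl needs a seed as its first argument (20041204 is
            arbitrary). Given four probabilities it returns 1-4, and
            5 with the remaining probability. */
         if agecat=1 then response=rantbl(20041204,p1_1,p1_2,p1_3,p1_4);
         else if agecat=2 then response=rantbl(20041204,p2_1,p2_2,p2_3,p2_4);
         else if agecat=3 then response=rantbl(20041204,p3_1,p3_2,p3_3,p3_4);
      end;
      output;
   end;
   drop s3q7;
run;

proc sort data=imputed;
   by imputation;
run;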
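The data step above takes the probabilities as given. One possible way to estimate them from the complete records and attach them to every row is sketched below. The names cells, probs, mydata_sorted, with_probs and p_1-p_5 are my own for illustration, and the sketch assumes s3q7 is an unformatted numeric variable coded 1-5 with every level observed in every age category.

```sas
/* Crosstabulate the complete records; OUTPCT adds PCT_ROW,
   the percentage of each s3q7 level within an age category */
proc freq data=mydata(where=(not missing(s3q7))) noprint;
   tables agecat*s3q7 / out=cells outpct;
run;

/* Convert the row percentage to a probability */
data cells;
   set cells;
   p = pct_row/100;   /* estimate of P(s3q7=J | agecat=I) */
run;

/* One row per age category with variables p_1-p_5 */
proc transpose data=cells out=probs(drop=_name_) prefix=p_;
   by agecat;
   id s3q7;
   var p;
run;

/* Attach the probabilities to every record */
proc sort data=mydata out=mydata_sorted;
   by agecat;
run;

data with_probs;
   merge mydata_sorted probs;
   by agecat;
run;
```

Because each row of with_probs carries only the probabilities for its own age category, the imputation step could then read with_probs and call rantbl(seed,p_1,p_2,p_3,p_4) directly, with no IF on agecat.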

Now you can run the appropriate analytic procedure employing the dataset IMPUTED with BY variable processing, saving the necessary parameters for final analysis with the procedure MIANALYZE.
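As an illustration of the mechanics only, here is a sketch that estimates the mean of s3q7 in each imputation set and combines the results. Treating s3q7 as a numeric score is an assumption made for the example, and the names est, s3q7_mean and s3q7_se are hypothetical. The rename to _Imputation_ is a precaution, since that is the name PROC MI itself produces.

```sas
/* One estimate and one standard error per imputation set */
proc means data=imputed noprint;
   by imputation;
   var s3q7;
   output out=est mean=s3q7_mean stderr=s3q7_se;
run;

/* Combine the 10 estimates into one with a standard error that
   reflects between-imputation variability */
proc mianalyze data=est(rename=(imputation=_Imputation_));
   modeleffects s3q7_mean;
   stderr s3q7_se;
run;
```

For a regression-type analysis the same pattern would apply, with the modeling procedure run BY imputation and its parameter estimates and covariances passed to MIANALYZE.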

HTH,

Dale

===== --------------------------------------- Dale McLerran Fred Hutchinson Cancer Research Center mailto: dmclerra@NO_SPAMfhcrc.org Ph: (206) 667-2926 Fax: (206) 667-5977 ---------------------------------------
