| Date: | Wed, 15 Jul 2009 14:07:09 -0400 |
| Reply-To: | Sigurd Hermansen <HERMANS1@WESTAT.COM> |
| Sender: | "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU> |
| From: | Sigurd Hermansen <HERMANS1@WESTAT.COM> |
| Subject: | Re: Treatment of Missing Variables - What Options |
| In-Reply-To: | <8490ba90907150729u6d114da5v519a260be6f42677@mail.gmail.com> |
| Content-Type: | text/plain; charset="us-ascii" |
CY:
The missing value problem has received a lot of attention in the survey world where non-response bias tends to be a major issue. The general solution is to impute missing values but adjust confidence intervals of estimates to reflect loss of information.
Several SAS-L statistics wizards much more adept than I at imputation have a series of discussions on that topic in the SAS-L Archives. Search on "Cassell" and "missing" (he has in fact been MIA from the 'L during the last year, but that's incidental). I doubt that your missing values are Missing At Random (MAR), as you might expect with some survey errors. Refusals to respond to sensitive questions should be assigned special missing values and treated as refusals, but that doesn't always happen.
SAS PROC MI and PROC MIANALYZE handle multiple imputation and statistical adjustments for imputation, but work better for lower levels of missing values. CART and other recursive partitioning methods use surrogate values for missing values.
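The statistical adjustment for imputation is usually described by Rubin's combining rules: average the estimates from the m completed data sets, then add the between-imputation variance to the average within-imputation variance. A minimal Python sketch of those rules follows. The function name pool_rubin and the numbers in the usage lines are hypothetical illustrations, not output from PROC MI or PROC MIANALYZE.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine m completed-data results with Rubin's rules.

    estimates: point estimates, one per imputed data set
    variances: squared standard errors, one per imputed data set
    Returns (pooled estimate, pooled standard error, degrees of freedom).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")

    q_bar = mean(estimates)        # pooled point estimate
    w = mean(variances)            # within-imputation variance
    b = variance(estimates)        # between-imputation (sample) variance
    t = w + (1 + 1 / m) * b        # total variance

    # Rubin's degrees of freedom; infinite when the imputations all agree.
    if b == 0:
        df = math.inf
    else:
        df = (m - 1) * (1 + w / ((1 + 1 / m) * b)) ** 2
    return q_bar, math.sqrt(t), df


# Hypothetical coefficients and squared standard errors from three imputations.
estimate, std_err, df = pool_rubin([0.30, 0.40, 0.50], [0.25, 0.25, 0.25])
print(estimate, std_err, df)
```

The pooled standard error is never smaller than the square root of the average within-imputation variance, which is how the loss of information from missing values shows up in wider confidence intervals.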
I'd recommend that you compare estimates after imputation with PROC MI to estimates computed from the sample with obs with missing values excluded. You might also derive a surrogate variable for unprotected sexual intercourse that composes data from several related variables to fill gaps. Comparisons of the results of different methods may help you answer your questions.
I haven't seen a straightforward and global method for handling missing value problems. Some argue that selection of observations and imputation should be part of a larger program that cleanses data systematically prior to and during analyses.
S
-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Chao Yawo
Sent: Wednesday, July 15, 2009 10:29 AM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Treatment of Missing Variables - What Options
Hello,
I am running a logistic regression model (using a Demographic and
Health surveys dataset) and realized a drastic reduction in my
sub-population size. I traced the problem to a variable with a lot of
missing cases. As you can see from the table below, this variable
elicits whether the respondent engaged in unprotected sexual
intercourse. About a third of the cases (33.78%) are missing.
V761 -- Last intercourse used condom
-----------------------------------------------------------
              |      Freq.    Percent      Valid       Cum.
--------------+--------------------------------------------
Valid   0 No  |       6012      56.16      84.81      84.81
        1 Yes |       1075      10.04      15.16      99.97
        9     |          2       0.02       0.03     100.00
        Total |       7089      66.22     100.00
Missing .     |       3617      33.78
Total         |      10706     100.00
-----------------------------------------------------------
According to the DHS (Demographic and Health Surveys):
A "missing value" is defined as a variable that should have a
response, but because of interview errors the question was not asked.
The general rule for the survey data processing is that under no
circumstances an answer should be made up. Instead, a missing value is
assigned in the data file (see:
http://www.measuredhs.com/accesssurveys/Data_quality_use.cfm#1).
So the missing values result from interview errors. It occurred to me
that most of the people who are missing on the condom use variable may
not be sexually active or may not have reached sexual debut. So I created
a new variable for condom use, assigning a value of 2 to those who are
Missing (V6=761_Miss), and crosstabulated it with the variable for those
who are Sexually Active (V536_R), with the following results:
                  | RECODE of V761_R (RECODE of V761
   RECODE of V536 | (Last intercourse used condom
   (Recent sexual | (See also SMV761)
        activity) |  Not Used       Used    Missing |     Total
------------------+---------------------------------+----------
NotSexuallyActive |         0          0      2,146 |     2,146
                  |      0.00       0.00      59.40 |     20.06
------------------+---------------------------------+----------
   SexuallyActive |     6,012      1,075      1,467 |     8,554
                  |    100.00     100.00      40.60 |     79.94
------------------+---------------------------------+----------
            Total |     6,012      1,075      3,613 |    10,700
                  |    100.00     100.00    100.00 |    100.00
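The arithmetic behind the crosstab can be checked with a short Python sketch. It uses only the cell counts shown in the table; the dictionary layout and variable names are illustrative assumptions. It also computes one figure the table does not print directly: the share of sexually active respondents who are still missing on condom use.

```python
# Cell counts copied from the crosstab (rows: recent sexual activity).
counts = {
    "NotSexuallyActive": {"not_used": 0, "used": 0, "missing": 2146},
    "SexuallyActive": {"not_used": 6012, "used": 1075, "missing": 1467},
}


def pct(part, whole):
    """Percentage of `whole` represented by `part`, to two decimals."""
    return round(100 * part / whole, 2)


missing_total = sum(row["missing"] for row in counts.values())
active_total = sum(counts["SexuallyActive"].values())

# Column percentage: share of the "Missing" column that is not sexually active.
missing_not_active = pct(counts["NotSexuallyActive"]["missing"], missing_total)

# Row percentage: share of sexually active respondents missing on condom use.
active_missing = pct(counts["SexuallyActive"]["missing"], active_total)

print(missing_total, active_total)         # 3613 8554
print(missing_not_active, active_missing)  # 59.4 17.15
```

Restricting the analysis to sexually active respondents therefore lowers the missing share from about a third of all cases to roughly 17 percent of that sub-population.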
Given that close to 60% of those who are "Missing" on the condom use
variable are not sexually active, I decided to check if there is a
strong/significant relationship between the missing value and my
dependent variable, subsetting those who are sexually active. I
created a variable called mis that records the missing values of my
offending variable, and regressed my dependent variable (V781_R) on
it, and got the following results:
logistic mis V781_R [pweight=weight], cluster(psu), if V536_R==1
(sum of wgt is 8.8262e+03)
Iteration 0: log pseudolikelihood = -3739.3157
Iteration 1: log pseudolikelihood = -3729.0254
Iteration 2: log pseudolikelihood = -3728.8988
Iteration 3: log pseudolikelihood = -3728.8988
Logistic regression                               Number of obs   =       8436
                                                  Wald chi2(1)    =       0.55
                                                  Prob > chi2     =     0.4590
Log pseudolikelihood = -3728.8988                 Pseudo R2       =     0.0028
                                   (Std. Err. adjusted for 357 clusters in psu)
------------------------------------------------------------------------------
             |               Robust
         mis |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      V781_R |    .383671   .5181314     0.74   0.459     -.631848     1.39919
       _cons |  -1.695747   .1268113   -13.37   0.000    -1.944293   -1.447202
------------------------------------------------------------------------------
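The z statistic, p-value, and confidence interval in the V781_R row can be reproduced from the coefficient and robust standard error alone. The Python sketch below does that under a normal approximation and also converts the coefficient to an odds ratio, which the output does not show. The variable names are illustrative, and the 1.959964 multiplier is the usual 97.5th percentile of the standard normal.

```python
import math

# Values copied from the V781_R row of the output.
coef = 0.383671
std_err = 0.5181314

# Wald z statistic and its two-sided p-value under a standard normal.
z = coef / std_err
p_value = math.erfc(abs(z) / math.sqrt(2))

# 95% confidence interval on the log-odds scale.
z_crit = 1.959964
ci_low = coef - z_crit * std_err
ci_high = coef + z_crit * std_err

# The same quantities expressed as odds ratios.
odds_ratio = math.exp(coef)
or_low, or_high = math.exp(ci_low), math.exp(ci_high)

print(round(z, 2), round(p_value, 3))         # 0.74 0.459
print(round(ci_low, 4), round(ci_high, 4))
print(round(odds_ratio, 2), round(or_low, 2), round(or_high, 2))
```

The interval on the odds-ratio scale spans values well below and well above 1, so the non-significant result reflects an imprecise estimate rather than evidence that missingness and the DV are unrelated.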
Given the non-significance of the variable, it does appear that the
errors are not related to my DV. In fact, the DV had only 161 missing
values. However, since the dependent variable in my model deals with HIV
risk, I need to include sexual risk variables such as V761 in the
model.
One option is to ignore the errors on that single IV, but
then I will have to accept the lower N (sample size) in my
analysis, and explain in my write-up that the changes in sample size
for the regression result from missing values on some of the
covariates.
Does this sound like a reasonable option? What other options do I have?
Thanks in advance for your help.
CY