Date: Fri, 26 Jan 2001 12:29:21 -0800
From: Dale McLerran
Sender: "SAS(r) Discussion"
Subject: Discretizing continuous vars (was Proc GLM)
To: peter.flom@ndri.org

Folks,

Let me put in my $.02. Discretizing a continuous variable for use as a predictor variable is a very common artifice in the epidemiological literature. This is usually performed so that the epidemiologist can make some statement about relative risks for some outcome, and convey the RR (or at least an approximation to the RR) in a simple manner to their colleagues. Now, it needs to be understood exactly what discretizing the continuous predictor variable actually does: it allows the user to fit a nonlinear curve to the data. Moreover, this nonlinear curve is a step function, discontinuous at the break points. This is an ugly model if I ever saw one. It says that the response is homogeneous within the (artificially) chosen intervals, and that from the end of one interval to the beginning of the next there is often a significant difference in the response.

Now, I ask whether it is reasonable to believe that dietary habits (consumption of fruits and vegetables, percent energy from fat) change dramatically from age 34 to age 35, or from age 59 to age 60. I really suspect not, but these are commonly employed models. I would have to agree with Peter that the risk for all kinds of poor outcomes related to low birth weight does not change dramatically from 1499 grams to 1500 grams. The risks are probably even greater if the infant weighs 1100 grams than if the infant weighs 1499 grams. And a child who weighs 1501 grams is probably at more risk for poor outcomes than a child who weighs 2300 grams.
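To make the step-function point concrete, here is a minimal Python sketch (not SAS, and not anything from the original analysis). Everything in it is an assumption made for illustration: the smooth risk curve, the cutpoints at 1500 and 2500 grams, and the helper names `fit_step_model` and `predict_step` are all hypothetical. It fits the model implied by discretizing, one mean response per interval, and compares the jump in fitted values at the 1500 gram cutpoint with the change in the underlying smooth curve over the same single gram.

```python
from bisect import bisect_right


def fit_step_model(x, y, cutpoints):
    """Fit the model implied by discretizing x: one mean response per interval."""
    sums = [0.0] * (len(cutpoints) + 1)
    counts = [0] * (len(cutpoints) + 1)
    for xi, yi in zip(x, y):
        k = bisect_right(cutpoints, xi)  # a value equal to a cutpoint joins the upper interval
        sums[k] += yi
        counts[k] += 1
    return [s / c if c else float("nan") for s, c in zip(sums, counts)]


def predict_step(means, cutpoints, x_new):
    """Fitted value under the discretized model: the mean of x_new's interval."""
    return means[bisect_right(cutpoints, x_new)]


def smooth_risk(w):
    """Hypothetical smooth risk curve, invented purely for this illustration."""
    return 1.0 / (1.0 + (w / 1500.0) ** 3)


# Synthetic, noise-free data: risk declines smoothly with birth weight (grams).
weights = list(range(500, 4001, 50))
risk = [smooth_risk(w) for w in weights]

cutpoints = [1500, 2500]  # hypothetical break points
means = fit_step_model(weights, risk, cutpoints)

# Fitted change across the cutpoint versus change in the underlying curve.
fitted_jump = predict_step(means, cutpoints, 1499) - predict_step(means, cutpoints, 1500)
true_change = smooth_risk(1499) - smooth_risk(1500)

print("fitted jump from 1499 g to 1500 g:", round(fitted_jump, 4))
print("change in smooth curve over the same gram:", round(true_change, 6))
print("same fitted risk at 1100 g and 1499 g:",
      predict_step(means, cutpoints, 1100) == predict_step(means, cutpoints, 1499))
```

Under these made-up inputs the discretized model assigns the same fitted risk to 1100 grams and 1499 grams, and places essentially all of the change between intervals at the cutpoint itself, even though the generating curve changes only slightly over that one gram.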

Now, I work with epidemiologists. I have fit many a regression model in which age has been discretized into 3 or 4 intervals. For simple presentation in epidemiological journals, these are the accepted standards. I will not protest too loudly that this should not be done, although I have tried to suggest alternatives to my colleagues. I have absolutely no doubt that the models which use discretized continuous variables are biased. There are likely very few circumstances in which a discontinuous response is reasonable. (I leave the door open for a few such outcomes. However, they do not regularly present themselves.)

I have lately been working with an epidemiologist who has had something of an epiphany regarding these issues. When he came to me, he had collaborated with another statistician in the use of flexible regression functions. In particular, for that collaboration they had employed Generalized Additive Models (GAMs). I am not a great fan of GAMs. When you are done fitting the model, can you state the regression equation? I don't believe that GAMs provide a simple expression. However, there are other tools which allow for flexible regression modelling and which yield functions with simple expressions. I had long thought that restricted cubic splines could be a very useful tool for modelling nonlinear (or suspected nonlinear) functions of continuous variables. We are currently using spline methods. Unlike GAMs, with splines you can plug in a value for some continuous predictor and directly get an estimated response. However, even though you may be able to return an estimate directly, it may still be difficult to convey the shape of the response without resorting to graphical methods. This is the direction in which I believe we ought to be headed with the modelling of the relationship between responses and continuous covariates: fit some sort of flexible regression and graphically display the fitted response.
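As an illustration of what "a simple expression" can look like, the following Python sketch builds a restricted cubic spline basis in a common truncated-power parameterization and evaluates an explicit regression equation from it. This is not the macro or the SAS code referred to in this message. The function names `rcs_basis` and `predict`, the knot locations, and the coefficients are all hypothetical; the coefficients are placeholders, not estimates from any data.

```python
def rcs_basis(x, knots):
    """Return [x, S_1(x), ..., S_{k-2}(x)] for a restricted cubic spline with k knots.

    Each S_j is a combination of truncated cubics chosen so that the cubic and
    quadratic terms cancel beyond the last knot, which makes the curve linear
    in both tails.
    """
    t = sorted(knots)
    k = len(t)
    if k < 3:
        raise ValueError("a restricted cubic spline needs at least 3 knots")

    def pos3(u):
        # Truncated cubic: u**3 when u is positive, otherwise 0.
        return u ** 3 if u > 0 else 0.0

    d = t[-1] - t[-2]  # distance between the last two knots
    terms = [float(x)]
    for j in range(k - 2):
        terms.append(
            pos3(x - t[j])
            - pos3(x - t[-2]) * (t[-1] - t[j]) / d
            + pos3(x - t[-1]) * (t[-2] - t[j]) / d
        )
    return terms


def predict(x, intercept, coefs, knots):
    """Evaluate the explicit regression equation: intercept + sum(beta_i * basis_i(x))."""
    return intercept + sum(b * s for b, s in zip(coefs, rcs_basis(x, knots)))


# Hypothetical knots for age (years) and placeholder coefficients.
age_knots = [30, 45, 60, 75]
intercept = 0.5
coefs = [0.02, -1.0e-5, 2.0e-5]  # one per basis term: x, S_1, S_2

for age in (25, 34, 35, 59, 60, 80):
    print(age, round(predict(age, intercept, coefs, age_knots), 4))
```

With k knots the equation has k - 1 slope terms, so it can be written down and evaluated at any value of the predictor. Conveying the shape of the curve, as noted above, is still easiest by plotting `predict` over a grid of values.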

For polytomous response models, I have developed a macro which will perform this work in (what I believe to be) a relatively easy to use package. I don't know that it is ready for prime time, but if there is interest in the use of the macro, I would be willing to share it.

>Date: Thu, 25 Jan 2001 13:22:13 -0500
>Reply-To: Peter Flom <peter.flom@NDRI.ORG>
>From: Peter Flom <peter.flom@NDRI.ORG>
>Subject: Re: Proc GLM
>To: SAS-L@LISTSERV.UGA.EDU
>
>>>> "Dennis G. Fisher" <dfisher@CSULB.EDU> 01/25/01 01:08PM >>>
>wrote
>
>>>>I have to weigh in on this one. Usually I would agree that ruining a
>>>perfectly good continuous variable by dichotomizing it is not a good
>>>thing to do and I once gave such advice to a grad student. It turned out
>>>that I was wrong. The variable was birthweight. This actually turned out
>>>to be a dichotomous variable, which is something I did not know at the
>>>time. Infants can be classified into low birth weight and non low
>>>birthweight. Low birth weight is a proxy (or perhaps an indicator) that
>>>there were problems with the pregnancy. So non-low birthweight infants
>>>mean that the indicators of lbw problems were not present. It does not
>>>mean that infants who are very heavy are somehow protected against
>>>these problems. In the case of this grad student, the infants should
>>>>have been classified into low birth weight and non low birthweight.
>>>Weight should not have been treated as a continuous variable. You
>>>have to understand the meaning of the variable before giving an opinion
>>>about the analysis. So I guess I agree with Dr. Kruse.
>
>Clearly, understanding the meaning of the variable before giving an opinion is vital, and I hesitate to argue with someone who knows so much more than I about statistics.
>
>However, it seems to me that even low birth weight is not a Yes/No variable.
>
>One classification I have seen is 1500 grams. But, dichotomizing at this point implies that a baby of 1499 grams is markedly different from one weighing 1501 grams. It seems to me that babies who weigh 1,000 grams would be at much more risk than those who weigh 1,500 grams, although I don't know the literature on the subject. I would suspect that, if one graphed "proportion of problem pregnancies" vs. "birth weight" the curve would asymptote at some point. So, one useful transformation of weight might be "weight below" the number at which the asymptote occurs.
>
>Does this make sense?
>
>Peter L. Flom, Ph.D.
>Principal Research Associate
>National Development and Research Institutes, Inc.
>2 World Trade Center
>16th floor
>New York, NY 10048
>
>(212) 845-4485
>(212) 845-4698 (fax)
>Peter.Flom@ndri.org

Dale

---------------------------------------
Dale McLerran
Fred Hutchinson Cancer Research Center
mailto: dmclerra@fhcrc.org
Ph: (206) 667-2926
Fax: (206) 667-5977
---------------------------------------
