Date: Fri, 2 Sep 2005 16:32:02 -0400
Reply-To: Peter Flom <flom@NDRI.ORG>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Peter Flom <flom@NDRI.ORG>
Subject: Re: On Hosmer-Lemeshow, etc. and Model Selection
Content-Type: text/plain; charset=US-ASCII
Peter L. Flom, PhD
Assistant Director, Statistics and Data Analysis Core
Center for Drug Use and HIV Research
National Development and Research Institutes
71 W. 23rd St
New York, NY 10010
www.peterflom.com
(212) 845-4485 (voice)
(917) 438-0894 (fax)
>>> Talbot Michael Katz <topkatz@MSN.COM> 09/02/05 4:13 PM >>> wrote
<<<
I really, REALLY like this thread, and hope it won't die a premature
Labor Day Weekend death.
>>>
I will be checking e-mails....
<<<
Peter -- Not a guru? Howzabout we'll call you "Guru, Jr." for now. So,
let's talk about categorizing continuous variables. While I nod my head
in vigorous agreement with all the drawbacks you point out, I am still
an enthusiastic advocate of discretization, and I'm surprised that you,
as a logistic modeler, are not. I think of discretization as the logistic
modeler's secret weapon (when used wisely, of course). What it comes
down to is finding non-monotone response. Let's look at the example
that you gave about age grouping and heart attacks. I don't know the
data at all, but I'm guessing that the probability of having a heart
attack pretty much increases with age, you know, wear and tear and all
that. But, perhaps there are other forces at work... people with weak
hearts will die earlier and stronger people will live longer, so their
probability may decrease after a certain age. Suppose the percentage of
heart attacks in the 45-54 population is 10%, and then it goes up to 20%
in the 55-64 group and back down to 10% in the 65-74. Well, then,
untransformed age might not show up as predictive for heart attacks in a
logistic model, but put it into
these bins, and bingo! Okay, you might contend that trees are better
for non-monotone response capture, but even there you can increase your
chances with well-placed binning (of course, you have to be very careful
to avoid overfitting). This is something I've spent a lot of time
thinking about and looking at, and I'm curious about your views (and
anyone else's).
>>>
Nonmonotone relations can certainly exist. But I don't think binning is
the way to get at them. That's what quadratic terms are for. Or cubic
splines, if you want to get fancy.
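To make that concrete, here is a rough sketch in Python using Talbot's hypothetical 10% / 20% / 10% figures. The bin midpoints (50, 60, 70) and the centering of age at 60 are my own assumptions, and this works on the logits of the group rates rather than fitting a real logistic model to individual records. It only shows that a straight line in age sees nothing, while one added quadratic term recovers the rise and fall.

```python
import math

def logit(p):
    return math.log(p / (1.0 - p))

def inv_logit(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical rates from the quoted example; the midpoints are an assumption.
ages = [50.0, 60.0, 70.0]
rates = [0.10, 0.20, 0.10]

# Center and scale age so x is -1, 0, 1 and the arithmetic stays readable.
x = [(a - 60.0) / 10.0 for a in ages]
y = [logit(p) for p in rates]

# Least-squares slope of a straight line on the logit scale (x has mean zero).
slope_linear = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

# Quadratic through the three points: logit(p) = b0 + b1*x + b2*x^2.
b0 = y[1]
b1 = (y[2] - y[0]) / 2.0
b2 = (y[2] + y[0]) / 2.0 - y[1]

def p_quadratic(age):
    xa = (age - 60.0) / 10.0
    return inv_logit(b0 + b1 * xa + b2 * xa * xa)

# Age at which the fitted quadratic peaks (vertex of the parabola).
peak_age = 60.0 - 10.0 * b1 / (2.0 * b2)

print("linear slope on logit scale:", slope_linear)
print("quadratic coefficient:", b2)
print("fitted risk at 50, 60, 70:", [p_quadratic(a) for a in ages])
print("estimated peak age:", peak_age)
```

The linear slope comes out at zero because the two outer groups balance each other, which is exactly Talbot's point about untransformed age. The quadratic coefficient is negative, so the same continuous variable picks up the hump without any cut points.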
Binning, as I noted earlier, supposes that something 'magic' happens at
the bin points. But, in most cases, it doesn't work like that. Things
change smoothly. Quadratic terms and splines capture this smoothness;
bins don't. Suppose that the peak of risk is in the middle of a bin?
What then? Change the bin just to fit the data - well, that MIGHT be
okay if you are using one data set to build the model and another to
test it, but it just doesn't make sense... that's not how things work.
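A small Python sketch of the mid-bin problem, under made-up assumptions: the smooth risk curve, its peak at age 60, and the 20-year bins are all hypothetical numbers chosen for illustration, not estimates from any data.

```python
import math

def true_risk(age):
    # Hypothetical smooth risk curve peaking at age 60 (illustrative only).
    return 1.0 / (1.0 + math.exp(1.4 + 0.008 * (age - 60.0) ** 2))

# Hypothetical bins as half-open intervals [lo, hi); the peak sits mid-bin.
bins = [(30, 50), (50, 70), (70, 90)]

def bin_mean_risk(lo, hi):
    # Average the true risk over whole years of age in the bin.
    ages = range(lo, hi)
    return sum(true_risk(a) for a in ages) / len(ages)

bin_risk = {b: bin_mean_risk(*b) for b in bins}

def binned_risk(age):
    # A binned model assigns one risk to everyone in the same bin.
    for lo, hi in bins:
        if lo <= age < hi:
            return bin_risk[(lo, hi)]
    raise ValueError("age outside the binned range")

# Compare the smooth curve with the step function inside the middle bin.
for age in (50, 55, 60, 65, 69):
    print(age, round(true_risk(age), 3), round(binned_risk(age), 3))
```

Under these assumptions the binned model gives a 50-year-old and a 60-year-old the same risk, even though the curve that generated the numbers has risk roughly doubling between those ages. Moving the cut points changes the answer, which is the 'magic at the bin points' problem.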
OK, there are SOME cases where things SHOULD be binned. For example,
something significant happens to driving behavior when you're old
enough to have a license - but even here, it should not be binned by
AGE, since the age varies state by state and country by country. Or,
there was an interesting thing I saw that binned education into
'completed' vs. 'stopped in middle' and found a big difference in
depression scores. ('Stopped in middle' could be middle of HS, middle of
university, before your orals, etc.)
But usually things vary smoothly.
Peter