LISTSERV at the University of Georgia
Date:         Fri, 2 Sep 2005 16:32:02 -0400
Reply-To:     Peter Flom <flom@NDRI.ORG>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Peter Flom <flom@NDRI.ORG>
Subject:      Re: On Hosmer-Lemeshow, etc. and Model Selection
Comments: To: topkatz@MSN.COM
Content-Type: text/plain; charset=US-ASCII

Peter L. Flom, PhD
Assistant Director, Statistics and Data Analysis Core
Center for Drug Use and HIV Research
National Development and Research Institutes
71 W. 23rd St
New York, NY 10010
www.peterflom.com
(212) 845-4485 (voice)
(917) 438-0894 (fax)

>>> Talbot Michael Katz <topkatz@MSN.COM> 09/02/05 4:13 PM >>> wrote <<< I really, REALLY like this thread, and hope it won't die a premature Labor Day Weekend death. >>>

I will be checking e-mails....

<<< Peter -- Not a guru? Howzabout we'll call you "Guru, Jr." for now. So, let's talk about categorizing continuous variables. While I nod my head in vigorous agreement with all the drawbacks you point out, I am still an enthusiastic advocate of discretization, and I'm surprised that you, as a logistic modeler, are not. I think of discretization as the logistic modeler's secret weapon (when used wisely, of course). What it comes down to is finding non-monotone response. Let's look at the example that you gave about age grouping and heart attacks. I don't know the data at all, but I'm guessing that the probability of having a heart attack pretty much increases with age, you know, wear and tear and all that. But, perhaps there are other forces at work... people with weak hearts will die earlier and stronger people will live longer, so their probability may decrease after a certain age. Suppose the percentage of heart attacks in the 45-54 population is 10%, and then it goes up to 20% in the 55-64 group and back down to 10% in the 65-74. Well, then, untransformed age might not show up as predictive for heart attacks in a logistic model, but put it into these bins, and bingo! Okay, you might contend that trees are better for non-monotone response capture, but even there you can increase your chances with well-placed binning (of course, you have to be very careful to avoid overfitting). This is something I've spent a lot of time thinking about and looking at, and I'm curious about your views (and anyone else's). >>>

Nonmonotone relations can certainly exist. But I don't think binning is the way to get at them. That's what quadratic terms are for. Or cubic splines, if you want to get fancy.

Binning, as I noted earlier, supposes that something 'magic' happens at the bin points. But, in most cases, it doesn't work like that. Things change smoothly. Quadratic terms and splines capture this smoothness; bins don't. Suppose that the peak of risk is in the middle of a bin? What then? Change the bin just to fit the data? Well, that MIGHT be okay if you are using one data set to build the model and another to test it, but it just doesn't make sense... that's not how things work.
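
To make the contrast concrete, here is a minimal sketch in Python (not SAS, and not taken from any real data set). Everything in it is an assumption for illustration: the risk curve is made up, with log-odds that are quadratic in age and peak at 59.5; the "data" are the expected proportions at each age rather than simulated 0/1 outcomes; and the helper names solve and fit_logistic are hypothetical, a bare-bones Newton-Raphson fit rather than what PROC LOGISTIC does internally. It fits age alone, then age plus age squared, and also computes the average risk in the three ten-year bins from the example above.

```python
import math

def solve(a, b):
    """Solve the small linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x

def fit_logistic(X, y, iters=25):
    """Newton-Raphson logistic fit; y may hold proportions in [0, 1]."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, xi))))
            w = p * (1.0 - p)
            for a in range(k):
                grad[a] += (yi - p) * xi[a]
                for c in range(k):
                    hess[a][c] += w * xi[a] * xi[c]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
    return beta

# Made-up smooth risk curve (an assumption, not real heart-attack data).
ages = list(range(45, 75))
z = [(a - 59.5) / 10.0 for a in ages]          # centered, rescaled age
TRUE_B0, TRUE_B2 = -1.4, -0.5                  # log-odds = B0 + B2 * z^2
risk = [1.0 / (1.0 + math.exp(-(TRUE_B0 + TRUE_B2 * v * v))) for v in z]

# Age alone versus age plus age squared.
linear_beta = fit_logistic([[1.0, v] for v in z], risk)
quad_beta = fit_logistic([[1.0, v, v * v] for v in z], risk)

# Age at which the fitted quadratic log-odds curve peaks.
peak_age = 59.5 - 10.0 * quad_beta[1] / (2.0 * quad_beta[2])

# Average risk within each ten-year bin: one flat value per bin.
bin_rates = {}
for lo in (45, 55, 65):
    members = [r for a, r in zip(ages, risk) if lo <= a < lo + 10]
    bin_rates[f"{lo}-{lo + 9}"] = sum(members) / len(members)

print("linear slope:", linear_beta[1])
print("quadratic coefficients:", quad_beta)
print("fitted peak age:", peak_age)
print("bin averages:", bin_rates)
```

Under these assumptions the slope for age alone comes out essentially zero, which is the "untransformed age might not show up" situation. The quadratic fit recovers the curve and puts the peak at 59.5, in the middle of the 55-64 bin. The bins do show the rise and fall, but they assign one flat risk to everyone from 55 through 64 and cannot say where inside the bin the peak lies.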

OK, there are SOME cases where things SHOULD be binned. For example, something significant happens to driving behavior when you're old enough to have a license - but even here, it should not be binned by AGE, since the licensing age varies state by state and country by country. Or, there was an interesting thing I saw that binned education into 'completed' vs. 'stopped in middle' and found a big difference in depression scores. ('Stopped in middle' could be the middle of HS, the middle of university, before your orals, etc.)

But usually things vary smoothly.

Peter

