Date: Fri, 2 Sep 2005 09:26:08 +0300
Reply-To: BoraYavuz@HSBC.COM.TR
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Bora Yavuz <BoraYavuz@HSBC.COM.TR>
Subject: On Hosmer-Lemeshow, etc. and Model Selection
Hi,
May I explain things first -- 'cos I know the wise stat guys out there
(David, Peter, ...) will beat me up for further info if I don't. :-)
--> I use PROC LOGISTIC for propensity / response modelling -- i.e., I aim
to predict who will respond to a specified marketing action. We normally
have a 0-1 response variable and use "dummy variable coding" to build (a)
scorecard(s) using logistic regression.
--> Our data sets typically contain 100 to 200k observations and around 500
columns. The response rates vary between 1% and 20%.
--> For reasons that are still unclear to me (unclear since we do not conduct
any surveys), we sometimes do "oversampling" too -- i.e., we do stratified
sampling on two or three variables (putting a cap on the number of
non-responders in some strata) in which case we use PROC LOGISTIC with the
WEIGHT statement.
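
To make the oversampling setup concrete, here is a minimal sketch in Python (not SAS) of how such sampling weights might be derived. It assumes the weight for each (stratum, response) cell is simply population count divided by sample count; the function name and the toy counts are hypothetical, not taken from our actual data.

```python
from collections import Counter


def sampling_weights(population, sample):
    """Return {(stratum, response): weight} with weight = N_population / n_sample.

    Both arguments are lists of (stratum, response) pairs.
    Assumption: every cell present in the sample also occurs in the population.
    """
    pop_counts = Counter(population)
    samp_counts = Counter(sample)
    return {cell: pop_counts[cell] / n for cell, n in samp_counts.items()}


# Toy data: stratum "A" has 1000 non-responders, capped at 100 in the sample;
# all 50 responders are kept.
population = [("A", 0)] * 1000 + [("A", 1)] * 50
sample = [("A", 0)] * 100 + [("A", 1)] * 50

weights = sampling_weights(population, sample)
print(weights)
```

In this toy case each sampled non-responder stands for ten population records, while responders keep a weight of one.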
My (zillions of) questions are as follows:
--> How seriously should I take the Hosmer-Lemeshow statistic in assessing
goodness-of-fit? Some folks say it does not mean anything in the case of
using sampling weights. Some folks say since all my variables (both
dependent and independent) are dummies, it is of little use if any at all.
I'm confused.
--> Similarly, is the "classification table" output meaningless in the case
of sampling weights? If so how can we adjust?
--> I ended up with one model (Model A) with a comfortable Hosmer-Lemeshow
statistic (p-value around .46) in the development sample and with Gini
values in the development and test samples being 68 and 69.16 respectively.
Also the scores align in both samples (though not very smoothly -- some
score bands have similar response rates). Another model (Model B) has a
Hosmer-Lemeshow p-value of ~.13 (still acceptable) and higher Gini values in
both development and test samples: 69 and 69.72 respectively. Also there is
some misalignment in both samples for low score bands. Yet another model
(Model C) fails the Hosmer-Lemeshow test (p-value < 0.01) and shows slight
misalignment in both samples too. However it has higher and closer Gini
values: 69.72 and 69.84. In complexity, Model C > Model B > Model A (">"
meaning "is more complex than"). Moreover, bootstrap estimates and
confidence intervals are more stable in C than in B, and in B than in A. So
how would you decide? (I'm sort of inclined to choose B.)
--> Since our purpose is prediction and we do not aim to explain anything,
why don't we just "throw in" as many variables as possible and / or choose
the AIC-optimal (or the AIC[3/2]-optimal) model? The papers I read on the
topic suggest including as many variables as possible and / or averaging all
alternative models' coefficients (or probability estimates). Is this true?
[By the way, I threw away the highly-correlated variables before attempting
model building.]
--> Again, since we do prediction, we do not have to worry about severe
multi-collinearity, or do we? [My boss insists we use PROC REG with the
TOL, VIF and COLLINOINT options to diagnose and later eliminate
multi-collinearity.]
--> And finally, should I use PROC SURVEYLOGISTIC rather than PROC
LOGISTIC with the WEIGHT statement? Is it any better? If so, what is the
"OUTEST=" equivalent in PROC SURVEYLOGISTIC (I couldn't find it in the
docs)?
--> What are your guidelines for model selection? Which criteria do you
consider most critical in choosing the "best" model for prediction? How
would you avoid overfitting or underfitting?
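
For reference, the two criteria compared above (Gini and Hosmer-Lemeshow) can be sketched as follows. This is a minimal, unweighted illustration in Python, not what PROC LOGISTIC computes internally. The function names and toy data are hypothetical, Gini is taken as 2*AUC - 1 on a 0-1 scale (so 68 above would be 0.68), and the Hosmer-Lemeshow grouping is a plain equal-count split. How sampling weights should enter either quantity is exactly what I am unsure about, so they are left out.

```python
def gini(y, p):
    """Gini = 2*AUC - 1, with AUC from pairwise comparisons (ties count 1/2).

    y: 0/1 responses, p: predicted probabilities (or scores).
    Quadratic in the sample size, so only suitable for small illustrations.
    """
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return 2.0 * wins / (len(pos) * len(neg)) - 1.0


def hosmer_lemeshow(y, p, groups=10):
    """Hosmer-Lemeshow chi-square statistic (unweighted).

    Sort by predicted probability, split into roughly equal-sized groups and
    sum (O - E)^2 / (n * pbar * (1 - pbar)) over the groups. The usual
    reference distribution is chi-square with groups - 2 degrees of freedom;
    the p-value is not computed here.
    """
    pairs = sorted(zip(p, y))
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        n_g = len(chunk)
        expected = sum(pi for pi, _ in chunk)   # sum of predicted probabilities
        observed = sum(yi for _, yi in chunk)   # number of actual responders
        mean_p = expected / n_g
        denom = n_g * mean_p * (1.0 - mean_p)
        if denom > 0:
            stat += (observed - expected) ** 2 / denom
    return stat


# Toy data, made up purely for illustration.
y_toy = [0, 0, 1, 0, 1, 1]
p_toy = [0.1, 0.2, 0.3, 0.4, 0.7, 0.9]
print(gini(y_toy, p_toy))
print(hosmer_lemeshow([0, 0, 1, 1], [0.25, 0.25, 0.75, 0.75], groups=2))
```

Gini only looks at how well the scores rank responders above non-responders, while Hosmer-Lemeshow looks at whether predicted and observed response counts agree within score groups, so a model can do well on one and poorly on the other.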
So many thanks in advance,
Bora Y.