LISTSERV at the University of Georgia
Date:         Fri, 2 Sep 2005 09:26:08 +0300
Reply-To:     BoraYavuz@HSBC.COM.TR
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Bora Yavuz <BoraYavuz@HSBC.COM.TR>
Subject:      On Hosmer-Lemeshow, etc. and Model Selection
Content-Type: text/plain; charset=Windows-1254

Hi,

May I explain things first -- 'cos I know the wise stat guys out there (David, Peter, ...) will beat me up for further info if I don't. :-)

--> I use PROC LOGISTIC for propensity / response modelling -- i.e., I aim to predict who will respond to a specified marketing action. We normally have a 0-1 response variable and use "dummy variable coding" to build (a) scorecard(s) using logistic regression.

--> Our data sets typically contain 100k to 200k observations and around 500 columns. The response rates vary between 1% and 20%.

--> For reasons that are still unclear to me (unclear since we do not conduct any surveys), we sometimes do "oversampling" too -- i.e., we do stratified sampling on two or three variables (putting a cap on the number of non-responders in some strata), in which case we use PROC LOGISTIC with the WEIGHT statement.
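
To make the weighting step concrete, here is a minimal Python sketch of inverse-sampling-fraction weights per stratum. The function name, the stratum labels and the counts are all hypothetical illustrations; this is only one common way of building such weights, not necessarily what our process does.

```python
def sampling_weights(population_counts, sample_counts):
    """Weight for each stratum = population size / sampled size (assumed scheme)."""
    weights = {}
    for stratum, n_pop in population_counts.items():
        n_samp = sample_counts[stratum]
        if n_samp <= 0:
            raise ValueError(f"stratum {stratum!r} has no sampled rows")
        weights[stratum] = n_pop / n_samp
    return weights


# Hypothetical example: all responders kept, non-responders capped at 2000.
population = {"responder": 500, "non-responder": 9500}
sample = {"responder": 500, "non-responder": 2000}
w = sampling_weights(population, sample)

# The weighted response rate in the sample recovers the population rate.
weighted_rate = (sample["responder"] * w["responder"]) / sum(
    sample[s] * w[s] for s in sample
)
```

With these made-up counts, the capped non-responders each stand in for several population rows, so the weighted rate matches the original response rate rather than the inflated sample rate.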

My (zillions of) questions are as follows:

--> How seriously should I take the Hosmer-Lemeshow statistic in assessing goodness-of-fit? Some folks say it does not mean anything in the case of using sampling weights. Some folks say since all my variables (both dependent and independent) are dummies, it is of little use if any at all. I'm confused.
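
For reference, a minimal Python sketch of how the Hosmer-Lemeshow statistic is usually computed: sort by predicted probability, split into groups, and compare observed with expected event counts. The function name and the ten-group default are assumptions, the sketch ignores sampling weights, and it is not meant to reproduce PROC LOGISTIC's exact grouping rules.

```python
def hosmer_lemeshow(y, p, groups=10):
    """Return (statistic, degrees of freedom) for 0/1 outcomes y and probabilities p."""
    pairs = sorted(zip(p, y))  # order observations by predicted probability
    n = len(pairs)
    stat = 0.0
    used = 0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        n_g = len(chunk)
        expected = sum(pi for pi, _ in chunk)   # expected number of events
        observed = sum(yi for _, yi in chunk)   # observed number of events
        p_bar = expected / n_g
        if 0.0 < p_bar < 1.0:
            stat += (observed - expected) ** 2 / (n_g * p_bar * (1.0 - p_bar))
        used += 1
    return stat, used - 2  # compared against chi-square with (groups - 2) df
```

Note that when all predictors are dummies, many observations share the same predicted probability, so the group boundaries in a sketch like this depend on how ties happen to be split.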

--> Similarly, is the "classification table" output meaningless in the case of sampling weights? If so, how can we adjust it?

--> I ended up with one model (Model A) with a comfortable Hosmer-Lemeshow statistic (p-value around .46) in the development sample and with Gini values of 68 and 69.16 in the development and test samples respectively. The scores also align in both samples (though not very smoothly -- some score bands have similar response rates). Another model (Model B) has a Hosmer-Lemeshow p-value of ~.13 (still acceptable) and higher Ginis in both development and test samples: 69 and 69.72 respectively. It also shows some misalignment in both samples for low score bands. Yet another model (Model C) fails the Hosmer-Lemeshow test (p-value < 0.01) and shows slight misalignment in both samples too. However, it has higher and closer Gini values: 69.72 and 69.84. In complexity, Model C > Model B > Model A (">" meaning "is more complex than"). Moreover, bootstrap estimates and confidence intervals are more stable in C than in B and A (C > B > A). So how would you decide? (I'm sort of inclined to choose B.)
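
For clarity on what the Gini figures above measure, here is a minimal Python sketch that computes Gini as (concordant pairs - discordant pairs) / (all responder/non-responder pairs), scaled to 0-100. This equals 2 * AUC - 1, i.e. Somers' D. The function name is an assumption, and the brute-force pair loop is only practical for small samples.

```python
def gini(y, score):
    """Gini (Somers' D) in percent for 0/1 outcomes y and model scores."""
    pos = [s for s, yi in zip(score, y) if yi == 1]
    neg = [s for s, yi in zip(score, y) if yi == 0]
    if not pos or not neg:
        raise ValueError("need at least one responder and one non-responder")
    concordant = discordant = 0
    for sp in pos:
        for sn in neg:
            if sp > sn:
                concordant += 1   # responder scored higher: concordant pair
            elif sp < sn:
                discordant += 1   # responder scored lower: discordant pair
            # ties count as neither
    return 100.0 * (concordant - discordant) / (len(pos) * len(neg))
```

Computing this on the development and test samples separately gives the pairs of figures quoted for each model.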

--> Since our purpose is prediction and we do not aim to explain anything, why don't we just "throw in" as many variables as possible and / or choose the AIC-optimal (or the AIC[3/2]-optimal) model? The papers I read on the topic suggest including as many variables as possible and / or averaging all alternative models' coefficients (or probability estimates). Is this true? [By the way, I threw away the highly-correlated variables before attempting model building.]
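
As a reminder of what "AIC-optimal" means here, a minimal Python sketch of the standard criterion, AIC = -2 * log-likelihood + 2 * (number of estimated parameters), for a binary-response model. The function names are assumptions, the sketch is unweighted, and it covers only plain AIC, not the AIC[3/2] variant.

```python
import math


def log_likelihood(y, p):
    """Bernoulli log-likelihood of 0/1 outcomes y given predicted probabilities p."""
    return sum(math.log(pi) if yi == 1 else math.log(1.0 - pi)
               for yi, pi in zip(y, p))


def aic(y, p, n_params):
    """Akaike information criterion; smaller is better among candidate models."""
    return -2.0 * log_likelihood(y, p) + 2.0 * n_params
```

Candidate models would be compared by calling `aic` on the same development sample with each model's predicted probabilities and its own parameter count (intercept included).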

--> Again, since we do prediction, we do not have to worry about severe multi-collinearity, or do we? [My boss insists we use PROC REG with the TOL, VIF and COLLINOINT options to diagnose and later eliminate multi-collinearity.]
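
For reference, tolerance and VIF for a predictor are defined as tolerance = 1 - R^2 and VIF = 1 / tolerance, where R^2 comes from regressing that predictor on the others. A minimal Python sketch for the simplest case of exactly two predictors, where R^2 is just the squared Pearson correlation; the function name is an assumption, and the general multi-predictor case needs a full regression, which is not shown.

```python
def tolerance_and_vif(x1, x2):
    """Tolerance and VIF when the model has exactly two predictors x1 and x2."""
    n = len(x1)
    m1 = sum(x1) / n
    m2 = sum(x2) / n
    sxy = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    sxx = sum((a - m1) ** 2 for a in x1)
    syy = sum((b - m2) ** 2 for b in x2)
    r_squared = sxy ** 2 / (sxx * syy)  # squared Pearson correlation
    tol = 1.0 - r_squared
    if tol == 0.0:
        raise ValueError("predictors are perfectly collinear")
    return tol, 1.0 / tol
```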

--> And finally, had I better use PROC SURVEYLOGISTIC as opposed to PROC LOGISTIC with the WEIGHT statement? Is it any better? If so, what is the "OUTEST=" equivalent in PROC SURVEYLOGISTIC (I couldn't find it in the docs)?

--> What are your guidelines for model selection? Which criteria do you consider most critical in choosing the "best" model for prediction? How would you avoid overfitting or underfitting?

So many thanks in advance,

Bora Y.
