LISTSERV at the University of Georgia
Date:         Wed, 25 Apr 2007 19:44:39 +0530
Reply-To:     cherish@global-analytics.com
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Cherish <cherish@GLOBAL-ANALYTICS.COM>
Subject:      Re: Modeling Problem because of sampling
Comments: To: David L Cassell <davidlcassell@MSN.COM>
In-Reply-To:  <BAY103-F2871CEFE28BC8A04294D05B04A0@phx.gbl>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

I have reduced the variable list from 1000 to 15 using stepwise regression, and on this reduced list I am using the BEST= option of PROC LOGISTIC. I am building the model on the training data and cross-checking it on the validation data using the KS statistic.

I know I have used stepwise selection to reduce the variables; maybe I should use PROC FACTOR instead. My dataset contains many correlated variables, so I definitely have to reduce them, and my DV, as I said earlier, is a Bernoulli DV (whether a customer defaults in X days or not). Is there any other method I could use?

I have one more question. How does SAS compute the score chi-square value for subset models when the BEST= option is used? My concern is that with more variables in a model the chi-square value will obviously be larger on the training sample, but the same may not be true for the validation sample. I want to compute the value for the validation sample as well. If anybody has a macro for this, please let me know.
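For the validation-sample check, the KS statistic can be computed directly from scored records once a model's predicted probabilities have been exported. The following is a minimal Python sketch (not SAS, and not a replacement for the macro asked about): the function name `ks_statistic` and the toy `scores` and `labels` values are illustrative assumptions, not anything from this thread. It measures the largest gap between the cumulative score distributions of defaulters (label 1) and non-defaulters (label 0).

```python
def ks_statistic(scores, labels):
    """Two-sample KS statistic between the score distributions of
    events (label 1) and non-events (label 0). Hypothetical helper."""
    pairs = sorted(zip(scores, labels))
    n_event = sum(1 for _, y in pairs if y == 1)
    n_nonevent = len(pairs) - n_event
    if n_event == 0 or n_nonevent == 0:
        raise ValueError("both classes must be present")

    cum_event = 0
    cum_nonevent = 0
    largest_gap = 0.0
    i = 0
    while i < len(pairs):
        # Consume every record tied at the current score before
        # comparing the two empirical CDFs.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1] == 1:
                cum_event += 1
            else:
                cum_nonevent += 1
            j += 1
        gap = abs(cum_event / n_event - cum_nonevent / n_nonevent)
        largest_gap = max(largest_gap, gap)
        i = j
    return largest_gap


# Toy data only: the same scoring rule applied to two samples.
train_ks = ks_statistic([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])
valid_ks = ks_statistic([0.9, 0.4, 0.6, 0.2], [1, 1, 0, 0])
print("train KS:", train_ks, "validation KS:", valid_ks)
```

Running the same function on training and validation scores for each candidate subset gives a like-for-like comparison; a large drop from training to validation is the overfit being described.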

Regards Cherish

Thanks in advance

David L Cassell wrote:

> cherish@global-analytics.com wrote:
>
>> Hi All,
>>
>> I am building a model where I split the dataset into training and
>> validation samples (66% and 34%). The DV of the sample is whether a
>> customer defaults in X days. The splitting of the sample is done by the
>> SRS method of PROC SURVEYSELECT. We find that the performance of the
>> model is very sample dependent (i.e. it depends on the seed of
>> SURVEYSELECT). We are getting a huge overfit w.r.t. the KS
>> (Kolmogorov-Smirnov) metric. How can we avoid this situation? How do we
>> get the best split, where there is no overfit and no compromise on
>> performance either?
>>
>> Thanks in advance.
>>
>> Regards
>> Cherish
>
> Let me take a wild guess. I look into my crystal ball and see...
> that you are performing logistic regression using stepwise selection
> on a very large set of regressors.
>
> That's my guess.
>
> Your metric is not helping any.
>
> Your sampling is not the problem.
>
> The problem is the unstable model fits you are getting because of
> your methodology. As a result, you are getting overfits and unstable
> models.
>
> Please write back to SAS-L and explain in more detail just how far off
> from the truth I actually am, and what you are really doing. Then
> someone here ought to be able to give you better advice.
>
> HTH,
> David
> --
> David L. Cassell
> mathematical statistician
> Design Pathways
> 3115 NW Norwood Pl.
> Corvallis OR 97330
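On the seed-dependent split in the original question: the quoted reply argues the sampling is not the root cause, but one cheap way to rule out an uneven default rate between the two samples is to split within each class of the DV, roughly what a STRATA statement on the DV in PROC SURVEYSELECT would do. The following is a minimal Python sketch under that assumption; `stratified_split`, `train_frac` and the toy `labels` list are hypothetical names for illustration only.

```python
import random


def stratified_split(labels, train_frac=0.66, seed=0):
    """Split record indices into training and validation sets so that
    each class of the DV keeps about the same share in both sets.
    Hypothetical helper for illustration."""
    rng = random.Random(seed)

    # Group record indices by class (e.g. 1 = default, 0 = no default).
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)

    train, valid = [], []
    for y in sorted(by_class):
        idxs = by_class[y][:]
        rng.shuffle(idxs)
        cut = round(train_frac * len(idxs))
        train.extend(idxs[:cut])
        valid.extend(idxs[cut:])
    return sorted(train), sorted(valid)


# Toy data: 100 defaulters and 900 non-defaulters.
labels = [1] * 100 + [0] * 900
train_idx, valid_idx = stratified_split(labels, train_frac=0.66, seed=1)
print(len(train_idx), len(valid_idx))
```

With this kind of split, changing the seed changes which records land in each sample but not the default rate in either one, so any remaining seed-to-seed swing in KS points at the model-selection step rather than at the sampling.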

