Date: Wed, 25 Apr 2007 19:44:39 +0530
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Cherish <cherish@GLOBAL-ANALYTICS.COM>
Subject: Re: Modeling Problem because of sampling
I have reduced the variable list from 1000 to 15 using stepwise
regression, and on that list I am using the BEST= option of PROC
LOGISTIC. I am building the model on the training data and cross-checking
it on the validation data using the KS statistic.

I know I have used stepwise to reduce the variables; maybe I should use
PROC FACTOR instead. My dataset contains a lot of correlated variables,
so I definitely have to reduce them, and my DV, as I said earlier, is a
Bernoulli DV (whether a customer defaults in X days or not). Is there any
other method I could use?

I have one more question. How does SAS compute the score chi-square value
for subset models using the BEST= option? My concern is that with more
variables in a model the chi-square value will obviously be larger on the
training sample, but the same may not be true for the validation sample.
I want to compute the value for the validation sample as well. If anybody
has a macro for this, please let me know.
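
For reference, the KS check on the validation sample can be sketched outside SAS as well. This is only a minimal illustration, not a macro: it assumes the predicted probabilities and the 0/1 default flag have already been exported for each sample, and the function name ks_statistic and its arguments are made up for the example.

```python
def ks_statistic(scores, labels):
    """Two-sample KS: largest gap between the empirical score CDFs
    of events (label 1) and non-events (label 0)."""
    n_event = sum(1 for y in labels if y == 1)
    n_nonevent = len(labels) - n_event
    if n_event == 0 or n_nonevent == 0:
        raise ValueError("need at least one event and one non-event")

    pairs = sorted(zip(scores, labels))
    cum_event = 0
    cum_nonevent = 0
    best = 0.0
    i = 0
    while i < len(pairs):
        current = pairs[i][0]
        # Consume every observation tied at this score before comparing CDFs.
        while i < len(pairs) and pairs[i][0] == current:
            if pairs[i][1] == 1:
                cum_event += 1
            else:
                cum_nonevent += 1
            i += 1
        gap = abs(cum_event / n_event - cum_nonevent / n_nonevent)
        best = max(best, gap)
    return best


# Hypothetical scored samples: (predicted probability, default flag).
train_ks = ks_statistic([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])
valid_ks = ks_statistic([0.1, 0.2, 0.3, 0.4], [0, 1, 0, 1])
print(train_ks, valid_ks)
```

Running the same function on the training and validation scores from one fitted model gives the gap I am describing as overfit; repeating it over several SURVEYSELECT seeds would show how much that gap moves with the split.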
Thanks in advance
David L Cassell wrote:
> email@example.com wrote:
>> Hi All,
>> I am building a model where I split the dataset into training and
>> validation sample (66% and 34%). The DV of the sample is whether a
>> customer defaults in X days. The sample is split using the SRS
>> method of PROC SURVEYSELECT. We find that the performance of the model
>> is very sample dependent (i.e. on the seed of SURVEYSELECT). We are
>> getting a huge overfit w.r.t. the KS (Kolmogorov-Smirnov) metric. How
>> can we avoid this situation? How can we get a best split where there is
>> no overfit and no compromise on performance either?
>> Thanks in advance.
> Let me take a wild guess. I look into my crystal ball and see...
> that you are performing logistic regression using stepwise selection
> on a very large set of regressors.
> That's my guess.
> Your metric is not helping any.
> Your sampling is not the problem.
> The problem is the unstable model fits you are getting because of
> your methodology. As a result, you are getting overfits and unstable
> Please write back to SAS-L and explain in more detail just how far off
> the truth I actually am, and what you are really doing. Then someone
> here ought to be able to give you better advice.
> David L. Cassell
> mathematical statistician
> Design Pathways
> 3115 NW Norwood Pl.
> Corvallis OR 97330