Date: Wed, 25 Apr 2007 19:44:39 +0530
Reply-To: cherish@global-analytics.com
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Cherish <cherish@GLOBAL-ANALYTICS.COM>
Subject: Re: Modeling Problem because of sampling
In-Reply-To: <BAY103-F2871CEFE28BC8A04294D05B04A0@phx.gbl>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
I have reduced the variable list from 1000 to 15 using stepwise
regression, and on this reduced list I am using the BEST option of
logistic regression. I am building the model on the training data and
cross-checking it on the validation data using the KS statistic. I know
I have used stepwise selection to reduce the variables; maybe I should
use PROC FACTOR instead. My dataset contains many correlated variables,
so I definitely have to reduce them, and my DV, as I said earlier, is a
Bernoulli DV (whether a customer defaults in X days or not). Is there
any other method I could use?

I have one more question. How does SAS compute the score chi-square
value for subset models when the BEST option is used? My concern is that
with more variables in a model the chi-square value will obviously be
higher on the training sample, but the same may not be true on the
validation sample. I want to compute the value for the validation sample
as well. If anybody has a macro for this, please let me know.
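
For anyone following along, the train-versus-validation KS check
mentioned above can be sketched roughly as below. This is Python rather
than SAS and is only an illustration under assumptions: the function
name ks_statistic, the toy score lists, and the 0/1 default labels are
all hypothetical, and the scores stand in for predicted default
probabilities from whatever model was fitted.

```python
from bisect import bisect_right


def ks_statistic(scores, labels):
    """Two-sample KS statistic: the largest gap between the empirical
    CDFs of the scores for defaulters (label 1) and non-defaulters
    (label 0)."""
    bad = sorted(s for s, y in zip(scores, labels) if y == 1)
    good = sorted(s for s, y in zip(scores, labels) if y == 0)
    if not bad or not good:
        raise ValueError("need at least one defaulter and one non-defaulter")
    ks = 0.0
    # The gap between two step functions can only change at observed scores.
    for cut in sorted(set(scores)):
        cdf_bad = bisect_right(bad, cut) / len(bad)
        cdf_good = bisect_right(good, cut) / len(good)
        ks = max(ks, abs(cdf_bad - cdf_good))
    return ks


# Hypothetical toy data: model scores and observed default flags.
train_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]
train_labels = [1, 1, 1, 0, 0, 0]
valid_scores = [0.9, 0.4, 0.6, 0.3, 0.7, 0.1]
valid_labels = [1, 1, 0, 0, 0, 0]

ks_train = ks_statistic(train_scores, train_labels)
ks_valid = ks_statistic(valid_scores, valid_labels)
# A large positive gap suggests the model is overfit to the training sample.
print("train KS:", ks_train, "validation KS:", ks_valid,
      "gap:", ks_train - ks_valid)
```

Repeating this comparison over several different split seeds would show
how much of the gap is due to the particular sample drawn.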
Regards
Cherish
Thanks in advance
David L Cassell wrote:
> cherish@global-analytics.com wrote:
>
>>
>> Hi All,
>>
>> I am building a model where I split the dataset into training and
>> validation samples (66% and 34%). The DV is whether a customer
>> defaults in X days. The split is done by the SRS method of PROC
>> SURVEYSELECT. We find that the performance of the model is very
>> sample dependent (i.e., it depends on the SURVEYSELECT seed). We are
>> getting a huge overfit w.r.t. the KS (Kolmogorov-Smirnov) metric. How
>> can we avoid this situation? How do we get the best split, where
>> there is no overfit and no compromise on performance either?
>>
>> Thanks in advance.
>>
>> Regards
>> Cherish
>
>
> Let me take a wild guess. I look into my crystal ball and see...
> that you are performing logistic regression using stepwise selection
> on a very large set of regressors.
>
> That's my guess.
>
> Your metric is not helping any.
>
> Your sampling is not the problem.
>
> The problem is the unstable model fits you are getting because of
> your methodology. As a result, you are getting overfits and unstable
> models.
>
> Please write back to SAS-L and explain in more detail just how far off
> from the truth I actually am, and what you are really doing. Then
> someone here ought to be able to give you better advice.
>
> HTH,
> David
> --
> David L. Cassell
> mathematical statistician
> Design Pathways
> 3115 NW Norwood Pl.
> Corvallis OR 97330