Date: Wed, 25 Feb 2009 15:56:00 -0500
Reply-To: Peter Flom <peterflomconsulting@mindspring.com>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Peter Flom <peterflomconsulting@MINDSPRING.COM>
Subject: Re: oversampling too much?
Content-Type: text/plain; charset=UTF-8
I wrote
>
>> I am not an expert on this area, but
>> 1) I don't see how oversampling from an existing data set helps. I could see
>> oversampling when *building* a data set. You want to oversample rare populations so that
>> you have enough people from those populations. But in your situation, I think the
>> only advantage of oversampling would be the speed with which the logistic regression runs.
>>
>> (That's just my intuition ....)
>
Gary replied
>When I used oversampling, I could select about 20 variables after
>backward logistic; however, without oversampling, I can only get 4
>variables. It seems oversampling can help get the features.
>
Backward selection is a good way to get too many variables in the model, and all the results from backward (or forward, or stepwise) selection are wrong: p-values are too low, standard errors are too small, and estimates are biased away from 0. You will certainly get a more COMPLEX model. That doesn't mean you will get a better one.
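To make the point about p-values concrete, here is a minimal simulation sketch. It is not backward logistic regression: it uses a simpler stand-in (a continuous outcome, and picking the single noise predictor most correlated with it), and the function names, sample sizes, and seed are all made up for illustration. Every predictor is pure noise, so any "significant" result is a false positive. The significance test uses the Fisher z approximation for a correlation.

```python
import math
import random
from statistics import NormalDist


def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def best_predictor_false_positive_rate(n_obs=100, n_vars=20, n_sims=200, seed=1):
    """Share of simulations in which the best of n_vars noise predictors
    looks 'significant' at a nominal two-sided 5% level.

    Hypothetical illustration: outcome and predictors are independent
    standard normal draws, so there is no real signal to find.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(0.975)  # nominal two-sided 5% cutoff
    hits = 0
    for _ in range(n_sims):
        y = [rng.gauss(0, 1) for _ in range(n_obs)]
        best_abs_r = 0.0
        for _ in range(n_vars):
            x = [rng.gauss(0, 1) for _ in range(n_obs)]
            best_abs_r = max(best_abs_r, abs(pearson_r(x, y)))
        # Fisher z statistic for the selected (largest) correlation,
        # tested as if it had been the only one examined.
        z = math.atanh(best_abs_r) * math.sqrt(n_obs - 3)
        if z > z_crit:
            hits += 1
    return hits / n_sims


if __name__ == "__main__":
    rate = best_predictor_false_positive_rate()
    print(f"Nominal level 0.05, observed rate for the selected variable: {rate:.2f}")
```

With 20 independent noise variables, the chance that at least one clears a 5% test is roughly 1 - 0.95^20, so the observed rate should land far above the nominal 0.05. The same mechanism is at work when a selection procedure keeps whichever variables happened to look best in the sample.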
>> 2) I am concerned with any model that has 500 variables, *regardless* of the number of cases.
>> The rule of thumb of 10-1 is not bad, but it's not ironclad. What are these 500 variables? How are they related?
>
>Yes, we have 500+ variables available, including demographic data,
>bank account information, transaction history, etc. The final model
>contains about 8 variables which were selected using business sense,
>VARCLUS, LOGISTIC backward, etc.
>
I am still skeptical that there are 500 things about a person that would be helpful for anything.
Backwards logistic (or OLS, or anything) - see above.
VARCLUS can be interesting with large data sets like this. But I would then use the results of VARCLUS as input to a regression.
>> 3) Since you are in marketing, I imagine you are mainly or entirely interested in prediction, rather than explanation. You might consider multimodel averaging (see a book by Burnham and Anderson)
>
>Yes, we are predicting whether the customer will respond to a mail
>offer or not.
Take a look at these two books:
Harrell: Regression Modeling Strategies
Burnham and Anderson: Model Selection and Multimodel Inference
Very valuable, in my opinion.
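As a rough idea of what multimodel averaging looks like in practice, here is a minimal sketch of Akaike weights, the weighting scheme Burnham and Anderson describe. The AIC values and predicted response probabilities below are invented for illustration; in real use they would come from a small set of candidate models chosen in advance, each fitted to the same data.

```python
import math


def akaike_weights(aic_values):
    """Convert AIC values into Akaike weights that sum to 1.

    Each weight is exp(-delta/2) divided by the sum of those terms,
    where delta is the model's AIC minus the smallest AIC in the set.
    """
    best = min(aic_values)
    rel_likelihoods = [math.exp(-0.5 * (aic - best)) for aic in aic_values]
    total = sum(rel_likelihoods)
    return [r / total for r in rel_likelihoods]


def model_averaged_prediction(predictions, weights):
    """Weighted average of one customer's predicted probability across models."""
    return sum(p * w for p, w in zip(predictions, weights))


if __name__ == "__main__":
    # Hypothetical AICs for three candidate models (not from any real fit).
    aics = [1000.0, 1001.2, 1004.5]
    # Hypothetical predicted response probabilities for one customer.
    preds = [0.030, 0.042, 0.025]

    weights = akaike_weights(aics)
    print("weights:", [round(w, 3) for w in weights])
    print("averaged prediction:", round(model_averaged_prediction(preds, weights), 4))
```

The point of the design is that no single model is declared the winner: models with nearly equal AIC share the weight, and the prediction reflects that uncertainty instead of hiding it.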
For a much shorter take on why stepwise methods are bad, see a talk that I gave with David Cassell at various user group meetings. It's available in various places, e.g.
http://www.nesug.org/Proceedings/nesug07/sa/sa07.pdf
Peter
Peter L. Flom, PhD
Statistical Consultant
www DOT peterflomconsulting DOT com