LISTSERV at the University of Georgia
Date:         Wed, 25 Feb 2009 15:56:00 -0500
Reply-To:     Peter Flom <peterflomconsulting@mindspring.com>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Peter Flom <peterflomconsulting@MINDSPRING.COM>
Subject:      Re: oversampling too much?
Comments: To: Gary <fuguoyi@GMAIL.COM>
Content-Type: text/plain; charset=UTF-8

I wrote

>> I am not an expert in this area, but
>> 1) I don't see how oversampling from an existing data set helps. I could see
>> oversampling when *building* a data set. You want to oversample rare populations so that
>> you have enough people from those populations. But in your situation, I think the
>> only advantage of oversampling would be the speed with which the logistic regression runs.
>>
>> (That's just my intuition ....)

Gary replied

>When I used oversampling, I can select about 20 variables after
>backward logistic; however, if not-oversamples, I can only get 4
>variables. It seems oversampling can help get the features.

Backward selection is a good way to get too many variables in the model, and all the results from backward (or forward, or stepwise) are wrong: p values are too low, standard errors are too small, estimates are biased away from 0. You will certainly get a more COMPLEX model. That doesn't mean you will get a better one.
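
To illustrate the point about p values and biased estimates, here is a minimal simulation sketch. It is not backward elimination and not PROC LOGISTIC: it uses a simpler stand-in (screening each variable on its own with a simple linear regression and keeping those with |t| > 1.96), but the selection effect is the same in kind. The outcome and all 500 candidate variables are pure noise, so every true slope is 0. The sample size of 100, the seed, and the function names are assumptions made for the illustration.

```python
import math
import random

def slope_and_t(x, y):
    """Return the simple-regression slope of y on x and its t statistic."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)  # standard error of the slope
    return slope, slope / se

def simulate(n_obs=100, n_vars=500, t_cut=1.96, seed=2009):
    """Screen n_vars noise predictors against a noise outcome.

    Returns the estimated slopes of the predictors that pass the screen.
    Every true slope is 0, so each survivor is a false positive.
    """
    rng = random.Random(seed)
    y = [rng.gauss(0, 1) for _ in range(n_obs)]
    kept = []
    for _ in range(n_vars):
        x = [rng.gauss(0, 1) for _ in range(n_obs)]
        slope, t = slope_and_t(x, y)
        if abs(t) > t_cut:
            kept.append(slope)
    return kept

kept = simulate()
print(len(kept), "of 500 pure-noise variables pass the screen")
print("smallest |slope| among survivors:", min(abs(s) for s in kept))
```

Roughly 5% of the noise variables should survive, each with a nominal p value below 0.05 and an estimated slope well away from the true value of 0, because only the estimates that happened to land far from 0 were kept.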

>> 2) I am concerned with any model that has 500 variables, *regardless* of the number of cases.
>> The rule of thumb of 10-1 is not bad, but it's not ironclad. What are these 500 variables? How are they related?
>
>Yes, we have 500+ variables available, including demographic data,
>bank account information, transactions history, etc. The final model
>contains about 8 variables which were selected using business sense,
>VARCLUS, LOGISTICS backward, etc.

I am still skeptical that there are 500 things about a person that would be helpful for anything.

Backwards logistic (or OLS, or anything) - see above.

VARCLUS can be interesting with large data sets like this. But I would then use the results of VARCLUS as input to a regression.

>> 3) Since you are in marketing, I imagine you are mainly or entirely interested in prediction, rather than explanation. You might consider multimodel averaging (see a book by Burnham and Anderson)
>
>Yes, we are predicting whether the customer will response to a mail
>offer or not.

Take a look at these two books:

Harrell: Regression Modeling Strategies
Burnham and Anderson: Model Selection and Multimodel Inference

Very valuable, in my opinion.
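
For a concrete sense of the multimodel averaging idea mentioned above, here is a minimal sketch using Akaike weights, the AIC-based weighting scheme that Burnham and Anderson describe. Each candidate model gets a weight proportional to exp(-delta/2), where delta is its AIC minus the smallest AIC in the set, and the averaged prediction is the weighted sum of the individual predictions. The three AIC values and predicted response probabilities below are hypothetical numbers made up for the illustration, and the function names are my own.

```python
import math

def akaike_weights(aic_values):
    """Convert a list of AIC values into Akaike weights that sum to 1."""
    best = min(aic_values)
    # Relative likelihood of each model, given the data and the model set
    rel = [math.exp(-0.5 * (a - best)) for a in aic_values]
    total = sum(rel)
    return [r / total for r in rel]

def averaged_prediction(predictions, weights):
    """Weighted average of per-model predictions for one observation."""
    return sum(p * w for p, w in zip(predictions, weights))

# Hypothetical values: AIC of three candidate models, and each model's
# predicted probability that one customer responds to the mail offer.
aics = [1002.3, 1000.0, 1004.6]
preds = [0.031, 0.027, 0.040]

w = akaike_weights(aics)
print("weights:", [round(v, 3) for v in w])
print("averaged prediction:", round(averaged_prediction(preds, w), 4))
```

The design point is that no single "winning" model is chosen: a model with a slightly worse AIC still contributes, just with less weight.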

For a much shorter take on why stepwise methods are bad, see a talk that I gave with David Cassell at various user group meetings. It's available in various places, e.g.

http://www.nesug.org/Proceedings/nesug07/sa/sa07.pdf

Peter

Peter L. Flom, PhD
Statistical Consultant
www DOT peterflomconsulting DOT com

