LISTSERV at the University of Georgia
Date:         Wed, 25 Feb 2009 10:34:33 -0800
Reply-To:     Jeff <jeffrey.m.allard@GMAIL.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Jeff <jeffrey.m.allard@GMAIL.COM>
Organization: http://groups.google.com
Subject:      Re: oversampling too much?
Comments: To: sas-l@uga.edu
Content-Type: text/plain; charset=ISO-8859-1

On Feb 25, 8:54 am, Gary <fugu...@gmail.com> wrote:
> I am new to this group, and just started a job with a bank. When
> modeling rare events in marketing, it has been suggested by many to
> take a sample stratified by the dependent variable(s) in order to
> allow the modeling technique a better chance of detecting a
> difference. Much of the literature suggests the proportion of the
> event in the sample should range between 15-50% for a binary outcome,
> and we can use an offset to adjust it.
>
> The response rate of my current case is 0.3%, and when I built the
> model, I oversampled the response to 25%. However, the tradition here
> is to oversample to 1%, and they told me that if we oversample too
> much, the model will be sensitive.
>
> Is there any problem with oversampling from 0.3% (8000 out of 2.2M
> targets) to 25% (8000 resps and 24000 non-resps)? We have about 500
> variables to build the model.
>
> Thanks for your answer.

Gary-

Oversampling as you describe is often useful for rare events, and sometimes it is vital. There are a lot of references on this subject (Gary King comes to mind). This is a good example of statistics versus data mining (the former not always seeing the value) :-). How much it helps really does depend on the model or algorithm you are using. For example, a decision tree, which makes its predictions by minimizing cost or error, will with an event this rare tend to predict the non-rare class every time and be almost always right (not a very interesting model). So you have to either include a cost matrix or oversample. I have found oversampling rare events beneficial for logistic regression as well.
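To put a number on the "almost always right" point, here is a small arithmetic sketch using the counts from the question (8,000 responders out of 2.2M targets). The variable names are only illustrative, and the "model" is the trivial one that always predicts the non-rare class.

```python
# Counts taken from the question: 8,000 responders among 2.2M targets.
responders = 8_000
population = 2_200_000

event_rate = responders / population

# A model that always predicts "no response" is correct on every
# non-responder and wrong on every responder.
baseline_accuracy = (population - responders) / population

# It identifies none of the responders, so its recall is zero.
baseline_recall = 0 / responders

print(f"event rate:        {event_rate:.4%}")
print(f"baseline accuracy: {baseline_accuracy:.4%}")
print(f"baseline recall:   {baseline_recall:.0%}")
```

Accuracy alone looks excellent here while the model is useless for targeting, which is why a cost matrix or oversampling is needed.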

A couple of thoughts:

1) You need to adjust the predicted event rates (the predicted probabilities in logistic regression) to reflect the population. You can adjust the intercept in logistic regression or use PROC SURVEYLOGISTIC, treating the data as a stratified sample.

2) I see a lot of authors in the data mining sphere describing balanced samples as the best. Others say 20-30%. In my experience it works better to treat the sampling rate as another parameter and choose the rate that gives the best lift in the hold-out group.
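Both points can be sketched in a few lines of Python. This is a minimal illustration, not SAS code and not necessarily how anyone on the list does it. It assumes sampling depends only on the outcome (all responders kept, non-responders subsampled) and uses the commonly cited intercept offset ln[(s/(1-s)) * ((1-t)/t)], where s is the event rate in the sample and t is the event rate in the population. The function names are hypothetical.

```python
import math


def corrected_intercept(intercept, sample_rate, population_rate):
    """Shift an intercept fitted on the oversampled data to the population scale."""
    offset = math.log(
        (sample_rate / (1 - sample_rate))
        * ((1 - population_rate) / population_rate)
    )
    return intercept - offset


def corrected_probability(p_sample, sample_rate, population_rate):
    """Apply the same correction directly to a predicted probability."""
    num = p_sample * population_rate / sample_rate
    den = num + (1 - p_sample) * (1 - population_rate) / (1 - sample_rate)
    return num / den


def top_fraction_lift(scores, outcomes, fraction=0.1):
    """Response rate in the top-scored fraction divided by the overall rate."""
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, int(len(ranked) * fraction))
    top_rate = sum(y for _, y in ranked[:n_top]) / n_top
    overall_rate = sum(outcomes) / len(outcomes)
    return top_rate / overall_rate


# Rates from the question: 8,000 responders kept, 24,000 non-responders sampled.
sample_rate = 8_000 / 32_000
population_rate = 8_000 / 2_200_000

# A case scored at the sample average maps back to the population average.
print(corrected_probability(0.25, sample_rate, population_rate))
```

The correction is a monotone transformation of the score, so it changes the predicted rates but not the ranking. Lift on the hold-out group is therefore the same before and after the adjustment, and candidate sampling rates can be compared by fitting at each rate and calling `top_fraction_lift` on the same hold-out data.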

HTH

