Date: Wed, 25 Feb 2009 10:34:33 -0800
Reply-To: Jeff <jeffrey.m.allard@GMAIL.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Jeff <jeffrey.m.allard@GMAIL.COM>
Subject: Re: oversampling too much?
Content-Type: text/plain; charset=ISO-8859-1
On Feb 25, 8:54 am, Gary <fugu...@gmail.com> wrote:
> I am new to this group, and just started a job with a bank. When
> modeling rare events in marketing, it has been suggested by many to
> take a sample stratified by the dependent variable(s) in order to
> allow the modeling technique a better chance of detecting a
> difference. Much of the literature suggests that the proportion of
> the event in the sample should range between 15% and 50% for a binary
> outcome, and that we can use an offset to adjust for it.
> The response rate in my current case is 0.3%, and when I built the
> model, I oversampled the responses to 25%. However, the tradition here
> is to oversample to 1%, and they told me that if I oversample too
> much, the model will be sensitive.
> Is there any problem with oversampling from 0.3% (8000 out of 2.2M
> targets) to 25% (8000 resps and 24000 non-resps)? We have about 500
> variables to build the model.
> Thanks for your answer.
Oversampling as you describe is indeed often useful for rare events
and can sometimes be vital. There are a lot of references on this
subject (Gary King comes to mind). This is a good example of
statistics versus data mining (the former not always seeing the
value) :-). It really does depend on the model or algorithm you are
using. For example, something like a decision tree, which seeks to
make a prediction by minimizing cost or error, will tend to always
predict the non-rare class and be almost always right (not a very
interesting model). So you have to include a cost matrix or
oversample. I have found benefits from oversampling rare events for
logistic regression as well.
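To put rough numbers on the point about a model that always predicts
the non-rare class, here is a small Python sketch using only the
counts from the question (8000 responders out of 2.2M targets, and a
proposed sample of 8000 responders plus 24000 non-responders). The
variable names are mine, and this is just arithmetic, not a model.

```python
# Counts taken from the question: 8,000 responders among 2.2M targets.
responders = 8_000
population = 2_200_000

# Event rate in the full population (a bit under 0.4%).
event_rate = responders / population

# A rule that always predicts "no response" is wrong only on responders,
# so its accuracy is one minus the event rate.
baseline_accuracy = 1 - event_rate

# The proposed sample: all responders plus 24,000 sampled non-responders.
sampled_nonresponders = 24_000
sample_event_rate = responders / (responders + sampled_nonresponders)

print(f"population event rate:           {event_rate:.4%}")
print(f"accuracy of 'never responds':    {baseline_accuracy:.4%}")
print(f"event rate in the sample:        {sample_event_rate:.0%}")
```

The "never responds" rule is right more than 99.6% of the time and
finds no responders at all, which is why plain error minimization is
not much help at this event rate.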
1) You need to adjust the predicted rates of the events (the predicted
probability in logistic regression) to reflect the population. You can
adjust the intercept in logistic regression, or fit the model with
PROC SURVEYLOGISTIC using sampling weights.
2) I see a lot of authors in the data mining sphere recommending
balanced samples as the best choice. Others say 20-30%. In my
experience, it works best to treat the sampling rate as another tuning
parameter and pick the rate that gives the best lift in the hold-out
group.
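The two points above can be sketched in Python as follows. This is
only an illustration under some assumptions of mine: the sample keeps
all responders and a random subset of non-responders, the intercept
shift is the usual prior correction for that kind of sampling (the
difference between the sample and population log-odds), and the
function names are made up for this example. The lift function is one
simple way to compare sampling rates on a hold-out group, not the only
one.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1 - p))


def adjust_intercept(b0_sample, sample_rate, population_rate):
    """Shift an intercept fitted on the oversampled data to the population scale."""
    return b0_sample - (logit(sample_rate) - logit(population_rate))


def adjust_probability(p_sample, sample_rate, population_rate):
    """Apply the same shift to a single predicted probability."""
    corrected = logit(p_sample) - (logit(sample_rate) - logit(population_rate))
    return 1 / (1 + math.exp(-corrected))


def top_fraction_lift(scores, outcomes, fraction=0.1):
    """Response rate in the top-scored fraction divided by the overall rate."""
    ranked = sorted(zip(scores, outcomes), key=lambda pair: pair[0], reverse=True)
    n_top = max(1, int(len(ranked) * fraction))
    top_rate = sum(y for _, y in ranked[:n_top]) / n_top
    overall_rate = sum(outcomes) / len(outcomes)
    return top_rate / overall_rate


# Rates from the question: 25% events in the sample, 8000 / 2.2M overall.
sample_rate = 8_000 / 32_000
population_rate = 8_000 / 2_200_000

# A score equal to the sample base rate maps back to the population base rate.
p_adjusted = adjust_probability(0.25, sample_rate, population_rate)
print(f"adjusted probability: {p_adjusted:.6f}")
```

To compare sampling rates, you would fit one model per rate, score the
same hold-out group with each, and keep the rate whose model gives the
highest value of top_fraction_lift. The intercept shift moves every
score in the same direction without changing their order, so it
matters for the predicted rates in point 1 but not for the lift
comparison in point 2.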