Date: Tue, 4 Mar 2003 12:54:42 -0500
Reply-To: Peter Flom <flom@NDRI.ORG>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Peter Flom <flom@NDRI.ORG>
Subject: Overfitting references
Content-Type: text/plain; charset=US-ASCII
Apologies for cross-posting.
Earlier today, a colleague told me she had read an article that used
backward elimination in linear regression. There were 52 cases and 15
variables. (!) She asked me if that was a problem. I gave her an
emphatic yes.
But it got me thinking.
I have seen some rules of thumb for "number of cases per variable" in
regression. But is there much empirical literature on how regressions
perform under various combinations of N, number of IVs, and
model selection method?
For example, one interesting idea is to use random data and see how
often nominally significant p-values are obtained with different N,
different numbers of IVs, and different selection methods.
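
A minimal sketch of that idea, written in Python rather than SAS purely for
illustration. Everything in it is an assumption of the sketch, not something
taken from a particular paper: the function names (ols_t_stats,
backward_eliminate, false_positive_rate), the 0.05 stay criterion, the use of
independent standard normal noise for both the outcome and the IVs, and a
normal approximation to the t distribution when computing p-values (which is
somewhat anti-conservative when N is small relative to the number of IVs).
It estimates how often backward elimination keeps at least one "significant"
IV when none of them is related to the outcome.

```python
import random
from statistics import NormalDist


def ols_t_stats(X, y):
    """t statistics for each column of X (include an intercept column yourself).

    Assumes more rows than columns and a non-singular X'X.
    """
    n, k = len(X), len(X[0])
    # Normal equations: X'X and X'y
    xtx = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
           for i in range(k)]
    xty = [sum(X[r][i] * y[r] for r in range(n)) for i in range(k)]
    # Invert X'X by Gauss-Jordan elimination with partial pivoting
    aug = [xtx[i] + [1.0 if i == j else 0.0 for j in range(k)] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        piv = aug[c][c]
        aug[c] = [v / piv for v in aug[c]]
        for r in range(k):
            if r != c:
                f = aug[r][c]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[c])]
    inv = [row[k:] for row in aug]
    beta = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    resid = [y[r] - sum(X[r][j] * beta[j] for j in range(k)) for r in range(n)]
    s2 = sum(e * e for e in resid) / (n - k)  # residual variance
    return [beta[i] / (s2 * inv[i][i]) ** 0.5 for i in range(k)]


def backward_eliminate(xs, y, alpha=0.05):
    """Drop the IV with the largest p-value until all remaining p <= alpha.

    xs is a list of rows of IV values (no intercept column).
    Returns the column indices of the IVs that are kept.
    """
    keep = list(range(len(xs[0])))
    z = NormalDist()
    while keep:
        X = [[1.0] + [row[j] for j in keep] for row in xs]
        t = ols_t_stats(X, y)[1:]  # skip the intercept
        # Two-sided p-values, normal approximation to t (an assumption)
        pvals = [2 * (1 - z.cdf(abs(v))) for v in t]
        worst = max(range(len(keep)), key=lambda i: pvals[i])
        if pvals[worst] <= alpha:
            break
        keep.pop(worst)
    return keep


def false_positive_rate(n, k, reps=100, alpha=0.05, seed=0):
    """Share of pure-noise data sets in which at least one IV survives."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        xs = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(n)]
        y = [rng.gauss(0, 1) for _ in range(n)]  # unrelated to every IV
        if backward_eliminate(xs, y, alpha):
            hits += 1
    return hits / reps


if __name__ == "__main__":
    # The case from the article (52 cases, 15 IVs) next to a larger N
    for n in (52, 300):
        print(n, false_positive_rate(n, 15, reps=50))
```

Varying n, k, and alpha in false_positive_rate, or swapping in a different
selection routine for backward_eliminate, gives the grid of combinations
described above. The pure-Python linear algebra is slow, so a serious run
would want a proper regression routine and many more replications.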
Any pointers to existing literature would be appreciated.
Thanks
Peter
Peter L. Flom, PhD
Assistant Director, Statistics and Data Analysis Core
Center for Drug Use and HIV Research
National Development and Research Institutes
71 W. 23rd St
New York, NY 10010
(212) 845-4485 (voice)
(917) 438-0894 (fax)