Date: Tue, 15 Jan 2008 13:28:05 -0500
Reply-To: Peter Flom <firstname.lastname@example.org>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Peter Flom <peterflomconsulting@MINDSPRING.COM>
Subject: Re: Is GENMOD Stuck???
Content-Type: text/plain; charset=UTF-8
Wensui Liu <liuwensui@GMAIL.COM> wrote
>A 1-2% sample is one of the most interesting guidelines I've seen.
>The guideline I usually follow is picked up from Hastie 'elements of
>statistical learning', which says 50% for training, 25% for
>validation, and 25% for testing. He could be wrong though. ^_^.
>It seems different games have different rules.
Hastie isn't wrong, and E of SL is a great book.
But he's answering a different question. His guidelines are about how to split up reasonably sized data sets.
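For readers who want to see that rule of thumb in concrete terms, here is a minimal sketch of a 50/25/25 split (in Python rather than SAS, purely for illustration; the function name and seed are my own choices, not from the thread or from Hastie):

```python
import random

def train_val_test_split(records, seed=42):
    """Shuffle and split a list of records into 50% training,
    25% validation, and 25% test -- the Hastie et al. rule of
    thumb discussed above."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = n // 2
    n_val = n // 4
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

train, val, test = train_val_test_split(list(range(1000)))
print(len(train), len(val), len(test))  # 500 250 250
```

In SAS one would typically do the same thing with PROC SURVEYSELECT or a uniform random number in a DATA step.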
What is 'reasonable'? Well, that depends on your field, the complexity of your model, and your computing power. But it's hard to see a case where millions of observations would do anything except slow down the computer. Sure, more data makes for more precise estimates, but how precise do you really need an estimate to be? Even if your model is very, very accurate, the model error is going to totally swamp the sampling error with millions of records.
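To put a number on "how precise": the standard error of a sample mean scales as sd/sqrt(n), so going from tens of thousands of records to millions shrinks the sampling error only modestly. A quick illustration (Python for convenience; sd = 1 is an arbitrary assumption just to show the scaling):

```python
import math

# Standard error of a sample mean is sd / sqrt(n), so each
# 100x increase in n only shrinks the SE by a factor of 10.
sd = 1.0
for n in (10_000, 100_000, 1_000_000):
    se = sd / math.sqrt(n)
    print(f"n = {n:>9,}: SE = {se:.5f}")
```

At n = 10,000 the SE is already 0.01; a hundred times more data only drops it to 0.001, long after model misspecification dominates.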