Date: Fri, 9 Dec 2005 10:52:38 -0500
Reply-To: Sigurd Hermansen <HERMANS1@WESTAT.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Sigurd Hermansen <HERMANS1@WESTAT.COM>
Subject: Re: Subject: Logist Model Build--How big a dataset to use
Content-Type: text/plain; charset="us-ascii"
Actually the polymath Jerome Friedman (who must program statistical
algorithms while sleeping) has recently added (with co-author Bogdan
Popescu) predictive learning with rule ensembles to his and his
colleagues' TreeNet, MART, and CART programs (to name a few). Learning
rule ensembles does indeed generate 'base learners' randomly, and it
draws random sub-samples as well. Combining results obtained from many
small sub-samples, according to the authors, reduces correlations among
ensemble members. The authors suggest that the same procedure for
learning from rule ensembles could be applied with other base learners,
but admit that so far they have tried it only with decision trees.
Sig
-----Original Message-----
From: owner-sas-l@listserv.uga.edu [mailto:owner-sas-l@listserv.uga.edu]
On Behalf Of David L Cassell
Sent: Thursday, December 08, 2005 7:40 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: Subject: Logist Model Build--How big a dataset to use
Sig sagely replied:
>What about the idea of taking a random sample of variables? That would
>take care of those pesky problems with step-wise selection!
And, of course, do it with PROC SURVEYSELE... Oh you're kidding.
:-) :-)
>On a more serious note, I do see data mining experts' advising
>colleagues to sample rows (observations) of very large datasets and use
>the sample to develop statistical models. In fact, it's the first step
>in SAS's recommended SEMMA strategy (sample, explore, modify, model,
>assess), though described as a result of representative sampling.
>(Don't know how that works when one already has a dataset to analyze.)
Well, that IS using PROC SURVEYSELECT. Or something like it. The
stumbling block, IMHO, is the 'representative' part. Everyone who
starts with data sets too large to chuck whole into, say, PROC CLUSTER,
tends to do simple random sampling (SRS) instead of something possibly
more efficient... or more robust: control sampling, or sampling with
multipliers, or sampling with strata. Any of these, given the situation,
might do a better job of spreading the sample across the data and
covering the entire data space. SRS does tend to leave holes here and
there, and give you lumps in other places. It's the nature of the beast.
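To make the 'holes and lumps' point concrete, here is a minimal sketch in Python (standard library only) that compares SRS with sampling by strata. The data, the function names, and the 2% sampling fraction are all assumptions made up for illustration; in SAS the natural tool would be PROC SURVEYSELECT with a STRATA statement, which this sketch does not try to reproduce.

```python
import random
from collections import defaultdict

def simple_random_sample(rows, n, seed=0):
    """Draw n rows without replacement, ignoring any structure in the data."""
    return random.Random(seed).sample(rows, n)

def stratified_sample(rows, stratum_of, frac, seed=0):
    """Draw about `frac` of each stratum, and at least one row per stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[stratum_of(row)].append(row)
    sample = []
    for label in sorted(strata):
        members = strata[label]
        k = max(1, round(frac * len(members)))  # never skip a small stratum
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical data: one large common segment and one small rare segment.
rows = [("common", i) for i in range(990)] + [("rare", i) for i in range(10)]

srs = simple_random_sample(rows, 20)
strat = stratified_sample(rows, stratum_of=lambda r: r[0], frac=0.02)

rare_in_srs = sum(1 for label, _ in srs if label == "rare")
rare_in_strat = sum(1 for label, _ in strat if label == "rare")
```

The stratified draw is guaranteed to include the rare segment, while the SRS draw may or may not, depending on the seed. That guarantee is the whole design choice: you trade a little bookkeeping for coverage of the data space.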
>Now it does make sense, when the number of observations allows, to divide
>a data source randomly into training, test, and validation samples, and
>set the test and validation samples aside (no peeking). Even so, that
>may still leave a lot of observations to process.
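For concreteness, a minimal Python sketch of such a three-way split follows. The 60/20/20 fractions, the seed, and the function name are assumptions chosen for illustration; nothing in the thread prescribes them.

```python
import random

def three_way_split(rows, train_frac=0.6, test_frac=0.2, seed=0):
    """Shuffle once, then cut into training, test, and validation parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(train_frac * len(shuffled))
    n_test = round(test_frac * len(shuffled))
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # whatever is left over
    return train, test, validation

# Hypothetical data: 1000 observation ids.
train, test, validation = three_way_split(range(1000))
```

Because the three parts are slices of one shuffled list, every observation lands in exactly one of them, which is what makes 'no peeking' enforceable.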
There are always iterative solutions which take a new random sample on
every iteration. Stochastic gradient boosting falls into this category.
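As a rough illustration of the 'new random sample on every iteration' idea, here is a toy stochastic gradient boosting loop in Python (standard library only). It assumes squared-error loss, one numeric predictor, and single-split regression stumps as base learners; the function names, the step-shaped data, and the parameter values are all hypothetical. It is a sketch of the general technique, not Friedman's MART or TreeNet code.

```python
import random

def fit_stump(xs, residuals):
    """Least-squares single-split stump; returns (threshold, left, right) or None."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return None if best is None else best[1:]

def boost(xs, ys, n_rounds=50, subsample=0.5, learning_rate=0.1, seed=0):
    """Fit each stump to residuals on a fresh random sub-sample of the rows."""
    rng = random.Random(seed)
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    k = max(2, int(subsample * len(xs)))
    for _ in range(n_rounds):
        idx = rng.sample(range(len(xs)), k)      # new sub-sample every round
        sub_x = [xs[i] for i in idx]
        sub_r = [ys[i] - preds[i] for i in idx]  # residuals under squared error
        stump = fit_stump(sub_x, sub_r)
        if stump is None:
            continue
        threshold, left, right = stump
        stumps.append(stump)
        # Update predictions for ALL rows, not just the sampled ones.
        for i, x in enumerate(xs):
            preds[i] += learning_rate * (left if x <= threshold else right)
    return base, learning_rate, stumps

def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(l if x <= t else r for t, l, r in stumps)

# Hypothetical data: a step function of one predictor.
xs = [i / 10 for i in range(100)]
ys = [1.0 if x > 5 else 0.0 for x in xs]

model = boost(xs, ys)
mean_y = sum(ys) / len(ys)
mse_base = sum((y - mean_y) ** 2 for y in ys) / len(ys)
mse_model = sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
```

Each round only ever touches half the rows when fitting, which is the point for large data sets: the cost per iteration drops, and the base learners see different data each time.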
David
--
David L. Cassell
mathematical statistician
Design Pathways
3115 NW Norwood Pl.
Corvallis OR 97330