LISTSERV at the University of Georgia
Date:         Fri, 9 Dec 2005 10:52:38 -0500
Reply-To:     Sigurd Hermansen <HERMANS1@WESTAT.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Sigurd Hermansen <HERMANS1@WESTAT.COM>
Subject:      Re: Subject: Logist Model Build--How big a dataset to use
Comments: To: David L Cassell <davidlcassell@msn.com>
Content-Type: text/plain; charset="us-ascii"

Actually the polymath Jerome Friedman (who must program statistical algorithms while sleeping) has recently added (with co-author Bogdan Popescu) predictive learning with rule ensembles to his and his colleagues' TreeNet, MART, and CART programs (to name a few). Learning of rule ensembles does indeed generate 'base learners' at random, and draws random sub-samples as well. Combining results obtained from many small subsamples, according to the authors, reduces correlations among ensemble members. The authors suggest that the same procedure for learning from rule ensembles could be applied with other base learners, but admit that so far they have applied it only with decision trees.

Sig

-----Original Message-----
From: owner-sas-l@listserv.uga.edu [mailto:owner-sas-l@listserv.uga.edu] On Behalf Of David L Cassell
Sent: Thursday, December 08, 2005 7:40 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: Subject: Logist Model Build--How big a dataset to use

Sig sagely replied:
>What about the idea of taking a random sample of variables? That would
>take care of those pesky problems with step-wise selection!

And, of course, do it with PROC SURVEYSELE... Oh you're kidding. :-) :-)

>On a more serious note, I do see data mining experts advising
>colleagues to sample rows (observations) of very large datasets and use
>the sample to develop statistical models. In fact, it's the first step
>in SAS's recommended SEMMA strategy (sample, explore, modify, model,
>assess), though described as a result of representative sampling.
>(Don't know how that works when one already has a dataset to analyze.)

Well, that IS using PROC SURVEYSELECT. Or something like it. The stumbling block, IMHO, is the 'representative' part. Everyone who starts with data sets too large to chuck whole into, say, PROC CLUSTER, tends to do simple random sampling (SRS), instead of something possibly more efficient... or more robust. Control sampling or sampling with multipliers or sampling with strata. Any of these, given the situation, might do a better job of spreading the sample across the data and covering the entire data space. SRS does tend to leave holes here and there, and give you lumps in other places. It's the nature of the beast.
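To make the strata point concrete, here is a minimal Python sketch (not SAS, and not what PROC SURVEYSELECT does internally) of sampling with strata under proportional allocation. The function name `stratified_sample`, the toy data, and the rule of taking at least one row per stratum are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(rows, stratum_of, fraction, seed=0):
    """Draw roughly the same fraction from every stratum (proportional allocation)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[stratum_of(row)].append(row)
    sample = []
    for key in sorted(strata):
        members = strata[key]
        # Assumption: take at least one row per stratum so no stratum is left empty.
        n = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, n))
    return sample

# Hypothetical data: 950 rows in a common stratum "A", 50 rows in a rare stratum "B".
rows = [("A", i) for i in range(950)] + [("B", i) for i in range(50)]
sample = stratified_sample(rows, stratum_of=lambda r: r[0], fraction=0.02)
counts = {k: sum(1 for r in sample if r[0] == k) for k in ("A", "B")}
print(counts)
```

A simple random sample of 20 rows from the same data would miss stratum "B" entirely a fair share of the time (roughly one draw in three), which is the kind of hole described above. The stratified draw always includes it.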

>Now it does make sense when the number of observations allows one to divide
>a data source randomly into training, test, and validation samples, and
>set the test and validation samples aside (no peeking). Even so, that
>may still leave a lot of observations to process.
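A random three-way division of this kind can be sketched in a few lines of Python. The function name `three_way_split` and the 60/20/20 proportions are assumptions for illustration; the thread does not specify any split sizes.

```python
import random

def three_way_split(rows, train=0.6, test=0.2, seed=0):
    """Shuffle once, then cut into training, test, and validation parts."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_train = round(train * len(shuffled))
    n_test = round(test * len(shuffled))
    train_part = shuffled[:n_train]
    test_part = shuffled[n_train:n_train + n_test]
    valid_part = shuffled[n_train + n_test:]  # whatever remains
    return train_part, test_part, valid_part

# Hypothetical data: 1000 observation ids.
train_rows, test_rows, valid_rows = three_way_split(range(1000))
print(len(train_rows), len(test_rows), len(valid_rows))
```

Fixing the seed keeps the held-out parts identical across runs, so the test and validation rows stay set aside rather than being redrawn each time.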

There are always iterative solutions which take a new random sample on every iteration. Stochastic gradient boosting falls into this category.
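As a rough illustration of the idea, here is a small Python sketch of stochastic gradient boosting for squared error on a single numeric feature, with one-split regression stumps as the base learner. It is a toy under stated assumptions and not the MART or TreeNet implementation: the names `fit_stump`, `boost`, and `predict`, the learning rate, the 50% subsample, and the step-function data are all made up for the example.

```python
import random

def fit_stump(xs, residuals):
    """Least-squares regression stump: returns (threshold, left_mean, right_mean)."""
    pairs = sorted(zip(xs, residuals))
    n = len(pairs)
    total = sum(r for _, r in pairs)
    best = None
    left_sum = 0.0
    for i in range(1, n):
        left_sum += pairs[i - 1][1]
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between equal x values
        left_mean = left_sum / i
        right_mean = (total - left_sum) / (n - i)
        # Maximizing this quantity is equivalent to minimizing squared error.
        gain = i * left_mean ** 2 + (n - i) * right_mean ** 2
        if best is None or gain > best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (gain, threshold, left_mean, right_mean)
    if best is None:  # all x equal: fall back to a constant
        mean = total / n
        return (float("inf"), mean, mean)
    return best[1:]

def boost(xs, ys, rounds=200, rate=0.1, subsample=0.5, seed=0):
    rng = random.Random(seed)
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    k = max(2, int(subsample * len(xs)))
    for _ in range(rounds):
        # A fresh random subsample of rows on every iteration.
        idx = rng.sample(range(len(xs)), k)
        stump = fit_stump([xs[i] for i in idx], [ys[i] - preds[i] for i in idx])
        stumps.append(stump)
        t, left, right = stump
        for i, x in enumerate(xs):
            preds[i] += rate * (left if x <= t else right)
    return (base, rate, stumps)

def predict(model, x):
    base, rate, stumps = model
    return base + rate * sum(left if x <= t else right for t, left, right in stumps)

# Hypothetical data: a step function on [0, 2).
xs = [i / 100 for i in range(200)]
ys = [1.0 if x < 1.0 else 3.0 for x in xs]
model = boost(xs, ys)
baseline_mse = sum((y - model[0]) ** 2 for y in ys) / len(ys)
mse = sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
print(baseline_mse, mse)
```

Each round fits a stump to only half of the rows, yet the update is applied to every row, so no single iteration ever has to process the full data set to fit its base learner.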

David
--
David L. Cassell
mathematical statistician
Design Pathways
3115 NW Norwood Pl.
Corvallis OR 97330


