Date: Fri, 16 Nov 2001 23:23:57 +0000
Reply-To: Michael Friendly
Sender: "SAS(r) Discussion"
From: Michael Friendly
Organization: York University
Subject: Re: shorthand in sas reg model?

In article <OFBB9DEA8A.B145D434-ON88256B06.00711ACE@rtp.epa.gov>
Cassell.David@EPAMAIL.EPA.GOV (David L. Cassell) writes:
|kataliu <richardliu@NORTHWESTERN.EDU> wrote:
|> I am new to sas.
|
|I wonder if you are also new to inferential statistics, based on:
|
|> When I use the regression model in sas, I find that
|> I have lots of independent variables. Therefore, I
|> have to type each one in sas codes.
|>
|> For example,
|>
|> PROC REG DATA=...;
|> MODEL TARGET = ACOL1 MTGAT DKL ...../SELECTION = STEPWISE;
|>                ^^^^^^^^^^^^^^^^^^^^^
|>                over 200 variables with different name!!
|> RUN;
|
|This is a problem waiting to happen. That many variables and a
|stepwise selection procedure will help you to fit.. well.. most likely
|a lot of garbage. Measurement error, collinearity, non-interpretability
|of 'relative importance', and a host of other issues will plague you.
|With 200 variables, you could have 200 sources of random noise and
|you'll probably get this to fit a swell model for you [where 'swell'
|is unspecified]. Such a thing has happened - and been printed for all
|to see in the literature - more times than you want to know.

Below is a simple but effective teaching example I have used for a while to demonstrate the perils of blind stepwise selection: generate 100 N(0,1) predictors and an independent N(0,1) y. Toss them into stepwise selection, and, hey, you can get an R^2 of .25 or maybe greater. But generate two similar samples, use the model selected by each to cross-validate the other, and, whoa, the R^2 drops to non-significance.
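A rough back-of-envelope check on why an R^2 of that size is no surprise, assuming plain forward selection with slentry=.05 and n = 200 as in the listing further down. The notation (SSE_k, F_k, R_k^2 for the model after k variables have entered) is mine, introduced only for this sketch. A variable enters at step k only if its partial F exceeds the .05 critical value, about 3.89 for 1 and roughly 190 df, and every such entry must shrink the error sum of squares by at least a fixed factor:

```latex
\[
F_k = \frac{SSE_{k-1} - SSE_k}{SSE_k/(n-k-1)}
\quad\Longrightarrow\quad
1 - R_k^2 = \frac{1 - R_{k-1}^2}{1 + F_k/(n-k-1)} .
\]
With $n = 200$ and $F_k \ge 3.89$ at each of 12 steps,
\[
1 - R_{12}^2 \;\le\; \prod_{k=1}^{12} \frac{1}{1 + 3.89/(199-k)} \;\approx\; 0.79,
\qquad\text{so}\qquad R_{12}^2 \gtrsim 0.21 .
\]
```

So twelve entries that each barely clear the .05 bar already push R^2 past .2 with no real predictors at all. By contrast, 12 noise predictors fixed in advance would give an expected R^2 of only 12/199, about .06, which is roughly what to expect when a selected model is carried over to fresh data.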

The stepsim.sas code below depends on how the seed for the normal() function is used on your machine, so your random numbers, and therefore the variables selected, may differ from mine. I ran the first reg step once, then used the variables it selected in the models for the last step; substitute the ones your own run selects. Another useful variation is to add 100 random N(0,1) predictors, X1-X100, to a real model. Students are amazed at how often the X variables turn up among the ``real predictors''.
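Before the main listing, here is a minimal sketch of that variation, separate from stepsim.sas. It is not the data I use in class: the data set SASHELP.CARS (shipped with recent SAS releases), the response MPG_City, and the three "real" predictors are all assumptions chosen only so that the example is self-contained.

```sas
* Sketch only: SASHELP.CARS, MPG_City and the real predictors are assumptions;
data carsnoise;
   set sashelp.cars;
   array x{100} x1-x100;
   *-- add 100 noise predictors, unrelated to anything in the data;
   do i = 1 to 100;
      x{i} = normal(6752343);
   end;
   drop i;
run;

proc reg data=carsnoise;
   model mpg_city = horsepower weight enginesize x1-x100
         / selection=forward slentry=.05;
run;
quit;
```

Count how many of x1-x100 show up in the final model alongside the real predictors; with 100 noise candidates screened at .05, a handful usually do.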

----- stepsim.sas ----
title 'Stepwise simulation example - NO real predictors';
* Generate two sets of data: 100 random predictors, 200 observations;

data sim;
   array x{100} x1-x100;
   do testset = 1 to 2;
      do n = 1 to 200;
         *-- generate the predictors-- all independent, just noise;
         do i = 1 to 100;
            x(i) = normal(6752343);
         end;
         *-- generate the criterion-- no relation to any of the Xs;
         y = normal(7654321);
         output;
      end;
   end;
run;

proc reg;
   by testset;
   model y = x1-x100 / selection=forward slentry=.05;
run;

/* Now see how well each prediction equation does in the other data set.
   - Each model should do well on the data set for which it was selected,
     but poorly on the other set of data */

title2 'Testing cross-validation';
proc reg data=sim;
   by testset;
   M1: model y = x13 x75 x5 x25 x82 x10 x38 x87 x94 x93 x29 x97;
   M2: model y = x78 x14 x30 x25 x9 x4;
run;
quit;
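The last step above refits each model in both data sets, so the coefficients are re-estimated each time. A stricter check, sketched here as one possible follow-up rather than part of stepsim.sas, freezes the coefficients estimated in one data set and applies them unchanged to the other. The sketch assumes that M1 is the model selected from testset 1; the data set names est1 and scored are made up for illustration.

```sas
* Sketch only: assumes M1 was the model selected from testset=1;
proc reg data=sim outest=est1 noprint;
   where testset = 1;
   M1: model y = x13 x75 x5 x25 x82 x10 x38 x87 x94 x93 x29 x97;
run;
quit;

*-- apply the testset 1 coefficients to every observation;
*-- the predicted values land in a variable named after the model label, M1;
proc score data=sim score=est1 out=scored type=parms;
   var x13 x75 x5 x25 x82 x10 x38 x87 x94 x93 x29 x97;
run;

*-- the squared correlation of y with M1 in testset 2 is the cross-validated R^2;
proc corr data=scored;
   by testset;
   var y;
   with M1;
run;
```

In testset 1 the squared correlation reproduces the fitted R^2; in testset 2 it should fall to about what chance alone would give.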

--
Michael Friendly              Email: friendly@yorku.ca (NeXTmail OK)
Psychology Dept
York University               Voice: 416 736-5115 x66249  Fax: 416 736-5814
4700 Keele Street             http://www.math.yorku.ca/SCS/friendly.html
Toronto, ONT  M3J 1P3 CANADA
