Date: Tue, 22 Jan 2002 13:27:15 -0800
Reply-To: Dale McLerran <stringplayer_2@YAHOO.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Dale McLerran <stringplayer_2@YAHOO.COM>
Subject: Re: Dummy variables in regressions
Content-Type: text/plain; charset=us-ascii
--- Roger Lustig <rlustig@CBDCREDIT.COM> wrote:
> If you have three segment types, you only need two dummies! After all,
> if it's not seg1 or seg2, it's necessarily seg3.
> If you put in a third dummy, the regression you run will wind up trying
> to divide things by zero, and will produce error messages and similar
> unpleasantness. (Think of solving three simultaneous equations with
> three unknowns, where two of the equations are the same.)
True if you fit the regression attempting to use the unique inverse of
X'X. When there is a column of X which is linearly dependent on other
columns of X, then a unique inverse does not exist and all these nasty
consequences which you mention would come to pass. However, that is
not the approach which SAS employs in any of the regression procedures
which I can think of. SAS employs a sweep operation which sets every
element of the i-th row and column of the inverse of X'X to zero when
the i-th column of X is a linear combination of columns already in the
model. This is a generalized inverse solution. The generalized inverse
solution is perfectly valid. However, rearranging the columns of X
would return a different solution, so there is no single solution to
the regression problem. In fact, infinitely many solutions can be
constructed, all of which produce the same predicted values. The
particular solution which SAS returns will yield a zero parameter
estimate for each column which is linearly dependent on columns which
precede it in the design matrix. SAS may then label the estimates for
variables which are involved in some linear dependency as being biased.
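To make the column-order point concrete, here is a minimal sketch on
made-up data. The data set name segdemo, the variable names, and the
coefficients used to simulate y are all hypothetical, chosen only for
illustration. Because seg1 + seg2 + seg3 equals the intercept column,
adding any constant c to the intercept and subtracting c from each dummy
coefficient leaves every predicted value unchanged, which is why no
single solution exists. With the intercept entered first, I would expect
PROC REG to zero out whichever dummy comes last in each MODEL statement
and to print its note that the model is not full rank.

```sas
/* Hypothetical data: three segment types and one continuous predictor */
data segdemo;
   call streaminit(2002);
   do i = 1 to 60;
      segtype = mod(i, 3) + 1;        /* cycles through 1, 2, 3 */
      seg1 = (segtype = 1);           /* hand-built 0/1 indicators */
      seg2 = (segtype = 2);
      seg3 = (segtype = 3);
      x1 = rand('normal');
      y  = 2 + 0.5*x1 + seg1 - seg2 + rand('normal');
      output;
   end;
   drop i;
run;

/* Same model, two column orders. The last dummy in each MODEL        */
/* statement is linearly dependent on the columns that precede it.    */
proc reg data=segdemo;
   Order_123: model y = x1 seg1 seg2 seg3;   /* seg3 should be set to 0 */
   Order_321: model y = x1 seg3 seg2 seg1;   /* seg1 should be set to 0 */
run;
quit;
```

The two fits should report different parameter estimates but the same
error sum of squares and the same predicted values.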
> Now the good news: you don't need to use dummy variables at all!
>    proc glm;
>       class segtype;
>       model Y = x1 x2 x3 segtype;
>    run;
> PROC GLM will do the dummying for you. (So will CATMOD, LOGISTIC,
> GENMOD, etc. when you use them. In fact, you can use formats to group
> values into categories in all those procs.)
And when you run this code, you will observe that the design matrix
was constructed with a column representing all three segment types,
yet the regression model did not self-destruct. What this approach
gains over what Yvette has previously attempted to do on her own is
to remove the need for her to construct the dummy variables at all.
The class statement in the regression procedure will construct those
dummy variables for her, and enter all of the dummy variables in the
model. The code is simpler to implement, and I heartily endorse
the use of class variables to generate all of the necessary indicator
variables! However, nothing is gained computationally.
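As a small sketch of what that looks like in practice, the step below
simulates a hypothetical data set (the name segclass, the variables, and
the coefficients are assumptions for illustration only) and fits the
CLASS version of the model. The SOLUTION option asks PROC GLM to print
the parameter estimates and the INVERSE option asks for the generalized
inverse of X'X. I would expect one estimate per segment type, with the
last level set to zero and the other estimates involved in the
dependency flagged with the letter B.

```sas
/* Hypothetical data: no hand-built dummies, just the raw segment code */
data segclass;
   call streaminit(2002);
   do i = 1 to 60;
      segtype = mod(i, 3) + 1;        /* cycles through 1, 2, 3 */
      x1 = rand('normal');
      y  = 2 + 0.5*x1 + (segtype = 1) - (segtype = 2) + rand('normal');
      output;
   end;
   drop i;
run;

/* CLASS builds an indicator column for every level of segtype */
proc glm data=segclass;
   class segtype;
   model y = x1 segtype / solution inverse;
run;
quit;
```

The row and column of the printed inverse that correspond to the last
segtype level should be all zeros, which is the generalized inverse
behavior described earlier.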
Fred Hutchinson Cancer Research Center
Ph: (206) 667-2926
Fax: (206) 667-5977