```
Date: Tue, 22 Jan 2002 13:27:15 -0800
Reply-To: Dale McLerran
Sender: "SAS(r) Discussion"
From: Dale McLerran
Subject: Re: Dummy variables in regressions
Comments: To: Roger Lustig
In-Reply-To: <3C4DC55B.2000300@cbdcredit.com>
Content-Type: text/plain; charset=us-ascii

--- Roger Lustig wrote:
> Yvette:
> If you have three segment types, you only need two dummies! After all,
> if it's not seg1 or seg2, it's necessarily seg3.

True.

>
> If you put in a third dummy, the regression you run will wind up trying
> to divide things by zero, and will produce error messages and similar
> unpleasantness. (Think of solving three simultaneous equations with
> three unknowns, where two of the equations are the same.)

True if you fit the regression attempting to use the unique inverse
of X'X. When there is a column of X which is linearly dependent on
other columns of X, then a unique inverse does not exist and all
these nasty consequences which you mention would come to pass.
However, that is not the approach which SAS employs in any of the
regression procedures which I can think of. SAS employs a sweep
operation which returns i-th row/column elements all zero in the
inverse of X'X when the i-th column of X is a linear combination of
other variables already in the model. This is a generalized inverse
solution.

The generalized inverse solution is perfectly valid. However,
rearrangement of the columns of X would return a different solution,
so there is not a single solution to the regression problem. In
fact, there are an infinite number of solutions that could be
constructed for the regression problem. The particular solution
which SAS returns will yield a zero parameter estimate for each
column which is linearly dependent on other columns which precede it
in the design matrix. SAS may then label the estimates for variables
which are involved in some linear dependency as being biased.

>
> Now the good news: you don't need to use dummy variables at all!
>
> proc glm;
>   class segtype;
>   model Y = x1 x2 x3 segtype;
> run;
> quit;
>
> PROC GLM will do the dummying for you. (So will CATMOD, LOGISTIC,
> GENMOD, etc. when you use them. In fact, you can use formats to group
> values into categories in all those procs.)

And when you fit this code, you will observe that the design matrix
was constructed with a column representing all three segment types,
yet the regression model did not self-destruct.

What this approach gains over what Yvette has previously attempted
to do on her own is to remove the need for her to construct the
dummy variables at all. The class statement in the regression
procedure will construct those dummy variables for her, and enter
all of the dummy variables in the model. The code is simpler to
implement, and I heartily endorse the use of class variables to
generate all of the necessary indicator variables! However, nothing
is gained computationally.

>
> Best,
>
> Roger

Dale

=====
---------------------------------------
Dale McLerran
Fred Hutchinson Cancer Research Center
mailto: dmclerra@fhcrc.org
Ph:  (206) 667-2926
Fax: (206) 667-5977
---------------------------------------
```
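The point about the sweep operation and column order can be illustrated with a small sketch. This is plain Python, not SAS, and it is not SAS's actual implementation: the function names (`sweep_g2_inverse`, `fit`), the relative tolerance, and the six-observation data set are all assumptions made up for illustration. It sweeps X'X one column at a time, zeroes the row and column of any pivot that has collapsed to (near) zero, and then fits the same model with the three dummies entered in two different orders.

```python
def sweep_g2_inverse(a, rel_tol=1e-8):
    """Sweep each column of the symmetric matrix `a` in order.

    A column whose pivot is (near) zero after the earlier sweeps is
    treated as linearly dependent: its row and column are set to zero.
    Returns the generalized inverse and the indices of dependent columns.
    """
    n = len(a)
    m = [row[:] for row in a]                 # work on a copy
    diag0 = [a[k][k] for k in range(n)]       # original diagonal, for scaling
    dependent = []
    for k in range(n):
        d = m[k][k]
        if abs(d) <= rel_tol * abs(diag0[k]):  # hypothetical tolerance rule
            dependent.append(k)
            for j in range(n):
                m[k][j] = 0.0
                m[j][k] = 0.0
            continue
        for j in range(n):
            if j != k:
                m[k][j] /= d
        for i in range(n):
            if i == k:
                continue
            b = m[i][k]
            for j in range(n):
                if j != k:
                    m[i][j] -= b * m[k][j]
            m[i][k] = -b / d
        m[k][k] = 1.0 / d
    return m, dependent


def fit(columns, y):
    """Least squares via the swept X'X; `columns` are the columns of X."""
    p = len(columns)
    xtx = [[sum(u * v for u, v in zip(columns[i], columns[j]))
            for j in range(p)] for i in range(p)]
    xty = [sum(u * v for u, v in zip(columns[i], y)) for i in range(p)]
    g, dependent = sweep_g2_inverse(xtx)
    beta = [sum(g[i][j] * xty[j] for j in range(p)) for i in range(p)]
    fitted = [sum(beta[i] * columns[i][r] for i in range(p))
              for r in range(len(y))]
    return beta, fitted, dependent


# Made-up data: three segments, two observations each.
seg = [1, 1, 2, 2, 3, 3]
y = [1.0, 3.0, 4.0, 6.0, 7.0, 11.0]
intercept = [1.0] * len(seg)
d1 = [1.0 if s == 1 else 0.0 for s in seg]
d2 = [1.0 if s == 2 else 0.0 for s in seg]
d3 = [1.0 if s == 3 else 0.0 for s in seg]

# Same model, dummies entered in two different orders.
beta_a, fitted_a, dep_a = fit([intercept, d1, d2, d3], y)
beta_b, fitted_b, dep_b = fit([intercept, d3, d2, d1], y)

print("order 1, d1, d2, d3:", beta_a, "dependent columns:", dep_a)
print("order 1, d3, d2, d1:", beta_b, "dependent columns:", dep_b)
print("fitted values agree:",
      all(abs(u - v) < 1e-9 for u, v in zip(fitted_a, fitted_b)))
```

In both orderings the last dummy entered is the one that gets a zero estimate, and the intercept takes on the mean of that segment, so the two coefficient vectors differ. The fitted values are the segment means either way, which is the sense in which each generalized inverse solution is valid even though none of them is unique.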
