Date: Thu, 24 Feb 2005 17:46:00 -0800
Reply-To: cassell.david@EPAMAIL.EPA.GOV
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: "David L. Cassell" <cassell.david@EPAMAIL.EPA.GOV>
Subject: Re: Relative Importance of Explanatory Variables,
Standardized Coefficients, STB option, etc.
In-Reply-To: <200502240134.j1O1YWee031634@listserv.cc.uga.edu>
Content-type: text/plain; charset=US-ASCII
Talbot Michael Katz <topkatz@MSN.COM> wrote:
> I'm gathering opinions, facts, anecdotes, etc., and what better place to
Are foaming-at-the-mouth rants okay too?
> start than with SAS-L? Today's number one question is this:
>
> What is the "best" way to measure the relative importance of explanatory
> variables in a model when the model includes class variables?
There isn't a 'best' way to measure relative importance even before
you add in class variables. See my previous rants and complaints on
this subject, all lovingly preserved in the SAS-L archives.
> Let's start with OLS. Textbooks often say that the standardized regression
> coefficients ("betas") measure the relative importance, and that's an
And it's wrong. Find me a textbook which says specifically that
"standardized regression coefficients measure the relative importance"
and I'll show you a textbook not written by a statistician.
> appealingly intuitive picture; if all the variables are placed on the same
> scale, then the betas show the effects of a unit change in any of the
> variables. This is even reasonable for logistic models. The SAS
It's appealing. Which is why so many statisticians have had to point out
that the natural, intuitive interpretation only works in the simplest
cases. If you have orthogonal variables and no interaction terms,
you're good to go. You can get that if you build your own experimental
designs, for instance. Otherwise, you have to deal with all manner of
correlations, multi-collinearity, suppressor variables, measurement errors,
etc. And the 'intuitive' idea falls apart.
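To make the suppressor-variable point concrete, here is a minimal sketch in Python (not SAS, purely for illustration). It uses the standard two-predictor formula for standardized coefficients written in terms of correlations. The correlation values are made up for the example and do not come from any data set.

```python
def standardized_betas(r_y1, r_y2, r_12):
    """Standardized OLS coefficients for a two-predictor model.

    r_y1, r_y2: correlations of y with x1 and x2.
    r_12:       correlation between x1 and x2.
    """
    denom = 1.0 - r_12 ** 2
    beta1 = (r_y1 - r_12 * r_y2) / denom
    beta2 = (r_y2 - r_12 * r_y1) / denom
    return beta1, beta2

# Hypothetical correlations: x2 has zero correlation with y in both cases.
orthogonal = standardized_betas(r_y1=0.5, r_y2=0.0, r_12=0.0)
correlated = standardized_betas(r_y1=0.5, r_y2=0.0, r_12=0.6)

print("orthogonal predictors:", orthogonal)  # beta1 = 0.5, beta2 = 0
print("correlated predictors:", correlated)  # roughly 0.78 and -0.47
```

With orthogonal predictors the betas simply reproduce the correlations with y. Once x1 and x2 are correlated, x2 picks up a sizable negative coefficient even though it is uncorrelated with y, and the coefficient on x1 is inflated. Ranking by absolute beta says very little here.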
> regression procedures will output the betas if the STB option is requested
> (the Enterprise Miner regression node outputs "Standardized Estimates" as a
> matter of course). However, it is documented in SAS and has been remarked
> in other threads here, that betas are not computed for class variables,
> which is reasonable because class variables cannot be standardized to the
> normal distribution. But certainly the concept of relative importance
> should still apply to class variables, so how do you measure it? This
> question is particularly resonant for Enterprise Miner users, since the
> variable selection node tends to turn all significant variables into class
> variables.
You might start with some of the works of William Kruskal. (That's THE
Kruskal of statistics.) His most accessible works on the subject are both
out of The American Statistician, which is not a *technical* stats journal.
Look up Vol 41 (Feb 1987) and Vol 43 (Feb 1989). Kruskal and Majors describe
'relative importance' using the phrase 'inherently vague concept of
neodescriptive statistics'. Like that one? I do. Evan Williams (and others)
have pointed out that relationships among the independent variables 'lead one
to question use of association measures from the bivariate marginals.'
In economics, you sometimes see people take 'relative importance' to be
proportional to beta_i * mu_i, where beta_i is the NON-standardized regression
coefficient and mu_i is the expectation of X_i. The interpretation of
the product is in terms of relative increase in the expected value when X_i
is increased by 1% of mu_i. So here's a popular setting where the use of
the standardized coefficient is considered to be sub-optimal (at least). And
this still does nothing to address the problems I mention above.
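The beta_i * mu_i arithmetic can be sketched as follows in Python (not SAS). The model, coefficients, and means are hypothetical numbers chosen only to make the calculation easy to follow.

```python
# Hypothetical fitted model: y = intercept + b_price*price + b_income*income.
intercept = 10.0
betas = {"price": -2.0, "income": 0.5}  # NON-standardized coefficients (made up)
means = {"price": 5.0, "income": 40.0}  # mu_i = E[X_i] (made up)

def expected_y(intercept, betas, means):
    """E[Y] evaluated at the means of the predictors."""
    return intercept + sum(betas[k] * means[k] for k in betas)

def pct_change_per_one_pct(intercept, betas, means):
    """Percent change in E[Y] when X_i is increased by 1% of mu_i.

    Raising X_i by 0.01*mu_i changes E[Y] by beta_i * 0.01 * mu_i,
    which is (beta_i * mu_i / E[Y]) percent of E[Y].
    """
    ey = expected_y(intercept, betas, means)
    return {k: betas[k] * means[k] / ey for k in betas}

products = {k: betas[k] * means[k] for k in betas}
effects = pct_change_per_one_pct(intercept, betas, means)

print("beta_i * mu_i:", products)
print("percent change in E[Y] per 1% of mu_i:", effects)
```

Note that this measure depends on the units and location of each X_i only through mu_i, and it is undefined in any useful sense when E[Y] is zero. It is a different convention from the standardized coefficient, not a fix for it.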
> I have been using the square roots of the Wald chi-square values (I call
> them "Wald t values," but I don't know if that's widely accepted
> terminology). For a univariate regression model, I believe this t value is
> equal to the beta value, so it seems like a reasonable proxy. Do you
> agree? Do you have any other ideas? I found a paper from the journal,
> Decision Sciences, that studies this issue in more depth (the authors don't
> like the use of betas or t values or p values, etc., for measuring relative
> importance): (http://home.wi.rr.com/jjrr/dsj.pdf) "A Framework for
> Measuring the Importance of Variables with Applications to Management
> Research and Decision Models," E.S. Soofi, J.J. Retzer, M. Yasai-Ardekani,
> Decision Sciences, Volume 31, Number 3, Summer 2000.
I'm not a big fan of this, but I really ought to look further into it.
Try looking at Kruskal's 1987 American Statistician paper too.
The problem is that for a univariate regression model, a LOT of things ought
to work, when they won't work in a more complicated setting. Even Kruskal's
approach (of averaging squared partial correlation coefficients over all
possible orderings of the independent variables) doesn't work when you
introduce issues of measurement error and other painfully real problems in
actual data. Kruskal has an example where things fail when the model is as
simple as Y = X plus an uncorrelated noise variable.
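For readers who want to see what "averaging over orderings" amounts to, here is a rough Python sketch (not SAS). It is one reading of the description above, not code from Kruskal's paper: for each ordering, a variable's squared partial correlation with y given the variables entered before it is computed as (R2_after - R2_before) / (1 - R2_before), and the results are averaged over all orderings. The function names and the correlation values are made up for illustration.

```python
from itertools import permutations

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def r_squared(subset, r_xx, r_xy):
    """R^2 of y regressed on the predictors in `subset`, from correlations."""
    if not subset:
        return 0.0
    idx = list(subset)
    a = [[r_xx[i][j] for j in idx] for i in idx]
    b = [r_xy[i] for i in idx]
    coef = solve(a, b)  # standardized coefficients for this subset
    return sum(c * r for c, r in zip(coef, b))

def average_over_orderings(r_xx, r_xy):
    """Average squared partial correlation of each predictor over orderings."""
    p = len(r_xy)
    totals = [0.0] * p
    orderings = list(permutations(range(p)))
    for order in orderings:
        entered = []
        for j in order:
            before = r_squared(entered, r_xx, r_xy)
            after = r_squared(entered + [j], r_xx, r_xy)
            totals[j] += (after - before) / (1.0 - before)
            entered.append(j)
    return [t / len(orderings) for t in totals]

# Hypothetical correlations: x2 is uncorrelated with y but correlated with x1.
r_xx = [[1.0, 0.6], [0.6, 1.0]]
r_xy = [0.5, 0.0]
importance = average_over_orderings(r_xx, r_xy)
print(importance)  # roughly 0.32 for x1 and 0.09 for x2
```

The cost grows as p! orderings, so this is only practical for a handful of predictors. It also takes the correlations as given, which is exactly where measurement error does its damage.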
Evan Williams once wrote "Concepts of relative importance are generally
without meaning unless there is a specific 'natural' ordering of the
regression variables."
Everyone insists that there has to be a 'most important' variable, and on
down, but assigning such measures is often guesswork, and bad guesswork at
that.
We all know that stepwise regression procedures fail to come up with ideal
sets of independent variables... and they're not even trying to rank the darn
things, much less fully account for all multi-collinearity. And typical
statistical approaches like these pretend that there's no such thing as
differing sizes of measurement errors, etc., etc.
So, in short, "you can't get there from here." Sorry.
David
--
David Cassell, CSC
Cassell.David@epa.gov
Senior computing specialist
mathematical statistician