Date: Mon, 12 May 2008 06:19:44 -0400
Reply-To: Peter Flom <firstname.lastname@example.org>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Peter Flom <peterflomconsulting@MINDSPRING.COM>
Subject: Re: Finding Effectiveness of each variable
Content-Type: text/plain; charset=UTF-8
I am getting confused.
In the message immediately below, you imply that the DV is the number of people churning in a region, but in the message at the bottom, you say it is the amount of usage. If the DV is a count, then (unless the counts are pretty large) you probably want to look at a count regression model.
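To make "count regression" a little more concrete, here is a minimal sketch in Python (standard library only) of a Poisson regression with a log link, fitted by Newton's method. The data are made up for illustration, and the names `fit_poisson`, `complaints`, and `churners` are hypothetical. In SAS, a model of this kind would normally be fitted with a procedure such as PROC GENMOD (DIST=POISSON) rather than by hand.

```python
from math import exp, log

def fit_poisson(xs, ys, iterations=50, tol=1e-10):
    """Fit log(E[y]) = b0 + b1*x by Newton's method on the Poisson likelihood."""
    # Start from an intercept-only fit: every mean equals the sample mean.
    b0, b1 = log(sum(ys) / len(ys)), 0.0
    for _ in range(iterations):
        mus = [exp(b0 + b1 * x) for x in xs]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(y - m for y, m in zip(ys, mus))
        g1 = sum((y - m) * x for x, y, m in zip(xs, ys, mus))
        # Information matrix (negative Hessian), a symmetric 2x2 matrix.
        h00 = sum(mus)
        h01 = sum(m * x for x, m in zip(xs, mus))
        h11 = sum(m * x * x for x, m in zip(xs, mus))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system explicitly.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Hypothetical data: complaint level and number of churners in six periods.
complaints = [0, 1, 2, 3, 4, 5]
churners = [1, 2, 4, 6, 11, 17]

b0, b1 = fit_poisson(complaints, churners)
fitted = [exp(b0 + b1 * x) for x in complaints]
print(f"intercept={b0:.3f} slope={b1:.3f}")
```

On the log scale, exp(b1) is the multiplicative change in the expected count for a one-unit increase in the predictor, which is a different kind of "weight" from a linear regression coefficient.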
Then there's the problem of level --- It is people who churn, correct? By looking at data only at the regional level, I think you run into problems of ecological inference --- that is, you are trying to say something about how individuals behave from data on how groups of people behave. Dangerous.
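A small made-up example of how ecological inference can mislead, sketched in Python. The numbers are entirely hypothetical: four regions of 100 customers each, constructed so that complaint counts and churn counts move together perfectly across regions, yet no individual complainer churns.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

REGION_SIZE = 100
# Hypothetical (complainers, churners) per region.
regions = [(10, 5), (20, 10), (30, 15), (40, 20)]

# Build individual records so that churners are never complainers:
# the first `complainers` people complain, the last `churners` people churn.
people = []
for complainers, churners in regions:
    for i in range(REGION_SIZE):
        complained = i < complainers
        churned = i >= REGION_SIZE - churners
        people.append((complained, churned))

# Region-level view: correlation between the two counts.
region_corr = pearson([c for c, _ in regions], [h for _, h in regions])

# Individual-level view: churn rate among complainers vs. non-complainers.
complainer_flags = [ch for co, ch in people if co]
other_flags = [ch for co, ch in people if not co]
churn_rate_complainers = sum(complainer_flags) / len(complainer_flags)
churn_rate_others = sum(other_flags) / len(other_flags)

print(region_corr, churn_rate_complainers, churn_rate_others)
```

The regional data alone cannot distinguish this situation from one in which complainers are exactly the people who churn.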
Then there's the problem that you have multiple data points per region.... you mentioned that your boss is fine with assuming independence, but this seems dangerous, as well.
Finally (at least for now) there's the question of variable selection (although, really, all the other questions need to be answered first). GLMSELECT is a good tool for variable selection when a GLM model is appropriate. But I am not even sure you want a multiple variable solution at all .... what about evaluating each variable on its own?
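As a rough sketch of what "evaluating each variable on its own" could look like, the Python below ranks a few complaint types by the absolute value of their simple correlation with the churn count. The complaint names and all numbers are hypothetical, and this ignores the count, level, and independence issues raised above. In SAS, something like PROC CORR would be the more natural tool.

```python
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical region-month records: churn counts and complaint counts.
churn = [12, 15, 9, 20, 18, 11]
complaint_counts = {
    "billing": [25, 31, 19, 41, 37, 23],
    "network": [5, 3, 6, 4, 5, 4],
    "support": [2, 4, 1, 3, 6, 2],
}

# One correlation per complaint type, each computed without the others.
correlations = {name: pearson(counts, churn)
                for name, counts in complaint_counts.items()}

# Rank by strength of association, regardless of sign.
ranking = sorted(correlations.items(), key=lambda kv: abs(kv[1]), reverse=True)

for name, r in ranking:
    print(f"{name:8s} r = {r:+.3f}")
```

With 300 complaint types and 12 records per region, some large correlations will appear by chance alone, so a ranking like this is a screening device rather than evidence of importance.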
cherish k <hawks_cherish@YAHOO.CO.IN> wrote
>The task at hand for me is to rank complaints in order of their importance to the number of people churning in a region. But please note that I am not working at customer level. The reason is that we see hardly any complainants churn. So, treating complaints as a general level of discontent among customers, we want to see whether (and identify which) complaints have a strong relation with the number of people churning in a region.
>An initial test (using only a subset of complaints) yielded a pretty decent model (r sqr = 0.49) with proc reg - stepwise (I know I shouldn't be using stepwise, but I used it for testing purposes and to see whether the hypothesis holds up), and the complaints that entered the stepwise model also made sense.
>Since the results are promising I want to pursue this further. I read an article written by Peter and David suggesting the use of proc GLMSELECT as a better alternative to proc reg - stepwise, using its LASSO and LAR options.
>Can I use proc GLMSELECT in the current context or are there better alternatives?
>Arthur Tabachneck <art297@NETSCAPE.NET> wrote: Cherish,
>I've been reading the discussion you and Peter have been having and, while
>it first sounded like a question of measuring variables' impact, it is
>starting to sound more like a classic churn question.
>Have you looked into possible data mining-type solutions, such as decision
>tree, logistic regression, or neural network modeling? In short, not
>looking to discover the contribution of each variable, but under which
>scenarios people are most likely to attrite.
>On Sun, 11 May 2008 08:13:56 +0100, cherish k
>>Thanks for pointing to the article.
>>From all the articles what I could gather is it's almost impossible to
>>rank variables if they are more than 10 because of the computation
>>But I somehow want to do the following. I have complaints data which has
>>close to some 300 complaints altogether. I want to establish a
>>correlation between people attriting and the complaints (not necessarily that
>>the person complaining needs to attrite). So I am trying to accumulate the
>>data at region level, and also each complaint at region level.
>>So I have, for every month and region, the number of people attrited and 300
>>variables (complaints), each holding the count of that complaint, and I have
>>data for a 1 year time period (which in turn means 12 records per region).
>>From this available information I want to know which are the top reasons
>>because of which many people attrite?
>>Which in turn requires me to know the weight (importance) of each
>>variable, which I will multiply by the count of complaints for every
>>month to see how the complaints are varying (doing) with each month.
>>One strict no - no method is stepwise regression. Are there any
>>Can you please point to any approximate method of what I want to do?
> wrote: Cherish
>>Item 167824 in the SAS-L archives at http://www.lexjansen.com/sugi/
>>or do the following google search
>>cassell kruskal katz sas-l "relative importance"
>>>From: cherish k
>>>Sent: May 10, 2008 2:26 PM
>>>Subject: Re: Finding Effectiveness of each variable
>>>Can somebody please point me to the article written by David.
>> wrote: cherish k wrote
>>>>I have a Stats related question.
>>>>I have a dataset with variables (assume 5 IV's) already defined, and the DV
>>>>is the amount of usage at Region level (it is always >= 0). Information is
>>>>collected monthly for each region (we have one year's data). So each
>>>>region will have 12 entries in the data.
>>>>Now through some means, I want to know which is the most significant
>>>>variable out of all the given variables, and also the weight of each
>>>>variable contributing to the whole equation.
>>>>To achieve this I have done the following.
>>>>Since the variables are not scaled, I first Z transformed all the
>>>>variables (including the DV), so that they are all on a comparable scale
>>>>(but the Z transformation was done separately at each Region level). Then I ran a linear
>>>>regression on all the variables (as of now I have run an intercept model;
>>>>I am not sure whether no intercept is better).
>>>>Since the variables are all on a comparable scale, can I take the
>>>>estimates as the weights of each variable?
>>>>Y = intercept + sigma(a(i)*x(i)); where a(i) is the estimate and x(i) is
>>>>So now from the following equation a(i) can be positive negative or
>>>>So can I take the importance of each variable as abs(a(i)) and then
>>>>rank order across the variables?
>>>>If the method is wrong, can somebody please suggest a way to do it?
>>>>One obvious flaw in the above method is that I am assuming independence
>>>>(which is ok as my boss is perfectly fine with it :-) )
>>>>Are there any other problems in the method?
>>>>Please help me. (If the method is totally wrong, kindly tell me if there
>>>>is an alternative way of doing it.) I am doing it in SAS (so the procs I
>>>>use are proc standard and proc reg).
>>>>If I am not clear with the problem, please let me know.
Peter L. Flom, PhD
www DOT peterflom DOT com