Date: Mon, 12 May 2008 12:57:36 +0100
Reply-To: cherish k <hawks_cherish@YAHOO.CO.IN>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: cherish k <hawks_cherish@YAHOO.CO.IN>
Subject: Re: Finding Effectiveness of each variable
In-Reply-To: <23649738.1210587585352.JavaMail.root@mswamui-backed.atl.sa.earthlink.net>
Content-Type: text/plain; charset=iso-8859-1
Hi Peter,
The DV definition has changed a bit: the number of churners ranges between 100 and 1000, but I am scaling them by region. (Initially we planned to use change in usage as the DV, but now we are sticking to this.)
So this is what we are doing finally.
Counting churned customers at region and month level, then scaling the counts to mean = 0 and stdev = 1.
Counting the complaints at region and month level and scaling them too.
The only assumption we are making is independence of the complaints (it looks fair to me; for example, complaints about a bill not being sent have nothing to do with an SMS not being delivered. Some complaints may be related, but that might be OK for now).
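The counting and scaling steps above could be sketched roughly as follows. This is only an illustration: the input data set REGION_MONTH_COUNTS, the intermediate COUNTS_SORTED and the BY variable REGION are assumed names, and scaling within region is assumed from the description; only MODEL_DATA_STANDRD, CNT_CHRN and the cnt_COMP variables come from this message.

```sas
/* Sketch only: REGION_MONTH_COUNTS, COUNTS_SORTED and REGION are assumed names. */
/* One row per region and month, holding the churn count and complaint counts.   */
proc sort data=REGION_MONTH_COUNTS out=COUNTS_SORTED;
   by REGION;
run;

/* Standardize every variable to mean 0 and standard deviation 1 within region. */
proc standard data=COUNTS_SORTED mean=0 std=1 out=MODEL_DATA_STANDRD;
   by REGION;
   var CNT_CHRN cnt_COMP1-cnt_COMP100;
run;
```

With only 12 months per region, a complaint count that never varies within a region has zero standard deviation there, so it is worth checking the output for such columns before modelling.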
Now running GLMSELECT on the above set using LASSO.
Am getting the following result using the LASSO option.

Intercept     3.00E-16
cnt_COMP1     0.025884
cnt_COMP5     0.239857
cnt_COMP12    0.143161

Am getting the following result using stepwise proc reg.

Intercept     2.99E-16
cnt_COMP1     0.19852
cnt_COMP5     0.43173
cnt_COMP12    0.2791

I used the following GLMSELECT code
proc glmselect data=MODEL_DATA_STANDRD;
   model CNT_CHRN = cnt_COMP1-cnt_COMP100
         / selection=LAR;
run;
quit;
Now I have a few questions.
Which of the two sets of estimates is more reliable? I ask because the ratio of the estimates is huge in stepwise proc reg compared to GLMSELECT. This might be important because we might directly assign importance to the variable with the highest estimate.
But the R-square values are the same for both. I expected them to be different (probably I am missing something).
Also, out of the 100 variables passed in, LASSO selected only 3. Obviously it is using some stopping criterion. How can I relax it, i.e. how do I increase the bound s in the constraint sigma(|beta_i|) <= s?
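One possible way to explore these questions is sketched below, assuming the installed GLMSELECT release supports the STOP=, CHOOSE= and LSCOEFFS suboptions (they are documented for the production version of the procedure; availability in earlier experimental releases is an assumption). LASSO shrinks coefficients toward zero, so its estimates are expected to be smaller than least squares estimates for the same three variables. STOP=NONE lets the path run on instead of stopping early, CHOOSE= then picks a model along that path, and LSCOEFFS asks for least squares coefficients at each step.

```sas
/* Sketch only: option availability depends on the GLMSELECT release in use. */

/* Let the LASSO path run without an early stop, then choose the step */
/* with the best SBC; DETAILS=STEPS prints the model at every step.   */
proc glmselect data=MODEL_DATA_STANDRD;
   model CNT_CHRN = cnt_COMP1-cnt_COMP100
         / selection=lasso(stop=none choose=sbc) details=steps;
run;

/* Same path, but with least squares coefficients at each step, which */
/* makes the estimates directly comparable with those from PROC REG.  */
proc glmselect data=MODEL_DATA_STANDRD;
   model CNT_CHRN = cnt_COMP1-cnt_COMP100
         / selection=lasso(stop=none choose=sbc lscoeffs);
run;
```

A fixed number of steps can be requested instead with the STEPS= suboption, for example selection=lasso(steps=10).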
Regards,
Cherish
Peter Flom <peterflomconsulting@mindspring.com> wrote:
I am getting confused.
In the message immediately below, you imply that the DV is the number of people churning in a region, but in the message at the bottom, you say it is the amount of usage. If the DV is a count, then (unless it is pretty large) you probably want to look at a count regression model.
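A count regression of the kind mentioned here might look like the following minimal sketch. It assumes an unstandardized data set, here called REGION_MONTH_COUNTS (an assumed name), with raw churn and complaint counts; the three predictors are simply the ones reported in the reply above, not a recommended model.

```sas
/* Sketch only: REGION_MONTH_COUNTS is an assumed data set of raw counts. */
/* Negative binomial regression with a log link for the churn count;     */
/* DIST=POISSON is the simpler alternative when overdispersion is not    */
/* a concern.                                                            */
proc genmod data=REGION_MONTH_COUNTS;
   model CNT_CHRN = cnt_COMP1 cnt_COMP5 cnt_COMP12
         / dist=negbin link=log;
run;
```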
Then there's the problem of level --- It is people who churn, correct? By looking at data only at the regional level, I think you run into problems of ecological inference --- that is, you are trying to say something about how individuals behave from data on how groups of people behave. Dangerous.
Then there's the problem that you have multiple data points per region.... you mentioned that your boss is fine with assuming independence, but this seems dangerous, as well.
Finally (at least for now) there's the question of variable selection (although, really, all the other questions need to be answered first). GLMSELECT is a good tool for variable selection when a GLM model is appropriate. But I am not even sure you want a multiple variable solution at all .... what about evaluating each variable on its own?
HTH
Peter
cherish k wrote
>
>The task at hand for me is to rank complaints in order of their importance to the number of people churning in a region. But please note that I am not working at customer level. The reason is that we see hardly any complainants churn. So, treating complaints as a general level of discontent among customers, we want to identify which complaints have a strong relation with the # of people churning in a region.
>
>An initial test (using only a subset of complaints) yielded a pretty decent model (R-square = 0.49) with proc reg - stepwise (I know I shouldn't be using stepwise, but I used it for testing purposes and to see whether the hypothesis holds), and the complaints that entered the stepwise model also made sense.
>Since the results are promising I want to pursue this further. I read an article written by Peter and David suggesting the use of proc GLMSELECT as a better alternative to proc reg - stepwise, using its LASSO and LAR options.
>
>Can I use proc GLMSELECT in the current context or are there better alternatives?
>
>Regards,
>Cherish
>
>
>Arthur Tabachneck wrote: Cherish,
>
>I've been reading the discussion you and Peter have been having and, while
>it first sounded like a question of measuring variables' impact, it is
>starting to sound more like a classic churn question.
>
>Have you looked into possible data mining-type solutions, such as decision
>tree, logistic regression, or neural network modeling? In short, not
>looking to discover the contribution of each variable, but under which
>scenarios people are most likely to attrite.
>
>HTH,
>Art
>---------
>On Sun, 11 May 2008 08:13:56 +0100, cherish k
>wrote:
>
>>Hi Peter,
>>
>>Thanks for pointing to the article.
>>
>>From all the articles, what I could gather is that it's almost impossible to
>rank variables if there are more than 10 because of the computational
>infeasibility.
>>
>>But I somehow want to do the following. I have complaints data which has
>close to some 300 complaints altogether. I want to establish a
>correlation between people attriting and the complaints (not necessarily that
>the person complaining needs to attrite). So I am trying to accumulate the
>data at region level and also each complaint at region level.
>>
>>So I have for every month, region, number of people attrited, 300
>variables (complaints), each having the count of each complaint and I have
>data for 1 year time period (which in turn means 12 records per region).
>>
>>From this available information I want to know which are the top reasons
>because of which many people attrite.
>>This in turn requires me to know the weight (importance) of each
>variable, which I will multiply by the count of complaints for every
>month to see how the complaints are varying (doing) from month to month.
>>
>>One strict no - no method is stepwise regression. Are there any
>substitutes?
>>
>>Can you please point to any approximate method of what I want to do?
>>
>>Regards,
>>Cherish
>>
>>
>>Peter Flom
> wrote: Cherish
>>
>>Item 167824 in the SAS-L archives at http://www.lexjansen.com/sugi/
>>
>>or do the following google search
>>
>>cassell kruskal katz sas-l "relative importance"
>>
>>Peter
>>
>>
>>-----Original Message-----
>>>From: cherish k
>>>Sent: May 10, 2008 2:26 PM
>>>To: SAS-L@LISTSERV.UGA.EDU
>>>Subject: Re: Finding Effectiveness of each variable
>>>
>>>Can somebody please point me to the article written by David.
>>>
>>>Thanks
>>>Cherish
>>>
>>>Peter Flom
>> wrote: cherish k wrote
>>>>
>>>>I have a Stats related question.
>>>>
>>>>I have a dataset with variables (assume 5 IV's) already defined and the DV
>is the amount of usage at Region level (it is always >= 0). Information is
>collected monthly for each region (we have one year's data). So each
>region will have 12 entries in the data.
>>>>
>>>>Now through some means, I want to know which is the most significant
>variable out of all the given variables and also the weight of each
>variable contributing to the whole equation.
>>>>
>>>>To achieve this I have done the following.
>>>>
>>>>Since the variables are not scaled, I first Z transformed all the
>variables (including DV), so that they are all on the comparable scale
>(but Z transformation was done at each Region level). Then I ran a linear
>regression on all the variables (I have as of now run an intercept model,
>not sure if no intercept is better or not).
>>>>
>>>>Since the variables are all on comparable scale, can I take the
>estimates as the weights of each variable?
>>>>
>>>>Y = intercept + sigma(a(i)*x(i)), where a(i) is the estimate and x(i) is
>the variable
>>>>
>>>>So now, in the above equation, a(i) can be positive, negative or
>zero.
>>>>
>>>>So can I take the importance of each variable as abs(a(i)) and then
>rank order across the variables?
>>>>
>>>>If the method is wrong can somebody please suggest a way to do it.
>>>>
>>>>One obvious flaw in the above method is that I am assuming independence
>(which is ok as my boss is perfectly fine with it :-) )
>>>>
>>>>Are there any other problems in the method.
>>>>
>>>>Please help me. (if the method is totally wrong kindly tell me if there
>is an alternative way of doing?). I am doing it in SAS (so the proc's I
>use are proc standard and proc reg).
>>>>
>>>>If I am not clear about the problem, please let me know.
>>>>
>>>
Peter L. Flom, PhD
Statistical Consultant
www DOT peterflom DOT com