Date: Wed, 1 Sep 2010 13:30:05 -0700
Reply-To: Alex Tang <Alex.Tang@CREDITONE.COM>
Sender: "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From: Alex Tang <Alex.Tang@CREDITONE.COM>
Subject: Re: interpreting clusters
In-Reply-To: <0D6BED2F7A98414697EA36B4EC1E3C1601601E9C06@URMCMS9.urmc-sh.rochester.edu>
Content-Type: text/plain; charset="us-ascii"
I don't have SAS EM. I have the SPSS AnswerTree at hand. I have been
doing some customer segmentation projects with it and it's pretty easy
to work with. So, if there is not some kind of fundamental flaw behind
this CART analysis method, I might just give it a shot.
-----Original Message-----
From: Thevenet-Morrison, Kelly
[mailto:Kelly_Thevenet-morrison@URMC.Rochester.edu]
Sent: Wednesday, September 01, 2010 12:14 PM
To: Alex Tang
Subject: RE: interpreting clusters
Sensitivity: Confidential
I was thinking about CART or CHAID, but I wasn't sure what program you
had. If you have Enterprise Miner, you could easily create a decision
tree with your segments.
-----Original Message-----
From: Alex Tang [mailto:Alex.Tang@creditone.com]
Sent: Wednesday, September 01, 2010 3:05 PM
To: Thevenet-Morrison, Kelly
Cc: SAS-L@LISTSERV.UGA.EDU
Subject: RE: interpreting clusters
Sensitivity: Confidential
Kelly, you make a very good point about taking care of missing data and
outliers. How I handle missing data depends on its nature and scale. If
missing values are rare, I might just end up deleting those observations
to keep things easy. Otherwise, either creating a filler for them or
making a dummy indicator for missing data would be the way to go.
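A minimal sketch of the indicator-plus-filler option, assuming a
hypothetical data set work.customers with a numeric variable income (both
names are illustrative, not from the actual project). The median is just
one possible filler:

```sas
/* Hypothetical names: customers, income */
data customers_flagged;
   set customers;
   /* 1 when income is missing, 0 otherwise */
   income_missing = missing(income);
run;

/* REPONLY replaces only the missing values (here with the median)
   and leaves the non-missing values unstandardized */
proc stdize data=customers_flagged out=customers_filled
            reponly method=median;
   var income;
run;
```

Keeping the indicator alongside the filled variable preserves the
information that the value was originally missing.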
To profile clusters, besides checking each variable's mean/range, what
would you think of running a simple CART analysis with cluster # as the
DV and the profile variables as IVs? When a node in the decision tree is
dominated by one cluster, the splits leading to that node could give an
idea of what those customers look like and what distinguishes them from
the others. I have never done that before, so this may sound a bit silly
or contain a big flaw...
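For illustration only, here is how such a tree might be grown in SAS
itself. This assumes a SAS/STAT release recent enough to include PROC
HPSPLIT, and a hypothetical data set clustered holding the cluster
number plus made-up profile variables age, income, balance, and region:

```sas
/* Hypothetical names: clustered, cluster, age, income, balance, region */
proc hpsplit data=clustered maxdepth=4;
   /* Listing cluster in CLASS makes it a nominal target */
   class cluster region;
   model cluster = age income balance region;
   grow entropy;
   prune costcomplexity;
run;
```

A shallow tree (small maxdepth) keeps the resulting rules short enough
to read as cluster descriptions.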
Thanks,
Alex
-----Original Message-----
From: Thevenet-Morrison, Kelly
[mailto:Kelly_Thevenet-morrison@URMC.Rochester.edu]
Sent: Wednesday, September 01, 2010 11:15 AM
To: Alex Tang
Subject: RE: interpreting clusters
Sensitivity: Confidential
You could start with a smaller subset of variables that you think are
important and build from there. Can you combine some of the variables
that are binary? Do you have any missing data? I believe that you may
need to make a filler for the missing values; otherwise the observation
will be deleted.
What you could do is build a frequency program in order to create your
profile descriptors. So set up a format for age range, income range,
credit history, etc.
proc format;
  value ager
    25-35 = '25-35'
    36-45 = '36-45'
    /* etc. */
  ;
  /* same for income and your other variables */
run;
Run a proc freq using your formats by your segment and output it to a
dataset or to Excel. It is kind of like a quick snapshot to see how
different the distributions are from each other. This would give you
an easier way to create cutoffs.
You could also use proc univariate with the outtable= option, using
segment as your class variable. This will show distributions for your
continuous variables.
What is nice about the proc freq is that you can use it for your
categorical variables as well as continuous ones, and you can see what
the differences are across the segments.
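A rough sketch of both steps, assuming a hypothetical data set clustered
that contains the cluster number in a variable named cluster, an integer
age, and made-up continuous variables income and balance:

```sas
/* Hypothetical names: clustered, cluster, age, income, balance */
proc format;
   value ager
      low -< 25 = 'Under 25'
      25 - 35   = '25-35'
      36 - 45   = '36-45'
      46 - high = '46+';
run;

/* Age bands by cluster; OUT= saves the counts to a data set */
proc freq data=clustered;
   tables cluster*age / nocol nopercent out=age_by_cluster;
   format age ager.;
run;

/* One row of summary statistics per variable and cluster */
proc univariate data=clustered outtable=cont_profile noprint;
   class cluster;
   var income balance;
run;
```

The row percentages from the proc freq show what share of each cluster
falls in each age band, which is where cutoffs tend to become visible.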
How many observations do you have? If a segment pops out that is
relatively small, check for outliers in your data.
Hope that helps.
K
-----Original Message-----
From: Alex Tang [mailto:Alex.Tang@creditone.com]
Sent: Wednesday, September 01, 2010 1:58 PM
To: Thevenet-Morrison, Kelly
Cc: SAS-L@LISTSERV.UGA.EDU
Subject: RE: interpreting clusters
Sensitivity: Confidential
Kelly, thanks for your input. I was also thinking of this kind of manual
check and comparison.
This way, we will have to pull the mean (and maybe also the range?) of
every available variable for each cluster and compare across all the
clusters, right?
Now the challenge is to decide which combination of variables (and
cutoff points) to use for defining the clusters. Say cluster A has an
average age of 55 and all the other clusters average 38; I would think
age could be one of the variables that define cluster A. But what age
should I use as the cutoff value for cluster A? It could be anywhere
between 38 and 55, couldn't it? Besides age, I will need to go through
the other variables to look for cuts that distinguish cluster A from the
others, right? When it comes to multiple variables, is there anything I
should watch out for?
-----Original Message-----
From: Thevenet-Morrison, Kelly
[mailto:Kelly_Thevenet-morrison@URMC.Rochester.edu]
Sent: Wednesday, September 01, 2010 10:47 AM
To: Alex Tang
Subject: RE: interpreting clusters
Sensitivity: Confidential
If you use the out= option in fastclus to append the clusters to
your data, you could create quick profiles to test differences for your
continuous variables using proc means with a class statement - your
segment or cluster number would be your class variable. See how
different they are. In the past I started with that and combined
clusters if they were not that different from one another.
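Something along these lines, as a sketch only. The data set customers,
the variables age, income, and balance, and the choice of 10 clusters
are all placeholders:

```sas
/* Hypothetical names: customers, age, income, balance */
/* OUT= writes the input data plus CLUSTER and DISTANCE variables */
proc fastclus data=customers maxclusters=10 out=clustered noprint;
   var age income balance;
run;

/* Quick profile of the continuous variables by cluster */
proc means data=clustered n mean std min max;
   class cluster;
   var age income balance;
run;
```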
Kelly
Kelly Thevenet-Morrison MS
Lead Programmer Analyst
Department of Community and Preventive Medicine
University of Rochester School of Medicine and Dentistry
601 Elmwood Ave., box 644
Rochester, NY 14642
Phone: 585-275-1817
e-mail: kelly_thevenet-morrison@urmc.rochester.edu
-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
Alex Tang
Sent: Wednesday, September 01, 2010 1:37 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: interpreting clusters
Sensitivity: Confidential
We are segmenting our customers based on some profiles. Since we don't
have an explicit target right now, we have decided to do cluster
analysis for the time being. There are about 2MM customers with 50-100
variables.
I understand that for such a big data set, I should first run PROC
FASTCLUS to get a relatively large number of preliminary clusters, say
100, then pass them to PROC CLUSTER for further analysis. If necessary,
a factor analysis or principal component analysis could be done up front
as well.
Where I am confused is this: suppose I get a final set of clusters, how
do I interpret them? It would be nice if I could describe the clusters
based on the profiles of the customers, e.g. cluster A is the customers
older than 40 with an annual income of less than 50k... something like
that.
Or, for interpretation purposes, should I turn to an approach other than
cluster analysis? Either way, please advise. Thank you.
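For reference, the two-stage approach described above might be sketched
as follows. The data set customers, the variables age, income, and
balance, and the final count of 5 clusters are placeholders, not details
of the actual project:

```sas
/* Stage 1: about 100 preliminary clusters from the full data set.
   CLUSTER=preclus names the preliminary cluster variable. */
proc fastclus data=customers maxclusters=100
              mean=prelim_means out=prelim cluster=preclus noprint;
   var age income balance;
run;

/* Stage 2: hierarchical clustering of the preliminary cluster means.
   PROC CLUSTER uses _FREQ_ and _RMSSTD_ from the MEAN= data set. */
proc cluster data=prelim_means method=ward outtree=tree noprint;
   var age income balance;
   copy preclus;
run;

/* Cut the tree at a chosen number of final clusters (5 is arbitrary) */
proc tree data=tree nclusters=5 out=final_map noprint;
   copy preclus;
run;

/* Map each customer to a final cluster via its preliminary cluster */
proc sort data=final_map; by preclus; run;
proc sort data=prelim;    by preclus; run;

data customers_final;
   merge prelim final_map(keep=preclus cluster);
   by preclus;
run;
```

In customers_final, preclus is the preliminary FASTCLUS cluster and
cluster is the final cluster from the tree.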
******************* E-mail non-disclosure ******************
The information contained in this e-mail message may be proprietary and/or
confidential, and protected from disclosure. If the reader of this message
is not the intended recipient, or an employee or agent responsible for
delivering this message to the intended recipient, you are hereby notified
that any dissemination, distribution or copying of this communication is
strictly prohibited. If you have received this communication in error,
please notify Credit One Bank immediately by replying to this message and
delete the original message. Thank you.