LISTSERV at the University of Georgia
Date:         Wed, 1 Sep 2010 13:30:05 -0700
Reply-To:     Alex Tang <Alex.Tang@CREDITONE.COM>
Sender:       "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:         Alex Tang <Alex.Tang@CREDITONE.COM>
Subject:      Re: interpreting clusters
Comments: To: "Thevenet-Morrison, Kelly"
          <Kelly_Thevenet-morrison@URMC.Rochester.edu>
In-Reply-To:  <0D6BED2F7A98414697EA36B4EC1E3C1601601E9C06@URMCMS9.urmc-sh.rochester.edu>
Content-Type: text/plain; charset="us-ascii"

I don't have SAS EM. I have the SPSS AnswerTree at hand. I have been doing some customer segmentation projects with it and it's pretty easy to work with. So, if there is not some kind of fundamental flaw behind this CART analysis method, I might just give it a shot.

-----Original Message-----
From: Thevenet-Morrison, Kelly [mailto:Kelly_Thevenet-morrison@URMC.Rochester.edu]
Sent: Wednesday, September 01, 2010 12:14 PM
To: Alex Tang
Subject: RE: interpreting clusters
Sensitivity: Confidential

I was thinking about CART or CHAID, but I wasn't sure what program you had. If you have Enterprise Miner, you could easily create a decision tree with your segments.

-----Original Message-----
From: Alex Tang [mailto:Alex.Tang@creditone.com]
Sent: Wednesday, September 01, 2010 3:05 PM
To: Thevenet-Morrison, Kelly
Cc: SAS-L@LISTSERV.UGA.EDU
Subject: RE: interpreting clusters
Sensitivity: Confidential

Kelly, that's a very good point in your reply about taking care of missing data and outliers. I think how I handle it depends on the nature and scale of the missing data. If missing values are rare, I might just delete those observations to keep things easy. Otherwise, either using a filler value or creating a dummy indicator for missing data would be the way to go.

To profile clusters, besides checking each variable's mean/range, what would you think of running a simple CART analysis using cluster # as the DV and the profile variables as IVs? When a node in the decision tree is dominated by one cluster and contains the majority of that cluster, it could give an idea of what those customers look like and what distinguishes them from others. I have never done that before, so this may sound a bit silly or contain a big flaw...
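
A minimal sketch of that idea in SAS, for readers without Enterprise Miner or AnswerTree. It assumes PROC HPSPLIT is available (as far as I know it arrived in SAS releases later than this thread, so treat its availability as an assumption). The data set name work.clustered and the variables age, income, and balance are hypothetical; cluster is the cluster number appended by the clustering step.

```sas
/* Classification tree with the cluster number as the target (DV)  */
/* and the profile variables as inputs (IVs).                      */
/* work.clustered, age, income, balance are hypothetical names.    */
proc hpsplit data=work.clustered maxdepth=3;
  class cluster;                        /* nominal target           */
  model cluster = age income balance;   /* DV = IVs                 */
  grow gini;                            /* split criterion          */
  prune costcomplexity;                 /* prune back the full tree */
run;
```

A shallow tree (small maxdepth) keeps the rules short enough to read as a profile, such as "age above some cutoff and income below some cutoff". The split points it finds are descriptive, not the definition the clustering itself used.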

Thanks, Alex

-----Original Message-----
From: Thevenet-Morrison, Kelly [mailto:Kelly_Thevenet-morrison@URMC.Rochester.edu]
Sent: Wednesday, September 01, 2010 11:15 AM
To: Alex Tang
Subject: RE: interpreting clusters
Sensitivity: Confidential

You could start with a smaller subset of variables that you think are important and build from there. Can you combine some of the variables that are binary? Do you have any missing data? I believe that you may need to make a filler for the missing values; otherwise the observation will be deleted. What you could do is build a frequency program in order to create your profile descriptors. So set up a format for age range, income range, credit history, etc.

proc format;
  value ager 25-35='25-35' 36-45='36-45' /* etc. */;
  /* same for income and your other variables */
run;

Run a proc freq using your formats by your segment and output it to a dataset or to excel. It is kind of like a quick snapshot to see how different the distributions are from each other. This would provide you an easier way to create cutoffs.
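
The format-plus-frequency step above can be sketched as follows. This is only an illustration under assumptions: the data set work.clustered, the variable age, and the extra ranges beyond the two given in the message are all hypothetical, and age is assumed to be recorded in whole years.

```sas
/* Hypothetical age bands; only 25-35 and 36-45 come from the message. */
proc format;
  value ager
    25-35   = '25-35'
    36-45   = '36-45'
    46-high = '46+'
    other   = 'Under 25 or missing';
run;

/* Cross-tabulate segment by banded age and save the counts and  */
/* percentages to a data set instead of printing them.           */
proc freq data=work.clustered noprint;
  tables cluster*age / out=work.age_by_cluster outpct;
  format age ager.;
run;
```

In work.age_by_cluster, the PCT_ROW column gives the share of each cluster that falls in each age band, which is the snapshot to compare across segments. The data set can then be exported to Excel if that is easier to review.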

You could also use proc univariate with the outtable option, using segment as your class variable. This will show distributions for continuous variables. What is nice about proc freq is that you can use it for your categorical variables as well as continuous ones, and you can see what the differences are across the segments.

How many observations do you have? If a segment pops out that is relatively small, check for outliers in your data.

Hope that helps.

K

-----Original Message-----
From: Alex Tang [mailto:Alex.Tang@creditone.com]
Sent: Wednesday, September 01, 2010 1:58 PM
To: Thevenet-Morrison, Kelly
Cc: SAS-L@LISTSERV.UGA.EDU
Subject: RE: interpreting clusters
Sensitivity: Confidential

Kelly, thanks for your input. I am also thinking of this kind of manual check and comparison.

This way, we will have to pull the mean (and maybe also the range?) of all the available variables for each cluster and compare them across all the clusters, right?

Now the challenge is to decide which combination of variables (and cutoff points) to use for defining the clusters. Say cluster A has an average age of 55 and all other clusters average 38; I would think age could be one of the variables that define cluster A. But what age should I use as the cutoff value for cluster A? I think it could be anywhere between 38 and 55, couldn't it? Besides age, I will need to go through the other variables to find a possible cut that distinguishes cluster A from the others, right? When it comes to multiple variables, is there anything I should watch out for?

-----Original Message-----
From: Thevenet-Morrison, Kelly [mailto:Kelly_Thevenet-morrison@URMC.Rochester.edu]
Sent: Wednesday, September 01, 2010 10:47 AM
To: Alex Tang
Subject: RE: interpreting clusters
Sensitivity: Confidential

If there is an output statement in fastclus to append the clusters to your data you could create quick profiles to test differences for your continuous variables using proc means with a class statement - your segment or cluster number would be your class variable. See how different they are. In the past I started with that and combined clusters if they were not that different from one another.
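
For reference, PROC FASTCLUS does have an OUT= option that writes the input observations with a CLUSTER variable appended. A minimal sketch of the profiling step described above, assuming a hypothetical input data set work.customers, hypothetical variables age, income, and balance, and an arbitrary choice of 5 clusters:

```sas
/* Assign each customer to a cluster; OUT= keeps the original     */
/* variables and adds CLUSTER and DISTANCE.                       */
proc fastclus data=work.customers maxclusters=5
              out=work.clustered noprint;
  var age income balance;
run;

/* Quick profile: summary statistics of each continuous variable  */
/* within each cluster.                                           */
proc means data=work.clustered n mean std min max maxdec=2;
  class cluster;
  var age income balance;
run;
```

Clusters whose rows look nearly identical in this table are the candidates for combining.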

Kelly

Kelly Thevenet-Morrison MS
Lead Programmer Analyst
Department of Community and Preventive Medicine
University of Rochester School of Medicine and Dentistry
601 Elmwood Ave., box 644
Rochester, NY 14642
Phone: 585-275-1817
e-mail: kelly_thevenet-morrison@urmc.rochester.edu

-----Original Message-----
From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of Alex Tang
Sent: Wednesday, September 01, 2010 1:37 PM
To: SAS-L@LISTSERV.UGA.EDU
Subject: interpreting clusters
Sensitivity: Confidential

We are segmenting our customers based on some profiles. Since we don't have an explicit target right now, we have decided to do cluster analysis for the time being. There are about 2MM customers with 50-100 variables.

I understand that for such a big data set, I should first run PROC FASTCLUS to get a relatively large number of preliminary clusters, say 100, and then feed them into PROC CLUSTER for further analysis. If necessary, a factor analysis or principal component analysis may be needed up front as well.
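
A minimal sketch of that two-stage approach, under assumptions: the data set work.customers and the variables age, income, and balance are hypothetical, Ward's method is just one possible choice, and in practice the variables would likely need standardizing first.

```sas
/* Stage 1: reduce ~2MM customers to 100 preliminary clusters.     */
/* MEAN= holds one row per preliminary cluster, including the      */
/* _FREQ_ and _RMSSTD_ variables used in stage 2.                  */
proc fastclus data=work.customers maxclusters=100 maxiter=20
              mean=work.prelim out=work.stage1 noprint;
  var age income balance;
run;

/* Stage 2: hierarchical clustering of the 100 cluster means,      */
/* weighted by cluster size and within-cluster spread.             */
proc cluster data=work.prelim method=ward outtree=work.tree noprint;
  var age income balance;
  freq _freq_;
  rmsstd _rmsstd_;
run;
```

The OUTTREE= data set can then be cut at a chosen number of clusters (for example with PROC TREE) and mapped back to customers through the CLUSTER variable in work.stage1.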

Where I am confused is this: suppose I get a final set of clusters, how do I interpret them? It would be nice if I could describe the clusters based on the profiles of the customers. E.g. cluster A is the customers older than 40 with an annual income less than 50k... something like that.

Or, for interpretation purposes, should I turn to an approach other than cluster analysis? Either way, please advise. Thank you.

******************* E-mail non-disclosure ******************

The information contained in this e-mail message may be proprietary and/or confidential, and protected from disclosure. If the reader of this message is not the intended recipient, or an employee or agent responsible for delivering this message to the intended recipient, you are hereby notified that any dissemination, distribution or copying of this communication is strictly prohibited. If you have received this communication in error, please notify Credit One Bank immediately by replying to this message and delete the original message. Thank you.
