LISTSERV at the University of Georgia
Menubar Imagemap
Home Browse Manage Request Manuals Register
Previous messageNext messagePrevious in topicNext in topicPrevious by same authorNext by same authorPrevious page (March 2007, week 2)Back to main SAS-L pageJoin or leave SAS-L (or change settings)ReplyPost a new messageSearchProportional fontNon-proportional font
Date:   Tue, 13 Mar 2007 13:55:09 -0400
Reply-To:   Sigurd Hermansen <HERMANS1@WESTAT.COM>
Sender:   "SAS(r) Discussion" <SAS-L@LISTSERV.UGA.EDU>
From:   Sigurd Hermansen <HERMANS1@WESTAT.COM>
Subject:   Re: Tall & Skinny, Short & Wide, or what ..
Comments:   To: "Bernard S. Zimmerman" <bernie@homer.att.com>
In-Reply-To:   <45F6A86C.9090302@homer.att.com>
Content-Type:   text/plain; charset="us-ascii"

Bernard: It's the 'one record per customer' syndrome that distinguishes 'short and fat' from 'tall and thin'. I wrote a 'Think Thin ...' SUGI presentation on that topic about a decade ago; others have emphasized the 'principle of verticality' as a good rule.

What difference does it make? Organization of data affects thinking about data. A customer represents only one dimension of data. In the old days of data processing, an input device read eighty columns off a card into a buffer that made field values readily available to plug into arithmetic and logical expressions. Accessing data from another card had a very high 'cost'. The same constraints have carried over, with little justification, to spreadsheets and statistics.

The relational database model introduced notions such as 3rd Normal Form and deprecated the practice of stuffing values representing different dimensions of a data model in one row of differently related data items. These new technologies are only slowly infiltrating the field of data analysis. Some data warehousing advocates have taken the idea of smaller sets of related date items one step farther to 'fact tables' in which several columns serve as indexes to very few, and possibly only one, columns of observed values. Dramatic improvements in indexing methods, file scanning speed, and memory availability favor these tall and thin data structures.

For your specific application it appears that you have a temporal dimension. How important do you see the monthly time series of bill amounts? Are you looking at seasonality, trends over more than one year, or time to event questions? The profile variables sound like permanent attributes of a person. Do they change over time? Different combinations of 100 permanent attributes will likely correlate tightly and thus provide less and less additional information as the number of attributes increases, for example, from 8 to 9. Unless you have strong theoretical reasons to expect a relation among attributes and an outcome of interest, observed impact of patterns of attributes or changes in temporal values may have a better chance of arising by accident than as a consequent of a valid and reliable dependency of outcome on variables.

A typical data mining exercise would involve searching for quasi-functional dependencies of outcomes to personal and temporal attributes; that is, patterns of demographic, economic, or other observed characteristics of individuals that map predominately to one value of an outcome, perhaps mediated by changes over time in patterns of attributes and outcomes. For example, the rate of disconnects may be much higher for higher income males under twenty in California during July than in other months. A simple summary could show whether disconnect_type=f(income_level,gender,state,month), aka 'surfer dude effect', actually shows up in observed data:

select income_level,gender,state,month,disconnect_type,count(*) as count group by income_level,gender,state,month ;

If counts for a specific disconnect type far exceed the expected count for that disconnect_type, then the effect remains plausible. Ruling out false discoveries remains a challenge, nonetheless, so it may help to develop a parametric statistical model to estimate the relative influence of different variables on disconnects. At that point it may make sense to transpose a relation into a form that better fits a statistical modelling procedure: say, GLM, ARMAX, or MIXED.

I'd advise that you step back and consider what you are trying to do before committing to any one data structure. I've found that "tall and thin" structures tend to be easier to transform correctly into "short and fat" structures as needed. You are really looking for data organization that will help you think intelligently about how data can help you answer your questions. No reason to avoid exploring different possibilities. S

-----Original Message----- From: owner-sas-l@listserv.uga.edu [mailto:owner-sas-l@listserv.uga.edu] On Behalf Of Bernard S. Zimmerman Sent: Tuesday, March 13, 2007 9:35 AM To: SAS-L@LISTSERV.UGA.EDU Subject: Re: Tall & Skinny, Short & Wide, or what ..

What I was trying to say, not too well apparently, is the following .. the large file contains one record per customer.

Ultimately, I am after one record per customer/phone number, containing the 100 profile variables including driver file variables, plus monthly bill amounts, for example: amt_july, amt_aug, amt_sep, amt_oct, amt_nov & amt_dec. Seems like arrays come in handy here. If this involves transposing the data, perhaps, would you have any ideas how this might be achieved?

Thanks again,

Bernie Zimmerman bernie@homer.att.com 973.236.6718

Jack Clark wrote: > Bernard, > > Is the larger dataset one record per phone number or does each month's

> bill amount have its own observation? I'm trying to understand the > issue you described of having to string out bill amount - month 1 - > bill amount - month 12. > > What type of further processing do you have to do with the merged > file? > > > Jack Clark > Research Analyst > Center for Health Program Development and Management University of > Maryland, Baltimore County > > > > -----Original Message----- > From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of > Bernard S. Zimmerman > Sent: Tuesday, March 13, 2007 8:41 AM > To: SAS-L@LISTSERV.UGA.EDU > Subject: Tall & Skinny, Short & Wide, or what .. > > Listers: > > I come to you once again, in need of advice. > > Here is my problem (opportunity). We have a monthly file that > profiles our customers. It consists of about 7 - 8 million records > with 300+ variables. (So far so good). One of the variables is > called "disconnect indicator" which contains a code for why or if a > customer's service has been disconnected, in the past 12 months. In > addition, we have been given a 'driver' file of roughly 25K phone > numbers of customers, disconnected because they say the service costs > too much. The driver file contains a transaction disconnect date along

> with 5 or so other variables we'd like to keep. > > What we need to do is match the driver file to the monthly file by 10 > digit phone number and keep roughly 100 or so of the 300+ variables > mentioned earlier. So far so good except .. if/when I match either > with a data step MERGE or Proc SQL (where I could use lots of help) it

> seems I would end up with a short and wide dataset, which is supposed > to be bad for further processing. In addition, what we need to come > up with is a way to string out variables so that a record contains for

> example: phone number, var2, var3, var4, var5, var6 from the driver > file along with -- bill amount - month1, bill amount - month2, ... > bill amount month12, from the big dataset. > > Would someone be kind enough to share any ideas you might have, > pseudocode or code snippets to help me solve this assignment? > > Of course, many thanks in advance. > > Bernie Zimmerman > bernie@homer.att.com > 973.236.6718


Back to: Top of message | Previous page | Main SAS-L page