|
Bernard:
It's the 'one record per customer' syndrome that distinguishes 'short
and fat' from 'tall and thin'. I wrote a 'Think Thin ...' SUGI
presentation on that topic about a decade ago; others have emphasized
the 'principle of verticality' as a good rule.
What difference does it make? Organization of data affects thinking
about data. A customer represents only one dimension of data. In the old
days of data processing, an input device read eighty columns off a card
into a buffer that made field values readily available to plug into
arithmetic and logical expressions. Accessing data from another card had
a very high 'cost'. The same constraints have carried over, with little
justification, to spreadsheets and statistics.
The relational database model introduced notions such as 3rd Normal Form
and deprecated the practice of stuffing values representing different
dimensions of a data model in one row of differently related data items.
These new technologies are only slowly infiltrating the field of data
analysis. Some data warehousing advocates have taken the idea of smaller
sets of related date items one step farther to 'fact tables' in which
several columns serve as indexes to very few, and possibly only one,
columns of observed values. Dramatic improvements in indexing methods,
file scanning speed, and memory availability favor these tall and thin
data structures.
For your specific application it appears that you have a temporal
dimension. How important do you see the monthly time series of bill
amounts? Are you looking at seasonality, trends over more than one year,
or time to event questions? The profile variables sound like permanent
attributes of a person. Do they change over time? Different combinations
of 100 permanent attributes will likely correlate tightly and thus
provide less and less additional information as the number of attributes
increases, for example, from 8 to 9. Unless you have strong theoretical
reasons to expect a relation among attributes and an outcome of
interest, observed impact of patterns of attributes or changes in
temporal values may have a better chance of arising by accident than as
a consequent of a valid and reliable dependency of outcome on variables.
A typical data mining exercise would involve searching for
quasi-functional dependencies of outcomes to personal and temporal
attributes; that is, patterns of demographic, economic, or other
observed characteristics of individuals that map predominately to one
value of an outcome, perhaps mediated by changes over time in patterns
of attributes and outcomes. For example, the rate of disconnects may be
much higher for higher income males under twenty in California during
July than in other months. A simple summary could show whether
disconnect_type=f(income_level,gender,state,month), aka 'surfer dude
effect', actually shows up in observed data:
select income_level,gender,state,month,disconnect_type,count(*) as
count group by income_level,gender,state,month
;
If counts for a specific disconnect type far exceed the expected count
for that disconnect_type, then the effect remains plausible. Ruling out
false discoveries remains a challenge, nonetheless, so it may help to
develop a parametric statistical model to estimate the relative
influence of different variables on disconnects. At that point it may
make sense to transpose a relation into a form that better fits a
statistical modelling procedure: say, GLM, ARMAX, or MIXED.
I'd advise that you step back and consider what you are trying to do
before committing to any one data structure. I've found that "tall and
thin" structures tend to be easier to transform correctly into "short
and fat" structures as needed. You are really looking for data
organization that will help you think intelligently about how data can
help you answer your questions. No reason to avoid exploring different
possibilities.
S
-----Original Message-----
From: owner-sas-l@listserv.uga.edu [mailto:owner-sas-l@listserv.uga.edu]
On Behalf Of Bernard S. Zimmerman
Sent: Tuesday, March 13, 2007 9:35 AM
To: SAS-L@LISTSERV.UGA.EDU
Subject: Re: Tall & Skinny, Short & Wide, or what ..
What I was trying to say, not too well apparently, is the following ..
the large file contains one record per customer.
Ultimately, I am after one record per customer/phone number, containing
the 100 profile variables including driver file variables, plus monthly
bill amounts, for example: amt_july, amt_aug, amt_sep, amt_oct, amt_nov
& amt_dec. Seems like arrays come in handy here. If this involves
transposing the data, perhaps, would you have any ideas how this might
be achieved?
Thanks again,
Bernie Zimmerman
bernie@homer.att.com
973.236.6718
Jack Clark wrote:
> Bernard,
>
> Is the larger dataset one record per phone number or does each month's
> bill amount have its own observation? I'm trying to understand the
> issue you described of having to string out bill amount - month 1 -
> bill amount - month 12.
>
> What type of further processing do you have to do with the merged
> file?
>
>
> Jack Clark
> Research Analyst
> Center for Health Program Development and Management University of
> Maryland, Baltimore County
>
>
>
> -----Original Message-----
> From: SAS(r) Discussion [mailto:SAS-L@LISTSERV.UGA.EDU] On Behalf Of
> Bernard S. Zimmerman
> Sent: Tuesday, March 13, 2007 8:41 AM
> To: SAS-L@LISTSERV.UGA.EDU
> Subject: Tall & Skinny, Short & Wide, or what ..
>
> Listers:
>
> I come to you once again, in need of advice.
>
> Here is my problem (opportunity). We have a monthly file that
> profiles our customers. It consists of about 7 - 8 million records
> with 300+ variables. (So far so good). One of the variables is
> called "disconnect indicator" which contains a code for why or if a
> customer's service has been disconnected, in the past 12 months. In
> addition, we have been given a 'driver' file of roughly 25K phone
> numbers of customers, disconnected because they say the service costs
> too much. The driver file contains a transaction disconnect date along
> with 5 or so other variables we'd like to keep.
>
> What we need to do is match the driver file to the monthly file by 10
> digit phone number and keep roughly 100 or so of the 300+ variables
> mentioned earlier. So far so good except .. if/when I match either
> with a data step MERGE or Proc SQL (where I could use lots of help) it
> seems I would end up with a short and wide dataset, which is supposed
> to be bad for further processing. In addition, what we need to come
> up with is a way to string out variables so that a record contains for
> example: phone number, var2, var3, var4, var5, var6 from the driver
> file along with -- bill amount - month1, bill amount - month2, ...
> bill amount month12, from the big dataset.
>
> Would someone be kind enough to share any ideas you might have,
> pseudocode or code snippets to help me solve this assignment?
>
> Of course, many thanks in advance.
>
> Bernie Zimmerman
> bernie@homer.att.com
> 973.236.6718
|