I do not know much about software for predictive analytics, but I can offer my two cents on some of the more general issues raised in your message:
Probabilities based on actuarial data, although often used to predict individual outcomes, are best understood in frequentist terms, i.e. as predicated on populations rather than individuals. In other words, when the outcome of a prediction is, say, a 0.6 probability of surviving 5 years after a heart attack, this simply means that 6 out of every 10 patients with such and such characteristics are predicted to survive 5 years. The fate of any individual patient with those characteristics, say Mr John Smith, remains radically indeterminate: he could die tomorrow or live for another 40 years. Any individual outcome for Mr Smith is compatible with the predicted 0.6 probability of surviving 5 years, which refers to the entire group that includes Mr Smith and other similar patients, but not to any individual in particular.
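The point is easy to see in a small simulation (a sketch, with an assumed survival probability of 0.6 and an arbitrary sample size): the population-level rate matches the predicted probability closely, while each individual outcome is simply "survived" or "did not".

```python
import random

random.seed(1)
p_survive = 0.6  # the model's predicted 5-year survival probability (assumed)
patients = [random.random() < p_survive for _ in range(10_000)]

# The population-level proportion tracks the prediction...
print(sum(patients) / len(patients))   # close to 0.6

# ...but any one patient's outcome is just True or False:
print(patients[0])
```

No individual result contradicts the 0.6 figure; only the aggregate frequency can.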
Regarding variance explained, the matter is similar. A particular model explains, say, 60% of the squared differences between individuals on a particular variable; the other 40% of variability is due to factors outside the model. A prediction produced by such a model, from a sample of given size, comes with an interval around it: roughly 95% of individual cases will fall within about +/- 2 standard errors of the predicted value (strictly speaking an interval meant to cover individual cases is a prediction interval, not a confidence interval for the mean; the standard error is a function of the residual variance and the sample size). A particular individual (our Mr Smith again) may fall inside or outside that interval: nothing prevents him from being miles away from the predicted curve. If the model prediction for patients like Mr Smith is a life expectancy of 8 years, Mr Smith himself may ultimately survive for another 40 years, or only a few hours: both are compatible with the prediction. But if you have 1000 patients like Mr Smith, and your study is sound, you may bet that about 95% of them will live for 8 years +/- 2 standard errors.
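A minimal sketch of that "about 95% within +/- 2 standard errors" claim, using a made-up linear model (the slope, intercept, and noise level are all assumptions for illustration):

```python
import random
import statistics

random.seed(42)

# Simulated data: y = 2 + 0.5*x + normal noise (all values assumed)
n = 1000
xs = [random.uniform(0, 10) for _ in range(n)]
ys = [2 + 0.5 * x + random.gauss(0, 1) for x in xs]

# Ordinary least squares fit, done by hand
mx, my = statistics.mean(xs), statistics.mean(ys)
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
a = my - b * mx

# Residuals and an approximate residual standard error
resid = [y - (a + b * x) for x, y in zip(xs, ys)]
se = statistics.stdev(resid)

# How many individual cases fall within +/- 2 SE of the fitted line?
coverage = sum(abs(r) <= 2 * se for r in resid) / n
print(coverage)   # roughly 0.95
```

Individual points scatter freely around the line; it is only the aggregate coverage that comes out near 95%.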
Now, all this is true of classical parametric analyses such as linear regression and related procedures. Predictive analytics often use techniques that are not fully parametric, such as the Cox proportional hazards model (strictly speaking a semi-parametric model). Not being fully parametric, these procedures do not assume normally distributed errors the way linear regression does. However, approximate measures of fit and statistical significance do exist even for these procedures, and statistical software packages usually provide them. One key point to remember is that complex models with many predictors require large samples to do a proper job with small margins of error. This is doubly true for non-parametric models, because their margins of error are harder to pin down and probably larger. Many empirical studies of this kind are based on small samples, and therefore their results can easily contradict those of other similar studies just by sampling fluke.
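The sample-size point can be made concrete with the simplest case, the standard error of a proportion, SE = sqrt(p(1-p)/n) (the 0.6 survival proportion below is an assumed figure):

```python
import math

p = 0.6  # estimated 5-year survival proportion (assumed for illustration)
for n in (50, 500, 5000):
    se = math.sqrt(p * (1 - p) / n)   # standard error of a proportion
    print(f"n={n:5d}  SE={se:.4f}  95% CI: {p - 2*se:.3f} to {p + 2*se:.3f}")
```

With n = 50 the interval spans roughly 0.46 to 0.74, wide enough that two small studies of the same question can easily land on opposite sides of any practical threshold.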
A final word: predictive models do not try to EXPLAIN behavior, but to PREDICT it. There is a huge difference.
Hector
Original Message
From: SPSSX(r) Discussion [mailto:SPSSX-L@LISTSERV.UGA.EDU] On Behalf Of Pirritano, Matthew
Sent: 22 September 2008 17:28
To: SPSSX-L@LISTSERV.UGA.EDU
Subject: Opinions about validity of Predictive Analytics programs?
Hello spssers,
I know that SPSS has a predictive analytics module. I've also been
exposed to predictive analytic programs that make use of actuarial data
to predict risk in healthcare settings. What do statisticians think of
these models? Let me explain my own motivation for asking the question.
As an experimental psychologist I have seen a lot of research that tries
to explain human behavior. This research is often held up as exemplary
if it can explain 40 or 60 percent of the variance in that behavior. I
think 60 percent in many areas is probably all but unheard of. Now how
do these predictive analytic techniques describe the degree of explained
variance? I asked someone who works for a statistical package software
company and they told me that there was nothing akin to r squared in
these packages. Not to mention the fact that the back end (actual
calculations) of these techniques is not realistically understandable to
99% of the individuals that use them. So somehow statisticians have
developed these incredibly accurate ways of predicting future behaviors,
while the field of psychology plows on unaware of these successes?
Seems unlikely.
To me it just seems like software companies are playing into the myth
that statistics can magically tell you what you want to increase your
profits. They present 'testimonials', the last refuge of a scoundrel, to
support their claims. Is this not a case of absolute power leading to
absolute corruption?
All joking aside, does anyone have an opinion about this? As a lowly
peon I'm not sure if my opinion is valid or if I'm missing something
basic.
Thanks
Matt
Matthew Pirritano, Ph.D.
Research Analyst IV
Orange County Health Care Agency
(714) 834-6011
=====================
To manage your subscription to SPSSX-L, send a message to
LISTSERV@LISTSERV.UGA.EDU (not to SPSSX-L), with no body text except the
command. To leave the list, send the command
SIGNOFF SPSSX-L
For a list of commands to manage subscriptions, send the command
INFO REFCARD
=====================
