Saturday, January 17, 2009

Statistical Learning for Predictive Modeling

I have been applying boosted regression tree models in a study whose goal is to identify patients with chronic conditions who might benefit from additional care.

It is natural to ask: why use an unfamiliar method like this when logistic regression is the standard, widely used approach? This entry is my first attempt at addressing this question in writing - my apologies, it is likely to fall far short of the mark! Eventually, I hope to "get it right" and be able to refer back to a clear exposition. For now, here goes:

First, it helps to know more about the prediction problem at hand. An expert medical/health services team has assembled a dataset consisting of available diagnostic and health care indicators for a population of over 3.5 million patients. There are over 450 variables in this set. The event to be predicted is hospitalization or death that is "potentially preventable" - so, for example, hospital admissions for childbirth, joint replacement, etc., and deaths due to automobile accidents and the like are removed.

A standard approach to this problem would be an additive logistic regression model, possibly testing for the inclusion of interaction terms. With more than 3.5 million observations, there is plenty of data for model fitting and cross-validation - even with 450 predictors.
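To make that baseline concrete, here is a minimal sketch of such an additive logistic model in Python (scikit-learn assumed available; the data below are random stand-ins, not the study dataset - the column count is the only thing they share with it):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data: one row per patient, one column per indicator.
# (The real dataset has over 3.5 million rows and 450 columns.)
rng = np.random.default_rng(0)
X = rng.random((10_000, 450))
y = (rng.random(10_000) < 0.05).astype(int)  # 1 = potentially preventable event

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Additive model: each predictor contributes its own term to the log-odds.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```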

But there is a bigger challenge to consider: with this many potentially strong predictors of a patient's health, how do we handle the selection of interactions between predictors - cases where the presence of two indicators simultaneously has an impact on risk far greater than the sum of their individual contributions? Common sense and the medical literature tell us that health hazards escalate rapidly with age and with certain combinations of conditions and characteristics. And with 450 predictors there are 450*449/2 = 101,025 pairwise interactions alone, before any higher-order combinations are considered. It is a daunting task to consider exploring the 'space of all possible models' to select the single 'best' model for this prediction problem.
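To see the scale of the problem in code, here is a small sketch continuing with the stand-in X above (scikit-learn's PolynomialFeatures generates the pairwise products):

```python
from sklearn.preprocessing import PolynomialFeatures

# interaction_only=True adds every pairwise product x_i * x_j:
# the 450 original columns plus 450*449/2 = 101,025 interaction columns.
pairs = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_pairs = pairs.fit_transform(X[:100])  # a small slice, just to show the width
print(X_pairs.shape)  # (100, 101475)
```

And that is before any three-way combinations, which would number in the millions.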

But from a statistical learning perspective, this wealth of potential predictors is ideal. The approach taken by these methods is to identify predictive structure from the data itself - to construct a model from the data, rather than fit a predetermined model structure to it. Without going into the details here (they will follow in another post), it is intriguing to ask whether a statistical learning method might have advantages over the standard approach for this modeling problem. Can a method that learns by trial and error from the data outperform the classic statistical approach, for a prediction problem like this?
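As a preview, here is a minimal sketch of a boosted regression tree classifier (scikit-learn's GradientBoostingClassifier; the parameters below are illustrative, not the settings used in the study). The point to notice is that small trees discover interactions on their own: a depth-3 tree can split on up to three different variables along a path, so three-way interactions are captured without anyone specifying them in advance.

```python
from sklearn.ensemble import GradientBoostingClassifier

# Each stage fits a small regression tree to the shortcomings of the
# model so far, so the model structure is built up from the data itself.
gbm = GradientBoostingClassifier(
    n_estimators=200,    # number of small trees added in sequence
    learning_rate=0.05,  # shrinkage applied to each tree's contribution
    max_depth=3,         # depth-3 trees capture up to 3-way interactions
)
gbm.fit(X_train, y_train)
print("held-out accuracy:", gbm.score(X_test, y_test))
```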

The jury is still out in the statistics and business intelligence communities - there is a helpful discussion of traditional statistics (TS) versus machine learning (ML) approaches at DMReview.com in an article titled "Statistical Learning for BI, Part 1".

And this concludes my initial motivating argument for trying a statistical learning approach to prediction for this problem: it solves the problem of model specification, in that it builds its own model from the data. How is this accomplished? Check back - I will do my best to explain this in my next post.
