Saturday, January 17, 2009

Regression Problems - a catalog

There is a nice chapter on Regression Problems in a work-in-progress guide aptly titled "Statistics with R", by Vincent Zoonekynd. It is really a collection of notes that the author wrote for himself, much like this blog! I find it very handy - despite the author's disclaimers, I think he has collected a great set of examples and generously made them available. His linked table of contents is inviting and useful. For the Regression Problems chapter, it includes the topics below (a quick R sketch illustrating a few of them follows the list):

Some plots to explore a regression
Overfit
Underfit
Influential points
Influential clusters
Non gaussian residuals
Heteroskedasticity
Correlated errors
Unidentifiability
Missing values
Extrapolation
Miscellaneous
The curse of dimension
Wide problems
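
Several of these problems - influential points, non-Gaussian residuals, heteroskedasticity - are exactly what the standard lm() diagnostic plots in R are designed to reveal. Here is a minimal sketch of my own (toy data, not from the guide) as a reminder of where to start:

# Toy data: error variance grows with x (heteroskedasticity) plus one
# added influential point; none of this comes from the guide.
set.seed(1)
x <- 1:50
y <- 2 + 0.5 * x + rnorm(50, sd = x / 10)
y[50] <- 60
fit <- lm(y ~ x)

# The standard diagnostic plots: residuals vs. fitted, normal Q-Q,
# scale-location, and residuals vs. leverage (with Cook's distance contours)
par(mfrow = c(2, 2))
plot(fit)

# A quick numeric check for influential points
which(cooks.distance(fit) > 4 / length(x))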



His blog is not focused on statistics or on R, but I found reviews and examples of interesting graphics using R, for example, this one.

Statistical Learning for Predictive Modeling

I have been applying boosted regression tree models in a study whose goal is to identify patients with chronic conditions who might benefit from additional care.

It is natural to ask: why use a relatively unfamiliar method like this when logistic regression is the standard, widely used approach? This entry is my first attempt at addressing this question in writing - my apologies, it is likely to fall far short of the mark! Eventually, I hope to "get it right" and be able to refer back to a clear exposition. For now, here goes:

First, it helps to know more about the prediction problem at hand. An expert medical/health services team has assembled a dataset consisting of available diagnostic and health care indicators for a population of over 3.5 million patients. There are over 450 variables in this set. The event to be predicted is hospitalization or death that is "potentially preventable" - so, for example, hospital admissions for childbirth, joint replacement, etc., and deaths due to automobile accidents and the like are removed.

A standard approach to this problem would be an additive logistic regression model, possibly testing for inclusion of interaction terms. With more than 3.5 million observations, there is plenty of data for model fitting and cross-validation - even with 450 predictors.
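
To make the contrast concrete, here is a minimal sketch of that standard approach in R, on simulated data with made-up variable names (age, n_dx, copd) - I cannot show the study data, and the coefficients below are purely illustrative:

# Simulated stand-in for the study data; variables and effect sizes are invented.
set.seed(2)
n   <- 5000
dat <- data.frame(age  = sample(65:95, n, replace = TRUE),
                  n_dx = rpois(n, 3),            # number of diagnoses
                  copd = rbinom(n, 1, 0.2))      # an example chronic condition
p   <- plogis(-6 + 0.04 * dat$age + 0.3 * dat$n_dx + 0.8 * dat$copd)
dat$prev_event <- rbinom(n, 1, p)                # "potentially preventable" event

# Additive logistic regression, testing one hand-picked interaction term
fit <- glm(prev_event ~ age + n_dx + copd + age:copd,
           family = binomial, data = dat)
summary(fit)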

But there is a bigger challenge to consider: with this many potentially strong predictors of a patient's health, how do we address the selection of interactions between predictors - cases where the presence of two indicators simultaneously has an impact on risk that is far greater than the sum of their individual contributions? Common sense and the medical literature tell us that health hazards escalate rapidly with age and with certain combinations of conditions and characteristics. With 450 predictors there are over 100,000 possible pairwise interaction terms alone, so it is a daunting task to consider exploring the 'space of all possible models' to select the single 'best' model for this prediction problem.

But from a statistical learning perspective, this wealth of potential predictors is ideal. These methods identify predictive trends from the data itself - they construct a model using the data, rather than "fit" a predetermined model structure to the dataset. Without going into the details of how this is done (those will follow in another post), it is intriguing to see whether a statistical learning method might have benefits over the standard approach for this modeling problem. Can a method that learns by trial and error from the data outperform the classic statistical approach for a prediction problem like this?
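
As a small preview of that future post, here is a rough sketch of the boosted regression tree side, using the gbm package and reusing the toy data frame dat from the logistic regression sketch above; the tuning values are illustrative guesses on my part, not the settings used in the study:

library(gbm)   # generalized boosted regression models

# Boosted regression trees: with interaction.depth = 3, interactions of up to
# three predictors are discovered from the data rather than specified in advance.
boost <- gbm(prev_event ~ age + n_dx + copd,
             data              = dat,
             distribution      = "bernoulli",
             n.trees           = 500,
             interaction.depth = 3,
             shrinkage         = 0.05,
             cv.folds          = 5)

best <- gbm.perf(boost, method = "cv")   # number of trees chosen by cross-validation
summary(boost, n.trees = best)           # relative influence of each predictor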

The jury is still out in the statistical and business intelligence community - there is a helpful discussion of traditional statistical (TS) versus statistical or machine learning (ML) approaches at DMReview.com in an article titled "Statistical Learning for BI, Part 1".

And this concludes my initial motivating argument for trying a statistical learning approach to prediction for this problem: it solves the problem of model specification, in that it builds its own model from the data. How is this accomplished? Check back - I will do my best to explain this in my next post.

Monday, January 5, 2009

A simple SAS proc tabulate example

PROC TABULATE is convenient for simple tables, so I wanted to post the code and the reference from SAS's support site. The article was simple enough that I could quickly construct a useful table from it.

Here is an example using my own data:

/* The example below illustrates a very simple table */

/* First the dataset */
data test;
input age_grp $ census_region $ income $ beta_only ;
datalines;
65-74 South low 1
65-74 South medium 0
85+ South low 0
65-74 South high 1
85+ West low 0
75-84 South medium 0
85+ South low 0
75-84 South high 0
85+ South low 0
85+ Midwest low 0
65-74 Midwest medium 0
65-74 West high 0
65-74 South low 0
75-84 West low 1
75-84 South low 0
85+ Midwest medium 0
75-84 South low 0
75-84 West low 1
75-84 Northeast medium 0
75-84 Northeast low 1
75-84 South medium 0
75-84 Northeast medium 1
75-84 South medium 1
75-84 Midwest medium 1
75-84 West medium 1
85+ South medium 1
65-74 West high 1
75-84 Midwest low 1
75-84 Midwest medium 0
75-84 South low 0
85+ Midwest medium 0
75-84 West low 0
75-84 South low 0
65-74 Northeast medium 1
85+ Midwest medium 0
65-74 South medium 1
75-84 Northeast medium 0
85+ Northeast medium 1
65-74 South high 0
65-74 West medium 0
85+ Midwest high 0
;
run;

/* Set up variables and a title */

%let my_cvars = age_grp census_region income;
%let trt = beta_only; /* Indicator for treatment */
title "Characteristics by treatment status for &trt";

/* Simple table */
proc tabulate data=test;

/* List categorical variables in a class statement */
class &trt &my_cvars;

/* Use keylabel to construct labels for proc tabulate tags */
keylabel n = 'N' all='Total' pctn='%' colpctn= 'Col %' rowpctn='Row %' ;

/* Define the table as: table rows, columns; */

table (&my_cvars all)
,
&trt*(n*f=comma6. pctn*f=comma6.1 colpctn*f=7.1 rowpctn*f=7.1)
;

run;