This comes from Andrew Gelman's blog - a great collection of graphs with underlying R code:
http://addictedtor.free.fr/graphiques/thumbs.php?sort=votes
There is a graph showing cross-validation ROC plots that looks really useful - along with many others!
Friday, February 20, 2009
Tuesday, February 10, 2009
Automated variable selection and model stability in logistic regression
Clinical researchers are often in the position of having a limited set of observations which they would like to use to determine factors that are predictive of a negative outcome. Logistic regression with automated model selection is commonly used for this purpose. In backward elimination, a "full model" containing all of the candidate predictors is used as a first guess. Variables are then sequentially eliminated from the model until a pre-specified stopping rule is satisfied. At each step, the variable whose removal would result in the smallest decrease in a summary measure of model fit (for example, the R-squared, or the model deviance) is eliminated. A common stopping rule is that all of the remaining variables satisfy an arbitrary cutoff for statistical significance.
The problem with this approach is that small changes in the data set can result in a completely different model being selected. This undermines the purpose of the modeling exercise entirely, since it means that the predictor set derived from one set of data may not be useful in predicting the outcome for a different set of observations.
A 2004 paper by Peter Austin and Jack Tu at the University of Toronto makes the difficulties with this approach clear in a study of models predicting mortality after acute myocardial infarction. They drew 1,000 bootstrap samples from their dataset and fit a model by backward selection to each sample, starting from a collection of twenty-nine candidate predictors. While three variables were identified as independent predictors of mortality in all 1,000 of the bootstrap samples, an astonishing 18 of the 29 were selected in fewer than half of them.
The article is published in the Journal of Clinical Epidemiology, 57 (2004) p. 1138-1146.