Thursday, November 12, 2009
Statnotes collection from NCSU
For a nice applied collection of notes on multivariate analysis, see G. David Garson's Statnotes. These notes are part of a graduate course, "Quantitative Research in Public Administration", at North Carolina State University (NCSU). Many of the entries include a FAQ section, as well as links to software and a list of references. All this on top of a clear explanation of the principles for each topic.
Beware of bias-amplifying covariates!
While I am collecting pointers to good discussions of adjusting for non-random selection with weighting, a recent discussion is worth flagging: including instrumental variables as covariates in these adjustments can cause problems of its own, amplifying rather than reducing bias. The post is here on Andrew Gelman's blog - the discussion that follows is very helpful as well.
Judea Pearl on IPW - a great find!
Judea Pearl's recent post on the intuition behind inverse probability weighting (IPW) is not one that I am likely to send out to non-statistical collaborators, but I found it very useful for applied statisticians who want to understand current statistical thinking on the theoretical basis for IPW, and how that thinking should guide the selection of variables to include in a model for the probabilities in question.
For anyone working with observational studies, where selection into treatment is non-random, or with survey data, where response rates are a concern and non-response bias is possible, the technique of re-weighting the observed data to compensate for the observational design is worth a good look. Guidance on how to think about the weighting models is much needed - and I believe this post goes a long way towards providing it! I hope to post an application note or two once I've made some headway on at least two projects where this will be useful.
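As a minimal sketch of the mechanics (not of Pearl's argument about variable selection), here is roughly what IPW for non-random treatment selection looks like in R, with made-up data and variable names:

# Minimal IPW sketch on simulated data: 'treat' is a 0/1 treatment indicator,
# 'y' the outcome, and x1, x2 are covariates thought to drive selection.
set.seed(1)
n  <- 1000
x1 <- rnorm(n); x2 <- rbinom(n, 1, 0.4)
treat <- rbinom(n, 1, plogis(-0.5 + 0.8 * x1 + 0.6 * x2))
y  <- 1 + 2 * treat + x1 + rnorm(n)

# Step 1: model the probability of treatment (the propensity score)
ps_model <- glm(treat ~ x1 + x2, family = binomial)
ps <- predict(ps_model, type = "response")

# Step 2: weight each subject by the inverse probability of the treatment
# actually received
w <- ifelse(treat == 1, 1 / ps, 1 / (1 - ps))

# Step 3: weighted comparison of outcomes (here, a weighted regression).
# In practice the standard errors for weighted estimates need care
# (e.g. robust/sandwich estimators); this is only the basic recipe.
summary(lm(y ~ treat, weights = w))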
Wednesday, October 21, 2009
Links for GLMs
Modeling beyond OLS
How do you know whether linear regression - or another modeling method - should be used to analyze data for a particular problem?
Here are a few things to consider: the distribution of the outcome variable, the sampling design, the structure of the data, the questions to be answered or quantities to be estimated, and the assumptions that can be made about key variables and/or error structures. (A small sketch at the end of this post illustrates the first of these.)
It gets complicated to say more without a specific example at hand. But for researchers who are familiar with linear and logistic regression and who encounter analysis problems where these methods are no longer appropriate, some general guidance is much needed. Software vendors often provide this kind of guidance, and consulting statisticians can be helpful as well, but it is not always easy to identify suitable references - written for non-statisticians - on the rationale for choosing methods beyond the well-known ones. I would definitely appreciate hearing about texts and articles on modeling methods that are targeted towards clinicians.
In the next few posts I hope to address the issue by exploring some of the questions that arise, with links to sources that I have found useful. Here is one that I have yet to explore in depth - it is from NC State, and could be a great resource: NC State Statnotes.
I'll be looking for feedback on this and other resources that I find, such as:
- Is the level of this resource suitable for medical or health policy researchers?
- Is the organization of the content appropriate?
- Are you able to find the answer to your question? Is the answer useful? Adequate to your needs?
- Is background knowledge needed in order to effectively use this resource?
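As a small, oversimplified sketch of the first consideration above - how the distribution of the outcome variable drives the choice of model - here is what the corresponding R calls might look like, using made-up variables:

# Hypothetical data: a continuous outcome, a binary outcome, and a count
# outcome measured on the same subjects, each with a single predictor x.
set.seed(1)
x <- rnorm(200)
y_cont  <- 1 + 0.5 * x + rnorm(200)           # continuous outcome
y_bin   <- rbinom(200, 1, plogis(-0.2 + x))   # binary outcome
y_count <- rpois(200, exp(0.3 + 0.4 * x))     # count outcome

fit_ols     <- lm(y_cont ~ x)                       # ordinary linear regression
fit_logit   <- glm(y_bin ~ x, family = binomial)    # logistic regression
fit_poisson <- glm(y_count ~ x, family = poisson)   # Poisson (count) regression

Of course, the sampling design, data structure, and estimation goals can all push the choice away from these defaults - which is exactly where the guidance discussed above is needed.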
Thursday, July 9, 2009
Statistical Learning Web Service Post
Wednesday, April 29, 2009
Visualizing Correlation Matrices in R
For most of us, it is not easy to process a numeric representation of a correlation matrix much larger than 4x4. In R, the "pairs" graph can be useful for up to about 10x10 matrices, but the R graph gallery has another option, here. This comes from Andrew Gelman's blog, where there are other examples too (be sure to read the comments). In one case the matrix format is abandoned entirely - variables are arranged along the circumference of a circle - see here.
Some of the discussion has to do with using the arrangement of variables to guide exploratory analysis - in the matrix format, for example, by ordering variables according to the expected strength of correlation. Graphical 'testing' at the exploratory stage can save a lot of time and trouble, provided it is not too time-consuming itself!
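As a quick base-R sketch of the ordering idea (much less polished than the linked gallery examples, and using a built-in dataset purely for illustration), one can reorder the variables by clustering the correlation matrix before plotting it:

# Reorder variables by clustering on 1 - |r|, then plot the matrix as a heatmap.
cm  <- cor(mtcars)                          # built-in example data
ord <- hclust(as.dist(1 - abs(cm)))$order   # put strongly correlated variables together
heatmap(cm[ord, ord], Rowv = NA, Colv = NA, scale = "none",
        main = "Correlation matrix, clustered variable order")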
Tuesday, March 31, 2009
The Unreasonable Effectiveness of Data
I really liked this article - mostly because it supports a perspective that I am betting on: that statistical learning methods, given large enough datasets to train on, can outperform more conventional modeling approaches. The article itself is concerned with natural language processing - extracting meaning, or understanding, from human speech and writing - while my focus is on extracting "meaning" from electronic medical records data, for example a patient's diagnosis codes and medical service use records. Might there be an analogy? For my problem, the "meaning" of interest might be encoded in the data as whether the patient gets better or not.
There is much discussion of the issues of semantic interpretation and semantic web, which this article really engages - but as this is my first taste of this discussion, I refer the interested reader to the posts that led me to the article:
See here and here for a great start - both authors have given a lot of thought to Semantic Web vision, issues, and the potential it represents. Their responses to the attitude expressed by Google strike me as spot on (in my necessarily humble opinion).
My focus is more on the approach to natural language processing - especially the approach of finding meaning by using vast amounts of "data in the wild", that is, natural language samples available on the web, as a training set - versus alternative approaches that are more structured.
It may seem like a stretch, but a big problem in health and medical outcomes research is to understand a patient's medical status - captured imperfectly in a list of diagnosis codes and other administrative records - as it relates to their health outcomes. Is this like natural language processing, or translation? It could be, if you think of strings of co-occurring diagnoses as the "phrases" and health outcomes as the "translation", or meaning. Then, for a population of patients, the set of all diagnosis codes from all sources represents a kind of "data in the wild".
The message then could be that if we could aggregate all of the health care data by patient, we would have a similarly huge training set from which to extract meaning.
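As a toy illustration of the "phrase" idea - with entirely hypothetical patients and diagnosis codes - one could tabulate co-occurring code pairs per patient in R:

# Count co-occurring diagnosis-code pairs ("phrases") within patients.
# The data frame and codes below are made up for illustration only.
dx <- data.frame(
  patient = c(1, 1, 1, 2, 2, 3),
  code    = c("E11", "I10", "N18", "E11", "I10", "I50"),
  stringsAsFactors = FALSE
)
pairs_by_patient <- lapply(split(dx$code, dx$patient), function(codes) {
  if (length(unique(codes)) < 2) return(NULL)
  t(combn(sort(unique(codes)), 2))          # all code pairs for this patient
})
pair_matrix <- do.call(rbind, pairs_by_patient)
table(paste(pair_matrix[, 1], pair_matrix[, 2], sep = " + "))   # "phrase" frequencies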
Returning to the article at hand, then, here is what we find:
Lessons learned
Here are the major signposts in the article - and the most valuable quotes, for my application:
"The first lesson of Web-scale learning is to use available large-scale data rather than hoping for annotated data that isn't available. For instance, we find that useful semantic relationships can be automatically learned from the statistics of search queries and the corresponding results or from the accumulated evidence of Web-based text patterns and formatted tables, in both cases without needing any manually annotated data."
This leads me to wonder whether we, too, might find that the data in 3.6 million patients' records can tell us what is most predictive - versus our trying to consolidate predictors or construct new indicators of a potentially worsening condition.
"Another important lesson from statistical methods in speech recognition and machine translation is that memorization is a good policy if you have a lot of training data."
That is, find the patterns that are in the data rather than deciding in advance what is meaningful (more general patterns, for example) and looking only for them.
"Instead of assuming that general patterns are more effective than memorizing specific phrases, today's translation models introduce general rules only when they improve translation over just memorizing particular phrases (for instance, in rules for dates and numbers). Similar observations have been made in every other application of machine learning to Web data: simple n-gram models or linear classifiers based on millions of specific features perform better than elaborate models that try to discover general rules."
That's what I wanted to hear. I sure would love to see that list of "every other application of machine learning to Web data" - maybe I'll write and ask for it. But my experience using statistical learning approaches with large healthcare datasets has been similar.
Here is an excerpt that distinguishes between semantic interpretation goals and the goals of the semantic web - I found this very helpful (and only hope I did actually understand!):
"The problem of understanding human speech and writing—the semantic interpretation problem—is quite different from the problem of software service interoperability. Semantic interpretation deals with imprecise, ambiguous natural languages, whereas service interoperability deals with making data precise enough that the programs operating on the data will function effectively."
Finally - advice that I agree with, so far:
"So, follow the data. Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data. Represent all the data with a nonparametric model rather than trying to summarize it with a parametric model, because with very large data sources, the data holds a lot of detail. ... invariably, simple models and a lot of data trump more elaborate models based on less data."
Tuesday, March 24, 2009
Graphical Assessment of Fit for Logistic Regression
This deserves a good read - it uses the breast cancer dataset and proposes using Bayesian Marginal Model Plots. Software for WinBUGS and R that implements this is available, and a link is included.
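The Bayesian Marginal Model Plots themselves require the linked software; the much simpler, non-Bayesian idea of comparing binned observed outcomes to model predictions can be sketched in R with simulated data (this is not the paper's method, just the basic graphical check):

# Binned observed-vs-predicted plot for a logistic regression, on simulated data.
set.seed(1)
n <- 2000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 1.5 * x))
fit <- glm(y ~ x, family = binomial)
p   <- predict(fit, type = "response")

bins <- cut(p, breaks = quantile(p, probs = seq(0, 1, 0.1)), include.lowest = TRUE)
obs  <- tapply(y, bins, mean)   # observed event rate within each decile of predicted risk
pred <- tapply(p, bins, mean)   # mean predicted risk within each decile
plot(pred, obs, xlab = "Mean predicted probability", ylab = "Observed proportion",
     main = "Binned observed vs. predicted")
abline(0, 1, lty = 2)           # perfect agreement line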
Monday, March 2, 2009
Predictive Comparisons for Non-additive Models... or, Gelman on the Practical Virtues of Statistical Humility
See this paper from Sociological Methodology 2007, titled "Average predictive comparisons for models with nonlinearity, interactions, and variance components".
This paper addresses the issue of making "predictions" based on a model where causality has not been established. A bonus, in my view, is that right up front Gelman introduces a way of referring to the "predictions" that is at once descriptive and, hopefully, understandable by non-statisticians - without being misleading! The term "predictive comparison" is defined in his Eq. 1 and is said to "correspond to an expected causal effect under a counterfactual assumption (Neyman, 1923, Rubin, 1974, 1990), if it makes sense to consider the inputs causal". How much nicer and clearer it is to refer to this estimate as a "predictive comparison" in cases where the inputs are not known to be causal. See his recent post for more on the choice of words used to communicate results: "Describing descriptive studies using descriptive language, or the practical virtues of statistical humility".
His statistical contribution in the article is a carefully presented demonstration that, when the model is not linear and additive, model-based estimates of the average difference in the outcome variable for two values of the covariate of interest (the "predictor") should be averaged over the values of the remaining covariates, as in his Eq. 2 (see Figure 2). After studying Figure 2, you will see how plugging in point estimates for those covariates can get you into trouble. Implementing the improved methodology is a matter for another post...
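A rough R sketch of the averaging idea - the basic recipe of Eq. 2, not Gelman's full machinery, and on made-up data:

# Average predictive comparison for a binary input u in a non-additive logistic
# model, averaging over the observed values of the other covariate v.
set.seed(1)
n <- 1000
u <- rbinom(n, 1, 0.5)
v <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 1.2 * u + 0.8 * v + 0.5 * u * v))  # non-additive truth
fit <- glm(y ~ u * v, family = binomial)

# Predict for everyone with u set to 1 and then to 0, holding each person's v fixed,
# and average the difference over the data rather than plugging in a single v.
d1 <- transform(data.frame(u = u, v = v), u = 1)
d0 <- transform(data.frame(u = u, v = v), u = 0)
apc <- mean(predict(fit, newdata = d1, type = "response") -
            predict(fit, newdata = d0, type = "response"))
apc   # average predictive comparison for u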
Friday, February 20, 2009
R Graph Gallery
This comes from Andrew Gelman's blog - a great collection of graphs with underlying R code:
http://addictedtor.free.fr/graphiques/thumbs.php?sort=votes
There is a graph showing cross-validation ROC plots that looks really useful - along with many others!
Tuesday, February 10, 2009
Automated variable selection and model stability in logistic regression
Clinical researchers are often in the position of having a limited set of observations which they would like to use to determine factors that are predictive of a negative outcome. Logistic regression with automated model selection is commonly used for this purpose. In backward elimination, a "full model" containing all of the candidate predictors is used as a first guess. Variables are then sequentially eliminated from the model until a pre-specified stopping rule is satisfied: at each step, the variable whose removal would result in the smallest decrease in a summary measure of model fit (for example, the R-squared or the model deviance) is eliminated. A common stopping rule is that the remaining variables all satisfy an arbitrary cutoff for statistical significance.
The problem with this approach is that small changes in the data set can result in a completely different model being selected. This undermines the purpose of the modeling exercise entirely, since it means that the predictor set derived from one set of data may not be useful in predicting the outcome for a different set of observations.
In a 2004 paper by Peter Austin and Jack Tu of the University of Toronto, the difficulties with this approach are made clear in a study of models to predict acute myocardial infarction mortality. They draw 1,000 bootstrap samples from their dataset and fit a model using backward selection to each sample, starting from a collection of twenty-nine candidate predictors. While they find three variables that are identified as independent predictors of mortality in all 1,000 of the bootstrap samples, an astonishing 18 of the 29 candidates are selected in fewer than half of the bootstrap samples.
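A minimal R sketch of this kind of stability check, using simulated data and step() with AIC in place of the paper's significance-based stopping rule:

# Bootstrap the data, run backward elimination on each resample, and count how
# often each candidate predictor is retained. Data and selection rule are
# illustrative only (step() uses AIC, not a p-value cutoff).
set.seed(1)
n <- 500
X <- as.data.frame(matrix(rnorm(n * 10), ncol = 10))
names(X) <- paste0("x", 1:10)
X$y <- rbinom(n, 1, plogis(0.8 * X$x1 - 0.6 * X$x2))   # only x1 and x2 matter

B <- 200
selected <- sapply(1:B, function(b) {
  d    <- X[sample(nrow(X), replace = TRUE), ]
  full <- glm(y ~ ., data = d, family = binomial)
  kept <- names(coef(step(full, direction = "backward", trace = 0)))
  as.numeric(paste0("x", 1:10) %in% kept)
})
setNames(rowSums(selected) / B, paste0("x", 1:10))   # selection frequency per predictor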
The article is published in the Journal of Clinical Epidemiology, 57 (2004) p. 1138-1146.
Saturday, January 17, 2009
Regression Problems - a catalog
There is a nice chapter on Regression Problems in a work-in-progress guide aptly titled "Statistics with R", by Vincent Zoonekynd. It is really a collection of notes that the author wrote for himself, much like this blog! I find it very handy - despite the author's disclaimers, I think he has collected a great set of examples and generously made them available. His linked table of contents is inviting and useful. For the Regression Problems chapter, it includes:
Some plots to explore a regression (see the sketch after this list)
Overfit
Underfit
Influential points
Influential clusters
Non gaussian residuals
Heteroskedasticity
Correlated errors
Unidentifiability
Missing values
Extrapolation
Miscellaneous
The curse of dimension
Wide problems
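To illustrate the first item above - some plots to explore a regression - here is a minimal base-R starting point on built-in data (nothing specific to Zoonekynd's notes):

# The standard diagnostic plots for a fitted linear model.
fit <- lm(mpg ~ wt + hp, data = mtcars)
par(mfrow = c(2, 2))
plot(fit)          # residuals vs. fitted, Q-Q plot, scale-location, leverage
par(mfrow = c(1, 1))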
His blog is not focused on statistics or on R, but I found reviews and examples of interesting graphics using R, for example, this one.
Statistical Learning for Predictive Modeling
I have been applying boosted regression tree models in a study whose goal is to identify patients with chronic conditions who might benefit from additional care.
It is natural to ask: why use an unfamiliar method like this when the standard logistic regression approach is more familiar and widely used? This entry is my first attempt at addressing the question in writing - my apologies, it is likely to fall far short of the mark! Eventually, I hope to "get it right" and be able to refer back to a clear exposition. For now, here goes:
First, it helps to know more about the prediction problem at hand. An expert medical/health services team has assembled a dataset consisting of available diagnostic and health care indicators for a population of over 3.5 million patients. There are over 450 variables in this set. The event to be predicted is hospitalization or death that is "potentially preventable" - so, for example, hospital admissions for childbirth, joint replacement, etc., and deaths due to automobile accidents and the like are removed.
A standard approach to this problem would be an additive logistic regression model, possibly testing for the inclusion of interaction terms. With over 3.5 million observations, there is plenty of data for model-fitting and cross-validation - even with 450 predictors.
But there is a bigger challenge to consider: with this many potentially strong predictors of a patient's health, how do we handle the selection of interactions between predictors - cases where the presence of two indicators simultaneously has an impact on risk far greater than the sum of their individual contributions? Common sense and the medical literature tell us that health hazards escalate rapidly with age and with certain combinations of conditions and characteristics. It is a daunting task to consider exploring the 'space of all possible models' to select the single 'best' model for this prediction problem.
But from a statistical learning perspective, this wealth of potential predictors is ideal. The approach taken by these methods is to identify predictive trends from the data itself, to construct a model using the data - rather than "fit" a predetermined model structure to the dataset. Without going into details of how this is done here (this will follow in another post), it is intriguing to see if a statistical learning method might have benefits over the standard approach for this modeling problem. Can a method that learns by trial and error from the data outperform the classic statistical approach, for a prediction problem like this?
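To make the idea concrete, here is a rough sketch of what fitting such a model might look like with R's gbm package (assuming it is installed) - simulated data, made-up variable names, and illustrative rather than recommended tuning values:

# A boosted regression tree for a binary outcome using the gbm package.
library(gbm)
set.seed(1)
n <- 5000
dat <- data.frame(age = rnorm(n, 70, 10),
                  chf = rbinom(n, 1, 0.2),     # hypothetical condition indicators
                  ckd = rbinom(n, 1, 0.15))
dat$event <- rbinom(n, 1, plogis(-4 + 0.03 * dat$age + 1.0 * dat$chf * dat$ckd))

fit <- gbm(event ~ age + chf + ckd, data = dat,
           distribution = "bernoulli",         # logistic loss for a 0/1 outcome
           n.trees = 500, interaction.depth = 3, shrinkage = 0.05,
           cv.folds = 5)
best <- gbm.perf(fit, method = "cv")           # number of trees chosen by cross-validation
summary(fit, n.trees = best)                   # relative influence of each predictor

Note how interactions are never specified by hand - the trees discover them (up to the chosen depth) from the data, which is exactly the appeal for a problem like this one.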
The jury is still out in the statistical and business intelligence community - there is a helpful discussion of traditional statistical (TS) versus statistical or machine learning (ML) approaches at DMReview.com in an article titled "Statistical Learning for BI, Part 1".
And this concludes my initial motivating argument for trying a statistical learning approach to prediction for this problem: it solves the problem of model specification, in that it builds its own model from the data. How is this accomplished? Check back - I will do my best to explain this in my next post.
Monday, January 5, 2009
A simple SAS proc tabulate example
PROC TABULATE is convenient for simple tables, so I wanted to post the code along with the reference from SAS's support site. The article was clear enough that I could quickly construct a useful table.
Here is an example using my own data:
/* The example below illustrates a very simple table */
/* First the dataset */
data test;
input age_grp $ census_region $ income $ beta_only ;
datalines;
65-74 South low 1
65-74 South medium 0
85+ South low 0
65-74 South high 1
85+ West low 0
75-84 South medium 0
85+ South low 0
75-84 South high 0
85+ South low 0
85+ Midwest low 0
65-74 Midwest medium 0
65-74 West high 0
65-74 South low 0
75-84 West low 1
75-84 South low 0
85+ Midwest medium 0
75-84 South low 0
75-84 West low 1
75-84 Northeast medium 0
75-84 Northeast low 1
75-84 South medium 0
75-84 Northeast medium 1
75-84 South medium 1
75-84 Midwest medium 1
75-84 West medium 1
85+ South medium 1
65-74 West high 1
75-84 Midwest low 1
75-84 Midwest medium 0
75-84 South low 0
85+ Midwest medium 0
75-84 West low 0
75-84 South low 0
65-74 Northeast medium 1
85+ Midwest medium 0
65-74 South medium 1
75-84 Northeast medium 0
85+ Northeast medium 1
65-74 South high 0
65-74 West medium 0
85+ Midwest high 0
;
run;
/* Set up variables and a title */
%let my_cvars = age_grp census_region income;
%let trt = beta_only; /* Indicator for treatment */
title "Characteristics by treatment status for &trt";
/* Simple table */
proc tabulate data=test;
/* List categorical variables in a class statement */
class &trt &my_cvars;
/* Use keylabel to construct labels for proc tabulate tags */
keylabel n = 'N' all='Total' pctn='%' colpctn= 'Col %' rowpctn='Row %' ;
/* Define the table as: table rows, columns; */
table (&my_cvars all)
,
&trt*(n*f=comma6. pctn*f=comma6.1 colpctn*f=7.1 rowpctn*f=7.1)
;
run;