Tuesday, March 31, 2009

The Unreasonable Effectiveness of Data

I really liked this article - mostly because it supports a perspective that I am betting on: that statistical learning methods, with large enough datasets to train on, can outperform more conventional modeling approaches. The article above is really concerned with natural language processing - extracting meaning, or understanding, from human speech and writing - while my focus is on extracting "meaning" from electronic medical records data: for example, a patient's diagnosis codes and medical service use records. Might there be an analogy? For my problem, the "meaning" of interest might be encoded in the data as whether the patient gets better or not.

There is much discussion of the issues of semantic interpretation and the Semantic Web, which this article really engages - but as this is my first taste of that discussion, I refer the interested reader to the posts that led me to the article:
See here and here for a great start - both authors have given a lot of thought to the Semantic Web vision, its issues, and the potential it represents. Their responses to the attitude expressed by Google strike me as spot on (in my necessarily humble opinion).

My focus is more on the approach to natural language processing - especially finding meaning by using vast amounts of "data in the wild" (natural language samples available on the web) as a training set - versus alternative approaches that are more structured.

It may seem like a stretch, but a big problem in health and medical outcomes research is to understand a patient's medical status - captured imperfectly in a list of diagnosis codes and other administrative records - as it relates to their health outcomes. Is this like natural language processing, or translation? If you think of strings of co-occurring diagnoses as the "phrases" and health outcomes as the "translation" - the meaning - then the analogy seems possible. For a population of patients, the set of all diagnosis codes from all sources then represents a kind of "data in the wild".

The message, then, could be that if we aggregated all of the health care data by patient, we would have a similarly huge training set from which to extract meaning.

Returning to the article at hand, then, here is what we find:

Lessons learned

Here are the major signposts in the article - and the most valuable quotes, for my application:
The first lesson of Web-scale learning is to use available large-scale data rather than hoping for annotated data that isn’t available. For instance, we find that useful semantic relationships can be automatically learned from the statistics of search queries and the corresponding results or from the accumulated evidence of Web-based text patterns and formatted tables, in both cases without needing any manually annotated data.
This leads me to wonder whether we, too, might find that the data in 3.6 million patients' records can tell us what is most predictive - vs. our trying to consolidate predictors or construct new indicators of a potentially worsening condition.
Another important lesson from statistical methods in speech recognition and machine translation is that memorization is a good policy if you have a lot of training data.
That is, find the patterns that are in the data rather than deciding what is meaningful (more general patterns, for example) in advance and looking only for them.
Instead of assuming that general patterns are more effective than memorizing specific phrases, today’s translation models introduce general rules only when they improve translation over just memorizing particular phrases (for instance, in rules for dates and numbers). Similar observations have been made in every other application of machine learning to Web data: simple n-gram models or linear classifiers based on millions of specific features perform better than elaborate models that try to discover general rules.
That's what I wanted to hear. I sure would love to see that list of "every other application of machine learning to Web data". Maybe I'll write and ask for it. But my experience using statistical learning approaches with large healthcare datasets has been similar.
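It is tempting to try a toy version of this on claims-style data. The sketch below is mine, not the article's: all data are simulated, the code vocabulary is a small stand-in for a real one, and it assumes the glmnet and Matrix packages. The idea is to treat each diagnosis code as a binary "word" feature and let an L1-penalized logistic regression pick out the predictive codes, rather than hand-building composite indicators.

## Hedged sketch: let a sparse, penalized classifier find predictive codes.
## All data simulated; nothing here comes from real patient records.
library(Matrix)
library(glmnet)

set.seed(1)
n_patients <- 5000
n_codes    <- 2000                      # stand-in for a much larger vocabulary

## Sparse patient-by-code indicator matrix, roughly 10 codes per patient
nz <- 10 * n_patients
X  <- sparseMatrix(i = sample(n_patients, nz, replace = TRUE),
                   j = sample(n_codes,    nz, replace = TRUE),
                   x = 1, dims = c(n_patients, n_codes))
X  <- 1 * (X > 0)                       # collapse duplicate entries to 0/1

## Outcome depends on a handful of codes; the rest are noise
beta <- rep(0, n_codes)
beta[1:15] <- 1
y <- rbinom(n_patients, 1, plogis(as.numeric(X %*% beta) - 2))

## Cross-validated lasso "memorizes" the specific informative features
cv <- cv.glmnet(X, y, family = "binomial")
b  <- as.matrix(coef(cv, s = "lambda.min"))
head(rownames(b)[b != 0])               # selected codes (plus intercept)

In a run like this the nonzero coefficients should concentrate largely on the fifteen truly informative codes - the data, not the analyst, decides which features matter.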

Here is an excerpt that distinguishes between semantic interpretation goals and the goals of the semantic web - I found this very helpful (and I only hope I actually understood it!):
The problem of understanding human speech and writing—the semantic interpretation problem—is quite different from the problem of software service interoperability. Semantic interpretation deals with imprecise, ambiguous natural languages, whereas service interoperability deals with making data precise enough that the programs operating on the data will function effectively.
Finally - advice that I agree with, so far:
So, follow the data. Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data. Represent all the data with a nonparametric model rather than trying to summarize it with a parametric model, because with very large data sources, the data holds a lot of detail.
... invariably, simple models and a lot of data trump more elaborate models based on less data.
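To make the parametric-versus-nonparametric contrast concrete, here is a toy illustration of my own in base R - everything simulated: with plenty of observations, a local smoother retains detail in the conditional mean that a two-parameter linear summary averages away.

## Toy contrast: parametric summary vs. nonparametric smoother (simulated data)
set.seed(2)
n <- 10000
x <- runif(n, 0, 10)
y <- sin(x) + 0.3 * x + rnorm(n, sd = 0.5)

para  <- lm(y ~ x)                             # two parameters summarize everything
nonpa <- loess(y ~ x, span = 0.1, degree = 1)  # local fits keep the detail

grid <- seq(0.1, 9.9, by = 0.1)
plot(grid, sin(grid) + 0.3 * grid, type = "l", lwd = 2,
     xlab = "x", ylab = "E[y | x]")
lines(grid, predict(para,  data.frame(x = grid)), col = "red",  lty = 2)
lines(grid, predict(nonpa, data.frame(x = grid)), col = "blue", lty = 3)
legend("topleft", c("truth", "linear model", "loess"),
       col = c("black", "red", "blue"), lty = 1:3)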

Tuesday, March 24, 2009

Graphical Assessment of Fit for Logistic Regression

This deserves a good read - it uses the breast cancer dataset and proposes using Bayesian Marginal Model Plots. Software for WinBUGS and R that implements this is available, and a link is included.
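Until I work through the linked software, here is a rough, non-Bayesian sketch of the underlying idea as I understand it: compare a smooth of the observed 0/1 outcomes with a smooth of the model's fitted probabilities, plotted against a covariate. The paper's Bayesian version adds posterior uncertainty to this comparison; the data and the deliberately misspecified model below are simulated stand-ins.

## Hedged sketch of a marginal model plot for a logistic regression
set.seed(3)
n  <- 2000
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(-0.5 + x1 + x1^2 - x2))  # truth is curved in x1

fit <- glm(y ~ x1 + x2, family = binomial)          # misses the curvature

plot(x1, y, pch = ".", col = "grey",
     main = "Marginal model plot for x1")
lines(lowess(x1, y),           col = "blue", lwd = 2)          # data smooth
lines(lowess(x1, fitted(fit)), col = "red",  lwd = 2, lty = 2) # model smooth
legend("topleft", c("smooth of y", "smooth of fitted"),
       col = c("blue", "red"), lty = 1:2)

Where the two smooths diverge, the model is failing to track the data - here, toward the extremes of x1, where the omitted quadratic term matters most.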

Monday, March 2, 2009

Predictive Comparisons for Non-additive Models... or, Gelman on the Practical Virtues of Statistical Humility

See this paper from Sociological Methodology 2007, titled "Average predictive comparisons for models with nonlinearity, interactions, and variance components".

This paper addresses the issue of making "predictions" based on a model - but where causality is not established. A bonus, in my view, is that right up front, Gelman introduces a way of referring to the "predictions" that is at once descriptive and, hopefully, understandable by non-statisticians as well - without being misleading! The term "predictive comparison" is defined in his Eq. 1, and is said to "correspond to an expected causal effect under a counterfactual assumption (Neyman, 1923; Rubin, 1974, 1990), if it makes sense to consider the inputs causal". How much nicer and clearer it is to refer to this estimate as a "predictive comparison" in cases where the inputs are not known to be causal. See his recent post for more on the choice of words to communicate results: "Describing descriptive studies using descriptive language, or the practical virtues of statistical humility".

His statistical contribution in the article is a carefully presented demonstration: when the model is not linear and additive, model-based estimates of the average difference in the outcome variable for two values of the covariate of interest - the "predictor" - should be averaged over the values of the remaining covariates, as in his Eq. 2 (see Figure 2). After studying Figure 2, you will see how using point estimates for these covariates can get you into trouble. Implementing the improved methodology is a matter for another post...
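Still, the core averaging step is simple enough to sketch now. The R code below is my own toy rendering of the idea as I read his Eq. 2 - simulated data and a deliberately non-additive logistic model - contrasting the averaged comparison with the plug-in-at-the-mean shortcut.

## Hedged sketch: average predictive comparison vs. plug-in at covariate means
set.seed(4)
n <- 5000
u <- rbinom(n, 1, 0.4)                                # input of interest
v <- rnorm(n)                                         # remaining covariate
y <- rbinom(n, 1, plogis(-1 + u + 1.5 * v + u * v))   # non-additive truth

fit <- glm(y ~ u * v, family = binomial)

## Average the predicted difference over the observed distribution of v
p1  <- predict(fit, data.frame(u = 1, v = v), type = "response")
p0  <- predict(fit, data.frame(u = 0, v = v), type = "response")
apc <- mean(p1 - p0)

## Plug-in shortcut: hold v at its mean
plug <- as.numeric(
  predict(fit, data.frame(u = 1, v = mean(v)), type = "response") -
  predict(fit, data.frame(u = 0, v = mean(v)), type = "response"))

c(average_predictive_comparison = apc, plug_in_at_means = plug)

With the interaction and the logit's curvature in play, the two numbers need not agree; the averaged version is the one that matches what you would see comparing u = 1 to u = 0 across the population you actually have.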