There is much discussion of the issues of semantic interpretation and the Semantic Web, which this article engages directly - but as this is my first taste of that discussion, I refer the interested reader to the posts that led me to the article:
See here and here for a great start - both authors have given a lot of thought to the Semantic Web vision, its issues, and the potential it represents. Their responses to the attitude expressed by Google strike me as spot on (in my necessarily humble opinion).
My focus is more on the approach to natural language processing - especially on finding meaning by using vast amounts of "data in the wild", that is, natural language samples available on the web, as a training set - versus alternative, more structured approaches.
It may seem like a stretch, but a big problem in health and medical outcomes research is to understand a patient's medical status, as captured imperfectly in a list of diagnosis codes and other administrative records, as it relates to their health outcomes. Is this like natural language processing, or translation? If you think of strings of co-occurring diagnoses as the "phrases", and health outcomes as the "translation" - or the meaning - the analogy holds. Then, for a population of patients, the set of all diagnosis codes from all sources represents a kind of "data in the wild".
The message, then, could be that if we could aggregate all of the health care data by patient, we would have a similarly huge training set from which to extract meaning.
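To make the analogy concrete, here is a minimal sketch in Python - entirely synthetic records, with hypothetical patient IDs and illustrative ICD-10 codes - of assembling each patient's co-occurring diagnosis codes into a "phrase" whose "translation" is the observed outcome:

```python
from collections import defaultdict

# Synthetic administrative records: (patient_id, diagnosis_code) pairs pulled
# from several sources. The identifiers and ICD-10 codes are illustrative only.
diagnosis_records = [
    ("p1", "E11.9"), ("p1", "I10"), ("p1", "N18.3"),
    ("p2", "I10"),   ("p2", "I50.9"),
    ("p3", "E11.9"), ("p3", "I10"),
]

# A per-patient health outcome (say, 1 = hospitalized within a year), also synthetic.
outcomes = {"p1": 1, "p2": 1, "p3": 0}

# Aggregate every code observed for a patient, regardless of which source it came from.
codes_by_patient = defaultdict(set)
for pid, code in diagnosis_records:
    codes_by_patient[pid].add(code)

# Each sorted tuple of co-occurring codes is a "phrase"; the outcome is its "translation".
for pid, codes in sorted(codes_by_patient.items()):
    print(pid, tuple(sorted(codes)), "->", outcomes[pid])
```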
Returning to the article at hand, then, here is what we find:
Lessons learned
Here are the major signposts in the article - and the most valuable quotes, for my application:
The first lesson of Web-scale learning is to use available large-scale data rather than hoping for annotated data that isn’t available. For instance, we find that useful semantic relationships can be automatically learned from the statistics of search queries and the corresponding results or from the accumulated evidence of Web-based text patterns and formatted tables, in both cases without needing any manually annotated data.
This leads me to wonder whether we, too, might find that the data in 3.6 million patients' records can tell us what is most predictive - vs. our trying to consolidate predictors or construct new indicators of a potentially worsening condition.
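As a toy illustration of the "no manual annotation" point - my sketch, not anything from the article - pointwise mutual information over diagnosis-code co-occurrences will surface related codes from nothing but unlabeled records. The patients and codes below are synthetic:

```python
import math
from collections import Counter
from itertools import combinations

# Each set holds the diagnosis codes seen for one (synthetic) patient;
# no outcome labels or annotations are used anywhere below.
patient_codes = [
    {"E11.9", "I10", "N18.3"},
    {"E11.9", "I10"},
    {"I10", "I50.9"},
    {"E11.9", "N18.3"},
    {"J45.909"},
]

code_counts = Counter()
pair_counts = Counter()
for codes in patient_codes:
    code_counts.update(codes)
    pair_counts.update(frozenset(pair) for pair in combinations(sorted(codes), 2))

n = len(patient_codes)

def pmi(a, b):
    """Pointwise mutual information between two codes, from co-occurrence alone."""
    p_ab = pair_counts[frozenset((a, b))] / n
    return math.log(p_ab / ((code_counts[a] / n) * (code_counts[b] / n)))

# Pairs that co-occur more often than chance surface as "related" - the claims-data
# analogue of relationships mined from query-and-result statistics.
ranked = sorted(((pmi(a, b), a, b) for a, b in (tuple(p) for p in pair_counts)), reverse=True)
print(ranked[:3])
```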
Another important lesson from statistical methods in speech recognition and machine translation is that memorization is a good policy if you have a lot of training data.
That is, find the patterns that are in the data rather than deciding in advance what is meaningful (more general patterns, for example) and looking only for those.
Instead of assuming that general patterns are more effective than memorizing specific phrases, today’s translation models introduce general rules only when they improve translation over just memorizing particular phrases (for instance, in rules for dates and numbers). Similar observations have been made in every other application of machine learning to Web data: simple n-gram models or linear classifiers based on millions of specific features perform better than elaborate models that try to discover general rules.
That's what I wanted to hear. I sure would love to see that list of "every other application of machine learning to Web data". Maybe I'll write and ask for it. But my experience using statistical learning approaches with large healthcare datasets has been similar.
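For concreteness, here is what that recipe might look like in my setting: a plain linear classifier over very specific, memorized features (single codes and code pairs) hashed into a large sparse space. This is a hedged sketch using scikit-learn's HashingVectorizer and LogisticRegression on synthetic patients, not a claim about the article's methods:

```python
from itertools import combinations

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

# Synthetic patients: the codes each patient accumulated, and a 0/1 outcome.
patients = [
    ({"E11.9", "I10", "N18.3"}, 1),
    ({"I10", "I50.9"}, 1),
    ({"E11.9", "I10"}, 0),
    ({"J45.909"}, 0),
    ({"E11.9", "N18.3", "I50.9"}, 1),
    ({"I10"}, 0),
]

def specific_features(codes):
    """Memorize specifics: every individual code, plus every co-occurring code pair."""
    feats = sorted(codes)
    feats += ["|".join(pair) for pair in combinations(sorted(codes), 2)]
    return feats

X_raw = [specific_features(codes) for codes, _ in patients]
y = [outcome for _, outcome in patients]

# Feature hashing lets the feature space hold "millions of specific features"
# without ever building an explicit vocabulary in advance.
vectorizer = HashingVectorizer(analyzer=lambda feats: feats, n_features=2**20,
                               alternate_sign=False)
X = vectorizer.transform(X_raw)

model = LogisticRegression(max_iter=1000).fit(X, y)
print(model.predict(vectorizer.transform([specific_features({"E11.9", "I10"})])))
```

The design point of the hashing trick here is that the feature space can keep growing with the data - every specific combination ever observed gets its own coordinate - without anyone curating a feature list up front.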
Here is an excerpt that distinguishes between the goals of semantic interpretation and the goals of the Semantic Web - I found this very helpful (and can only hope I actually understood it!):
The problem of understanding human speech and writing—the semantic interpretation problem—is quite different from the problem of software service interoperability. Semantic interpretation deals with imprecise, ambiguous natural languages, whereas service interoperability deals with making data precise enough that the programs operating on the data will function effectively.
Finally - advice that I agree with, so far:
So, follow the data. Choose a representation that can use unsupervised learning on unlabeled data, which is so much more plentiful than labeled data. Represent all the data with a nonparametric model rather than trying to summarize it with a parametric model, because with very large data sources, the data holds a lot of detail. ... invariably, simple models and a lot of data trump more elaborate models based on less data.
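To close with my own sketch of the nonparametric point (an assumption-laden illustration, not the authors' code): a nearest-neighbor "model" that is nothing more than the stored patients themselves, set against a one-number parametric summary. Jaccard similarity over synthetic code sets is just a convenient stand-in here:

```python
def jaccard(a, b):
    """Similarity between two sets of diagnosis codes."""
    return len(a & b) / len(a | b) if a | b else 0.0

# The nonparametric "model" is just the stored data: (codes, outcome) pairs.
stored = [
    ({"E11.9", "I10", "N18.3"}, 1),
    ({"I10", "I50.9"}, 1),
    ({"E11.9", "I10"}, 0),
    ({"J45.909"}, 0),
]

def knn_predict(codes, k=3):
    """The answer comes from the k most similar memorized patients; no detail is summarized away."""
    neighbors = sorted(stored, key=lambda s: jaccard(codes, s[0]), reverse=True)[:k]
    return sum(outcome for _, outcome in neighbors) / k

# Parametric alternative for contrast: summarize everything as a single base rate.
base_rate = sum(outcome for _, outcome in stored) / len(stored)

print(knn_predict({"E11.9", "N18.3"}), base_rate)
```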