Monday, December 15, 2008

Case-control design for rare events modeling

In predictive modeling of rare events, such as hospitalization (or international conflict escalating to war), the rareness of the event of interest means that very large samples are needed to provide enough input information to 'learn' which predictors are most informative.

One way around this is to implement a more efficient sampling design. An obvious choice is the 'case-control' design: using all of the observations corresponding to events, and a simple random sample of non-events. For a given sample size, this provides a richer source of training data and should improve predictive performance.
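As a rough illustration of the sampling step, here is a minimal Python sketch. The function name `case_control_sample`, the toy `population`, and the choice of 60 controls are all made up for this example; nothing here comes from the paper's software.

```python
import random

def case_control_sample(rows, is_event, n_controls, seed=0):
    """Keep every event (case) plus a simple random sample of non-events (controls)."""
    rng = random.Random(seed)
    cases = [r for r in rows if is_event(r)]
    non_events = [r for r in rows if not is_event(r)]
    # Sample controls without replacement; cap at the number available.
    controls = rng.sample(non_events, min(n_controls, len(non_events)))
    return cases + controls

# Toy population (hypothetical): 1,000 records, 20 of which are events (2%).
population = [{"id": i, "y": 1 if i % 50 == 0 else 0} for i in range(1000)]

sample = case_control_sample(population, lambda r: r["y"] == 1, n_controls=60)
sample_event_rate = sum(r["y"] for r in sample) / len(sample)
```

In this toy setup the event rate rises from 2% in the population to 25% in the sample, which is exactly why predictions from a model fit to the sample need correcting afterwards.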

Because events are over-represented in such a sample, predicted probabilities from a model fit to it will be artificially high, and must be adjusted to correct for the sampling design. In Logistic Regression in Rare Events Data, Gary King and Langche Zeng develop corrections for finite-sample and rare-events bias, and for standard error inconsistency, that are useful when selecting on the outcome variable, as in a case-control study.

For the logit model, prior correction is shown to be consistent, fully efficient, and easy to apply. Explicit expressions are provided in Appendix B. Stata software that implements the methods in the paper is available from http://GKing.Harvard.Edu
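To give a feel for how simple prior correction is, here is a small Python sketch of the intercept adjustment as it is usually stated: subtract ln[((1 - tau) / tau) * (ybar / (1 - ybar))] from the estimated intercept, where tau is the fraction of events in the population and ybar is the fraction of events in the sample. It assumes tau is known from an outside source, and the function names are my own, not those of the Stata software mentioned above. Slope coefficients are left unchanged by this correction.

```python
import math

def prior_correction_offset(tau, ybar):
    """Amount to subtract from the intercept fitted on the case-control sample.

    tau  : fraction of events in the population (assumed known externally)
    ybar : fraction of events in the case-control sample
    """
    return math.log(((1 - tau) / tau) * (ybar / (1 - ybar)))

def prior_corrected_intercept(b0, tau, ybar):
    """Corrected intercept; the slope coefficients need no adjustment."""
    return b0 - prior_correction_offset(tau, ybar)

def corrected_probability(p_sample, tau, ybar):
    """Map a probability predicted by the sample-fitted logit to the population scale."""
    logit = math.log(p_sample / (1 - p_sample))
    z = logit - prior_correction_offset(tau, ybar)
    return 1 / (1 + math.exp(-z))

# Example: the sample is 25% events, the population only 2%.
p = corrected_probability(0.25, tau=0.02, ybar=0.25)
```

With these numbers, a sample-scale prediction equal to the sample event rate (0.25) maps back to the population event rate (0.02), which is a handy sanity check on the direction of the adjustment.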
