Good psychics have a knack for getting their audience to reveal, unwittingly, information that can be turned around and used in a prediction. Statisticians and data scientists fall prey to a related phenomenon, leakage, when they allow into their models highly predictive features that would be unavailable at prediction time. In one noted example, a highly successful machine learning model was developed to classify skin moles as cancerous or benign. The model’s success was too good to be true – most of the malignant moles in the training images had small rulers placed there by the examining dermatologist.
Only moles judged by the doctor as likely to be malignant got the rulers, so the model was able to use the presence or absence of a ruler (and, in effect, the dermatologist’s opinion) as a highly effective predictor. A similar example was a model developed to predict whether a patient had prostate cancer. An extremely effective predictor was PROSSURG, an indicator variable for whether a patient had had prostate surgery; since that surgery follows a cancer diagnosis, the variable effectively encoded the very outcome being predicted. It seems an obvious error when looked at in isolation, but big data machine learning models often have hundreds or thousands of predictors with cryptic names. Leaks like this can be hard to identify a priori.
Time Series Forecasting is Prone to Leaks
John Elder, Founder and Chairman of Elder Research (which owns Statistics.com), described several examples of leaks in financial models in his Top 10 Data Mining Mistakes. A couple:
- A neural net model prepared for a Chicago bank was 95% accurate in predicting interest rate changes. Unfortunately, a version of the output variable was included as a predictor. (This experience produced a useful window on the workings of the neural net, which lost 5% of the predictor’s information as the information moved through the network.)
- Elder was once called on to assess a proprietary model that was 70% accurate in forecasting directional market movements. Elder turned detective, and a painstaking investigation revealed that the model was (inadvertently) nothing more than a simple 3-day average centered on today. Forecasting tomorrow’s price is easy if tomorrow’s price is available to the model. (Omitting the first of the three days would have achieved 100% directional accuracy!)
Elder’s colleague, Mike Thurber, recounted a project he supervised whose goal was to forecast sales of a particular retail category. One of the features was year-over-year growth of category share, which leaked future information into the model just as the 3-day average Elder discovered did. Time series forecasting is particularly prone to this type of error. A moving average window centered at t and including information about t-1 and t+1 (and often further backward and forward) is a common way to smooth out random noise and visualize time series. If it is used for prediction, however, only information at t and prior, i.e. a trailing window, should be included.
Centered window at time t: (x[t-1] + x[t] + x[t+1]) / 3
Trailing window at time t: (x[t-2] + x[t-1] + x[t]) / 3
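A minimal sketch of the difference, using pandas; the price series and dates here are made up purely for illustration:

```python
import pandas as pd

# Hypothetical daily price series, for illustration only
prices = pd.Series([100, 102, 101, 105, 107, 106, 110],
                   index=pd.date_range("2023-01-02", periods=7, freq="D"))

# Centered 3-day window: uses t-1, t, and t+1 -- fine for smoothing and
# visualization, but it leaks tomorrow's value if used as a forecasting feature.
centered = prices.rolling(window=3, center=True).mean()

# Trailing 3-day window: uses only t-2, t-1, and t -- safe for prediction,
# since every value in the window is already known at time t.
trailing = prices.rolling(window=3).mean()

print(pd.DataFrame({"price": prices, "centered": centered, "trailing": trailing}))
```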
Kaggle’s Leak Lessons
Kaggle’s general contest documentation has a section on leakage that describes several broad types of leak problems. We’ve discussed two of them: leaks from the future into time series forecasts, and inclusion of highly correlated predictors that will not be available at the time of prediction. Some other examples include:
- Leaks from holdout (or test) data into training data. There are two types of holdout data (the nomenclature is not always consistent). One type is used iteratively to tune and select model parameters, and to select top-performing models. The other is set aside and never seen during model development; it is used to get an unbiased estimate of “real-world” model performance. Test data can end up in the training data through mistakes during data preparation, contaminating the overall process (see the sketch below). If there are leaks from the first type of holdout data, protections against overfitting during model tuning are weakened. If there are leaks from the second type, estimates of deployed model performance will be overly optimistic.
- Failure to remove obfuscated replacements for forbidden variables. For example, an app maker may create a predictor variable with random noise injected into a user’s location data, to allow sharing of data with partners or consultants without revealing exact locations. Internal data scientists attempting to predict future location might leave such a variable in the model, not knowing exactly what it is, and end up with unrealistically accurate location predictions.
Other scenarios include the use of data that will not be present in the model’s operational environment, and distortion from samples outside the scope of the model’s intended use.
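As a sketch of how test data can leak into training through data preparation, consider a common mistake: fitting a preprocessing step (here a scaler) on the full dataset before splitting. The remedy is to split first and fit preprocessing only on the training fold, for example via a scikit-learn Pipeline. The dataset below is synthetic and the setup is illustrative, not a prescription:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Leaky: the scaler "sees" the test rows, so information about the holdout
# set seeps into training via the estimated means and variances.
X_scaled = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_scaled, y, random_state=0)
leaky_model = LogisticRegression().fit(X_tr, y_tr)

# Safer: split first, then let a Pipeline fit the scaler on training data only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clean_model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_tr, y_tr)
print("held-out accuracy:", clean_model.score(X_te, y_te))
```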
Interestingly, Kaggle also illustrates how not to approach leaks in a normal project setting, in two respects:
- In the Kaggle environment, the data, once released, are set in stone. There is no opportunity for the inquiring data scientist to go back and query the domain experts, and perhaps modify the available data, as ought to happen in a normal project environment.
- Leaks in the Kaggle world are good things, to be prized by those who find them and exploited for advantage. Kaggle reported one case of a team that achieved top ranking in a social network link prediction contest by uncovering and scraping the actual network the anonymized data came from, and thereby learning the ground truth.
Feature Engineering and Selection
As machine learning moves increasingly towards automation, with data flowing into models and model output flowing into deployed decisions, practitioners may open the door to leaks by using automated methods for feature engineering and selection. The rapid growth of deep learning, and its capacity to learn features, may accelerate this trend. The most widespread use of deep learning now is for image and text processing, where extremely low-level features (pixels and words) require a lot of consolidation and transformation before they become meaningful. However, it is spreading into models for standard tabular data, where some of the predictors may include unstructured text.
For example, one group of researchers developed a deep learning model to take raw medical record data (messy, with many features, but still in tabular form) and predict mortality, readmission and diagnosis (read more here). The data had over 100,000 features and included inconsistent medical coding and unstructured text. A deep learning model proved more accurate than a standard manual feature engineering approach, and considerably less labor intensive.
Some automated techniques can check for possible leaks. If database entries are time-stamped, a rule-based system can exclude predictor information that arrived after the time at which the prediction would be made. Predictors that are highly correlated with the target can be flagged for review. (They would need special treatment, in any case, if linear models are used, due to the problem of multicollinearity.)
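A minimal sketch of such checks in pandas, assuming a hypothetical table with a column giving the prediction time, per-feature timestamp columns, and a numeric target; the column conventions and the 0.95 threshold are illustrative assumptions, not a standard:

```python
import pandas as pd

def screen_for_leaks(df, target_col, prediction_time_col, timestamp_cols,
                     corr_threshold=0.95):
    """Flag features recorded after prediction time, or suspiciously correlated
    with the target. `timestamp_cols` maps each feature column to the column
    recording when that feature's value was captured (hypothetical convention)."""
    flags = {}

    # Rule 1: any field whose value arrives after the moment the prediction
    # would be made cannot legitimately be used as a predictor.
    for col, ts_col in timestamp_cols.items():
        late = (df[ts_col] > df[prediction_time_col]).mean()
        if late > 0:
            flags[col] = f"{late:.0%} of rows recorded after prediction time"

    # Rule 2: flag numeric predictors almost perfectly correlated with the
    # target for human review -- they may encode the outcome itself.
    numeric = df.select_dtypes("number").drop(columns=[target_col], errors="ignore")
    corrs = numeric.corrwith(df[target_col]).abs()
    for col, r in corrs[corrs > corr_threshold].items():
        flags[col] = f"|correlation with target| = {r:.2f}"

    return flags
```

Anything flagged still needs a human look; the point of such rules is triage, not a substitute for judgment.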
The problem of preventing leakage may represent a last redoubt of old-fashioned, non-automated human review of data. You can’t count on a standard preprocessing function to handle all leaks: an illegal predictor that has predictive value, but whose correlation with the target is not extreme, might be missed.
Conclusion
Julia Child, in her classic French cookbook, warned against overreliance on food processors when baking. “You have to get your hands into the dough,” she said. In a similar way, automated methods can only go so far in protecting against leakage. To go further, you need to get your hands into the data, understand what the predictors are, and, sometimes, play detective.