A common problem in machine learning is the “rare case” situation. In many classification problems, the class of interest (fraud, purchase by a web visitor, death of a patient) is rare enough that a data sample may not have enough instances to generate useful predictions. One way to deal with this problem is, in essence, data fabrication: attaching synthetic class labels to cases where we don’t know the actual label.
This approach, called label propagation or label spreading, may sound bogus; however, it has worked well in test cases. The idea is as follows:
1. Start with a small number of cases where the label (class) is known. (We have only a small number of 1’s, the class of interest, because the class occurs only rarely.)
2. Identify additional cases where the label is unknown but the case is very similar to the known 1’s in other respects.
3. Label those cases as 1’s.
4. Combine the real 1’s with the artificial 1’s and use them as the training data for a model.
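The steps above can be sketched in a few lines of Python. The data, distance threshold, and variable names below are all hypothetical illustrations, not part of any particular library’s method; similarity here is simply Euclidean distance to the nearest known 1.

```python
import math

# Hypothetical toy data: two numeric predictors per case.
# Step 1: a small set of cases with known label 1 (the rare class).
known_ones = [(0.9, 1.1), (1.0, 0.9)]

# Cases whose labels are unknown.
unlabeled = [(1.05, 1.0), (0.95, 1.05), (5.0, 5.2), (4.8, 5.1)]

def dist(a, b):
    """Euclidean distance between two 2-dimensional cases."""
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)

# Step 2: find unlabeled cases very similar to some known 1
# (here: within an arbitrary, hypothetical distance threshold).
THRESHOLD = 0.5
synthetic_ones = [
    x for x in unlabeled
    if min(dist(x, k) for k in known_ones) < THRESHOLD
]

# Steps 3-4: treat those cases as 1's and pool them with the
# real 1's to enlarge the training data for the rare class.
training_ones = known_ones + synthetic_ones
```

In this toy example the two cases near (1, 1) acquire synthetic 1 labels, while the two cases near (5, 5) remain unlabeled.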
Granted, a source of error is introduced: we are only guessing at the synthetic labels. Simulations, though, have shown that this can be more than offset by the reduction in another type of error: small-sample error. Label spreading takes advantage of the information contained in the predictor values for the similar cases. It is analogous to imputing missing data, which also allows us to use more of the data in fitting a model.
Label spreading is typically applied to graph data; i.e., data that describe the links (edges) between cases (nodes) in a network. Nodes with unknown labels can take the label that predominates in the nearby network community.
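A minimal sketch of this graph version, assuming a hypothetical small network stored as an adjacency list: each unlabeled node repeatedly takes the majority label among its already-labeled neighbors. (Production label-spreading algorithms propagate soft label scores over the graph; this hard majority vote is a simplification.)

```python
from collections import Counter

# Hypothetical network: node -> list of neighboring nodes (edges).
edges = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e"], "e": ["d", "f"], "f": ["e"],
}

# Known labels for a few nodes; all other nodes are unlabeled.
labels = {"a": 1, "b": 1, "e": 0}

def propagate(edges, labels, rounds=3):
    """Give each unlabeled node the majority label among its
    labeled neighbors, repeating for a few rounds so labels
    spread outward through the network."""
    labels = dict(labels)  # don't mutate the caller's dict
    for _ in range(rounds):
        updates = {}
        for node, nbrs in edges.items():
            if node in labels:
                continue
            votes = Counter(labels[n] for n in nbrs if n in labels)
            if votes:
                updates[node] = votes.most_common(1)[0][0]
        if not updates:
            break  # nothing left to label
        labels.update(updates)
    return labels

result = propagate(edges, labels)
```

Here node "c", sitting in the community of two 1-labeled nodes, inherits label 1, while "d" and "f", nearer the 0-labeled node, inherit 0.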