There are more than three dozen curses in Harry Potter. Data scientists have only one – the “curse of dimensionality.” Dimensionality is the number of predictor (input) variables in a model, and the “curse” refers to the problems that arise when a model includes too many of them.
[Image: Old curses are awakened in The Mummy (1932)]
As variables are added, the data space becomes increasingly sparse, and classification and prediction models fail because the available data are insufficient to support a useful model across so many variables. The difficulty grows exponentially with each added variable.
One way to think of this intuitively is to consider the location of an object on a chessboard. The board has two dimensions and 8 × 8 = 64 squares, or possible locations. If you expand the chessboard to a cube, you increase the number of dimensions by 50%, from 2 to 3, but the number of possible locations increases by a factor of 8, from 64 to 512 (8 × 8 × 8). In all models, the addition of numerous predictors increases the chance of fitting the model to noise. It also increases the chance of correlated predictors, which can create instability in a model (e.g., in a tree model, one sample might yield a split on variable X, while another sample from the same data may yield a split on the correlated variable Y).
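To put numbers on the chessboard intuition, here is a small Python loop; the 8 divisions per axis and the 1,000 data points are illustrative assumptions, chosen only to show how quickly the space empties out as dimensions are added.

```python
# Illustrative only: 8 divisions per axis (as on a chessboard) and 1,000 data
# points are assumed values, not figures from any particular dataset.
cells_per_axis = 8
n_points = 1_000

for d in range(1, 7):
    n_cells = cells_per_axis ** d                  # total "squares" in d dimensions
    print(f"{d} dimensions: {n_cells:>9,} cells, "
          f"{n_points / n_cells:.4f} points per cell on average")
```

With just six dimensions, the same 1,000 points are spread across more than 260,000 cells – almost every cell is empty.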
For methods that rely on statistical distance (e.g. nearest neighbors, clustering), the proliferation of variables means that nothing is close to anything else anymore—too much noise has been added and patterns and structure are no longer discernible. The curse is particularly acute in Big Data applications, including genomics, where, for example, an analysis might have to deal with values for thousands of different genes. Text mining, where features may represent the presence, absence or position of words, is another application where dimensionality must be accounted for.
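A quick simulation (my own illustration, not taken from any particular study) makes the “nothing is close to anything else” effect concrete: as the number of dimensions grows, the nearest and farthest points from any given point end up at nearly the same distance.

```python
import numpy as np

# Illustrative simulation: 500 random points in a unit hypercube of increasing dimension.
rng = np.random.default_rng(0)
for d in [2, 10, 100, 1_000]:
    X = rng.random((500, d))
    dists = np.linalg.norm(X - X[0], axis=1)[1:]   # distances from the first point to the rest
    print(f"d={d:>5}: nearest/farthest distance ratio = {dists.min() / dists.max():.3f}")
```

As d grows, the ratio approaches 1, so “nearest neighbor” loses much of its meaning.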
One of the key steps in data mining, therefore, is finding ways to reduce dimensionality with minimal sacrifice of accuracy. There are several approaches.
1. “Manual” exploration/reduction of the data. For example, in marketing applications customer location as indicated by zipcode may be a useful feature, but there are 42,000+ zipcodes in the U.S., and a dummy indicator variable would be needed for each. Some preliminary correlation analysis can indicate whether zipcode is correlated with the target outcome; if not, it could be ignored. Or, groups of zipcodes may be clustered. A simple first step, sketched below, is to use just the first few digits of the zipcode, say the first 3, to make the geographical units larger. Or, domain knowledge about how different areas behave might reduce the needed variables drastically. For example, you could map zipcodes to Claritas’s market segments, thereby reducing 42,000+ variables to just 68 segments. (Claritas is a consumer-behavior consulting firm, and this mapping is available commercially.)
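As a concrete sketch of the “first three digits” idea, here is a minimal pandas example; the DataFrame and the column name zipcode are hypothetical stand-ins for a real marketing dataset.

```python
import pandas as pd

# Hypothetical data: zipcodes stored as 5-character strings.
df = pd.DataFrame({"zipcode": ["02138", "02139", "94305", "94306", "10027"]})

# Collapse the full zipcode into its first three digits, giving far fewer,
# larger geographical units before creating dummy variables.
df["zip3"] = df["zipcode"].str[:3]
dummies = pd.get_dummies(df["zip3"], prefix="zip3")
print(dummies.head())
```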
2. Principal Components Analysis (PCA). Intended for numerical variables, PCA is a statistical method that consolidates a large number of variables into a smaller set of principal components, where each component is a linear combination of the original variables. PCA proceeds step by step. Its first step is to assign weights to all the variables in such a way that the resulting linear combination accounts for as much of the overall variance in the variables as possible; this is the first component. The next step is to assign new weights to the variables so as to account for as much of the remaining variance as possible; this is the second component, and it is uncorrelated with the first. The process continues with additional components, each explaining a smaller and smaller portion of the variance, until there are as many components as variables and 100% of the variance is explained. Often, dozens or hundreds of variables can be replaced with a handful of principal components that explain nearly all the variance. By selecting a limited number of components to build a model (as sketched below), the problems associated with dimensionality are mitigated. In text mining there is an analogous process – latent semantic indexing – that replaces the (large) matrix of terms in documents with a smaller matrix that often can be mapped to meaningful concepts.
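Here is a minimal PCA sketch using scikit-learn on simulated data; the data-generating step, with 3 underlying drivers behind 30 observed variables, is purely illustrative. The variables are standardized first so that no single variable dominates the variance calculation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Simulated data: 30 observed variables driven by only 3 underlying factors.
rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 30)) + 0.1 * rng.normal(size=(200, 30))

X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.90)          # keep enough components to explain 90% of the variance
scores = pca.fit_transform(X_std)     # component scores replace the original variables

print(scores.shape)                               # only a handful of columns remain
print(pca.explained_variance_ratio_.cumsum())     # cumulative variance explained
```

A model would then be built on the retained component scores rather than on the original 30 correlated variables.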
3. Deep Learning. Deep learning methods employ complex neural nets to “learn” high-level features. For example, the raw features in an image are pixel values, of which there may be tens or hundreds of thousands. A deep learning network distills those pixel values into a much smaller set of high-level features: it might first learn to identify edges and shapes, then more complex features like faces. Deep learning has had the greatest impact in image and voice recognition, but it was recently applied to medical records, where the goal was to build a predictive model of hospital readmission. The task involved 46 billion data points, incorporating numerical variables, text variables, and sequential observations for more than 200,000 patients. Rather than spending a lot of time on “feature engineering” (reduction of dimensionality), the analysts simply dumped the data into a deep learning network. Without being told which variables mattered, the network learned which were instrumental in making successful predictions. The network avoided the problems of noise and overfitting through its tiered process of developing high-level features – i.e., automated dimension reduction. Read this article for more about the medical record task, and this article for more detail about how deep learning works.
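As a hedged illustration – not the method used in the medical records study described above – the sketch below uses a small autoencoder, one common deep-learning route to automated dimension reduction: the network is trained to reconstruct its inputs through a narrow “bottleneck” layer, and that bottleneck becomes the reduced set of high-level features. The layer sizes and the random data are assumptions chosen only for illustration; the sketch requires TensorFlow/Keras.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Illustrative data: 1,000 records with 100 raw features (values are random here).
X = np.random.rand(1000, 100).astype("float32")

# Encoder: compress 100 raw features down to an 8-dimensional bottleneck.
encoder = keras.Sequential([
    keras.Input(shape=(100,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(8, activation="relu"),
])
# Decoder: try to reconstruct the original 100 features from the bottleneck.
decoder = keras.Sequential([
    layers.Dense(32, activation="relu"),
    layers.Dense(100, activation="linear"),
])

autoencoder = keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=5, batch_size=64, verbose=0)  # learn to reconstruct the inputs

X_reduced = encoder.predict(X)   # 1,000 x 8 matrix of learned high-level features
print(X_reduced.shape)
```

The 8 learned features could then feed a downstream prediction model in place of the 100 raw inputs.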