Terminology in Data Analytics
As data continue to grow at a faster rate than either population or economic activity, so do organizations’ efforts to deal with the data deluge and use it to capture value. The methods used to analyze data are growing too, which creates an expanding set of terms (including some buzzwords) used to describe them.
This is a field in flux, and different people may have different conceptions of what terms mean. Comments on this page and its “definitions” are welcome. Since many of these terms are subsets of others, or overlapping, the clearest approach is to start with the more specific terms and move to the more general.
Predictive modeling:
Used when you seek to predict a target (outcome) variable (feature) using records (cases) where the target is known. Statistical or machine learning models are “trained” using the known data, then applied to data where the outcome variable is unknown. Includes both classification (where the outcome is categorical, often binary) and prediction (where the outcome is continuous). Wikipedia
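As a minimal sketch of the train-then-apply idea, the example below classifies new records by majority vote among their nearest known records (k-nearest neighbor, chosen here only for brevity). The customer records, the feature meanings, and the function name knn_predict are all hypothetical.

```python
from collections import Counter
import math

def knn_predict(train_X, train_y, new_x, k=3):
    """Classify new_x by majority vote among its k nearest training records."""
    # For k-nearest neighbor, "training" amounts to keeping the known records.
    distances = sorted(
        (math.dist(x, new_x), label) for x, label in zip(train_X, train_y)
    )
    top_labels = [label for _, label in distances[:k]]
    return Counter(top_labels).most_common(1)[0][0]

# Hypothetical records where the target is known:
# (monthly visits, past purchases) -> did the customer buy?
train_X = [(1, 0), (2, 1), (3, 0), (10, 5), (12, 6), (11, 4)]
train_y = ["no", "no", "no", "yes", "yes", "yes"]

# Apply the model to records where the outcome is unknown.
print(knn_predict(train_X, train_y, (2, 0)))    # prints "no"
print(knn_predict(train_X, train_y, (11, 5)))   # prints "yes"
```

Because the outcome here is categorical, this is a classification example; a continuous outcome would call for a method that returns a number, such as averaging the neighbors' values.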
Predictive analytics:
Basically the same thing as predictive modeling, but less specific and technical. Often used to describe the field more generally. Wikipedia
Supervised Learning:
Another synonym for predictive modeling.
Unsupervised Learning:
Data mining methods not involving the prediction of an outcome based on training models on data where the outcome is known. Unsupervised methods include cluster analysis, association rules, outlier detection, dimension reduction and more.
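To illustrate the absence of a known outcome, here is a rough sketch of cluster analysis using a tiny one-dimensional k-means. The data values, the starting centers, and the function name kmeans_1d are assumptions made for illustration; practical work would normally rely on an established library.

```python
def kmeans_1d(values, centers, iterations=10):
    """Tiny k-means for one-dimensional data with caller-supplied starting centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical measurements with no outcome labels attached.
values = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centers, clusters = kmeans_1d(values, centers=[0.0, 5.0])
print(centers)   # prints [1.5, 10.0]
print(clusters)  # prints [[1.0, 1.5, 2.0], [9.0, 10.0, 11.0]]
```

No record carries a "correct" group; the two groups emerge only from how close the values are to one another.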
Business intelligence:
An older term that has come to mean the extraction of useful information from business data without benefit of statistical or machine learning models (e.g. dashboards to visualize key indicators, queries to databases). Wikipedia
Data mining:
This term means different things in different contexts. To a lay person, it might mean the automated searching of large databases. To an analyst, it may refer to the collection of statistical and machine learning methods used with those databases (predictive modeling, clustering, recommendation systems, …). Wikipedia
Text mining:
The application of data mining methods to text. Wikipedia
Text analytics:
A broader term that includes the preparation of text for mining, the mining itself, and specialized applications such as sentiment analysis. Preparing text for analysis involves automated parsing and interpretation (natural language processing), then quantification (e.g. identifying the presence or absence of key terms). Wikipedia
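The quantification step can be sketched as follows: each document becomes a row of 0/1 indicators for the presence or absence of key terms. The sample documents, the key terms, and the function name term_presence are hypothetical, and the regular-expression tokenizer is only a crude stand-in for real natural language processing.

```python
import re

def term_presence(document, key_terms):
    """Return a 0/1 indicator per key term: 1 if the term appears in the document."""
    # Crude parsing: lowercase the text and split it into word tokens.
    tokens = set(re.findall(r"[a-z']+", document.lower()))
    return {term: int(term in tokens) for term in key_terms}

# Hypothetical customer comments and key terms.
docs = [
    "The delivery was late and the box was damaged.",
    "Great product, fast delivery!",
]
key_terms = ["delivery", "late", "great"]

matrix = [term_presence(d, key_terms) for d in docs]
print(matrix[0])  # prints {'delivery': 1, 'late': 1, 'great': 0}
print(matrix[1])  # prints {'delivery': 1, 'late': 0, 'great': 1}
```

Once text is in this numeric form, the data mining methods described elsewhere on this page can be applied to it.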
Data science, data analytics, analytics:
Cover all of the concepts described on this page. “Data science” is often used to define a (new) profession whose practitioners are capable in many or all of the above areas; one often sees the term “data scientist” in job postings. While “statistician” typically implies familiarity with research methods and the collection of data for studies, “data scientist” implies the ability to work with large volumes of data generated not by studies, but by ongoing organizational processes. Due to the complexity of dealing with large datasets and data flows, most of the day-to-day work of a data scientist lies in data pipeline challenges – storing relevant data, getting it into appropriate form for analysis, and managing the real-time implementation of models. “Data analytics” and “analytics,” by contrast, are general terms used to describe the field and a comprehensive collection of associated methods. Wikipedia references here and here. All these terms tend to be used for the application of analytic methods to data that large organizations generate or have available (“big data”).
Statistics:
Covers nearly all of the above methods, and also carries the mantle of a well-established profession dating back to the mid-1800s. Although statisticians work on “big data” problems, the field of statistics has traditionally concentrated on focused research studies (e.g. drug trials).
Big Data:
Refers to the huge amounts of data that large businesses and other organizations collect and store. It might be unstructured text (streams of tweets) or structured quantitative data (transaction databases). In the 1990s organizations began making efforts to extract useful information from this data. The challenges of big data lie mainly in the pre-analysis stage, in the IT domain.
Our friend, Gregory Piatetsky-Shapiro, Editor and Analytics/Data Mining Expert at KDnuggets conducted the following poll:
What was the largest dataset you analyzed / data mined?
This poll received 1108 votes, about 10% fewer than in 2016, but still a large enough sample. The results again show a surprising stability, fitting a pattern that had already emerged in 2012: a majority of data scientists and analysts work with data in the gigabyte range, and a small but notable segment works with web-scale data of over 100 petabytes.
Note that the poll asks about the largest ever dataset, so a typical dataset analyzed is expected to be significantly smaller.
Highlights:
- Gigabytes still rule: The majority of answers (56% in 2018, 57% in 2016, 56% in 2015, 54% in 2014, 53% in 2013) are in the gigabyte range. As in each year since 2012, the overall median response was between 11 and 100 GB (which comfortably fits on one laptop).
- Consistency: The shape of the curve each year is almost the same. In 2018 there were fewer responses in the under-10 MB range and more in the 1-10 GB range, but not significantly so.
- Petabyte Big Data Scientists still stand apart: There is a small but significant gap, with almost no answers in the 1-10 PB range, which separates analysts who work with terabyte-size commercial data warehouses from those who work with 100+ petabyte web-scale data stores. See, for example, a recent story on Uber’s current 100 PB data warehouse.
- Academic researchers on par with Government, Industry: The estimated median for academic researchers is 90GB, on par with Government (60 GB) and Industry analysts (50 GB). The estimated median answer has increased a little for all segments in 2018.
Fig. 1: KDnuggets Poll: Largest Dataset Analyzed, 2014-2018
2018 data is shown as a column, to stand apart from lines for previous years.
This poll also asked about employment type, and the breakdown was:
- Company or Self-Employed, 62% (was also 62% in 2016)
- Student, 17% (was 20% in 2016)
- Academia/University, 13% (was 10% in 2016)
- Government/non-profit, 4.8% (was 5.1% in 2016)
- Other, 3.2% (was 2.4% in 2016)
Machine Learning:
Analytics in which computers “learn” from data to produce models or rules that apply to those data and to other similar data. Predictive modeling techniques such as neural nets, classification and regression trees (decision trees), naive Bayes, k-nearest neighbor, and support vector machines are generally included. One characteristic of these techniques is that the form of the resulting model is flexible, and adapts to the data. Statistical modeling methods that have highly structured model forms, such as linear regression, logistic regression and discriminant analysis are generally not considered part of machine learning. Unsupervised learning methods such as association rules and clustering are also considered part of machine learning.
Network Analytics:
The science of describing and, especially, visualizing the connections among objects. The objects might be human, biological or physical. Graphical representation is a crucial part of the process; Wayne Zachary’s classic 1977 network diagram of a karate club reveals the centrality of two individuals, and presages the club’s subsequent split into two clubs. The key elements are the nodes (circles, representing individuals) and edges or links (lines representing connections).
(Wayne Zachary. An information flow model for conflict and fission in small groups, Journal of Anthropological Research, 33(4):452–473, 1977; cited in D. Easley & J. Kleinberg, Networks, Crowds, and Markets: Reasoning about a Highly Connected World, Cambridge University Press, 2010, available also at http://www.cs.cornell.edu/home/kleinber/networks-book/ where this figure is drawn from.)
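As a small sketch of nodes, edges, and centrality, the code below computes degree centrality (the share of other nodes each node is directly connected to) from an edge list. The seven-node network is invented for illustration and is not Zachary's karate club data; the function name degree_centrality is likewise an assumption.

```python
from collections import defaultdict

def degree_centrality(edges):
    """Map each node to the fraction of other nodes it is directly linked to."""
    neighbors = defaultdict(set)
    for a, b in edges:
        # Edges are undirected: record the link in both directions.
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(nbrs) / (n - 1) for node, nbrs in neighbors.items()}

# Hypothetical network: two small groups, around A and E, bridged by D.
edges = [("A", "B"), ("A", "C"), ("A", "D"),
         ("E", "F"), ("E", "G"), ("D", "E")]

centrality = degree_centrality(edges)
most_central = sorted(node for node, c in centrality.items()
                      if c == max(centrality.values()))
print(most_central)  # prints ['A', 'E']
```

Degree is only one of several centrality measures; it is used here because it can be read directly off the edge list.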
Social Network Analytics:
Network analytics applied to connections among humans. Recently it has also come to encompass the analysis of web sites and internet services like Facebook.
Web Analytics:
Statistical or machine learning methods applied to web data such as page views, hits, clicks, and conversions (sales), generally with a view to learning what web presentations are most effective in achieving the organizational goal (usually sales). This goal might be to sell products and services on a site, to serve and sell advertising space, to purchase advertising on other sites, or to collect contact information. Key challenges in web analytics are the volume and constant flow of data, and the navigational complexity and sometimes lengthy gaps that precede users’ relevant web decisions.
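One of the simplest web analytics calculations is the conversion rate for each web presentation, sketched below. The session records, the variant names, and the function name conversion_rates are hypothetical; real web data would arrive as a continuous, much larger flow.

```python
from collections import Counter

def conversion_rates(sessions):
    """sessions: iterable of (page_variant, converted) pairs."""
    visits, conversions = Counter(), Counter()
    for variant, converted in sessions:
        visits[variant] += 1
        conversions[variant] += int(converted)
    # Conversion rate = conversions divided by visits, per variant.
    return {variant: conversions[variant] / visits[variant] for variant in visits}

# Hypothetical sessions: which page layout was shown, and whether a sale followed.
sessions = [
    ("layout_1", True), ("layout_1", False), ("layout_1", False), ("layout_1", False),
    ("layout_2", True), ("layout_2", True), ("layout_2", False), ("layout_2", False),
]

rates = conversion_rates(sessions)
print(rates)  # prints {'layout_1': 0.25, 'layout_2': 0.5}
```

With so few sessions the difference between the two layouts means little; the sketch shows only the bookkeeping, not a test of significance.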
Uplift or Persuasion Modeling:
A combination of treatment comparisons (e.g. send a sales solicitation to one group, send nothing to another group) and predictive modeling to determine which cases or subjects respond (e.g. purchase or not) to which treatments. Here are the steps, in conceptual terms, for a typical uplift model:
1. Conduct an A-B test, where B is the control
2. Combine all the data from both groups
3. Divide the data into a number of segments, each having roughly similar numbers of subjects who got treatment A and control. Tree-based methods are typically used for this.
4. The segments should be drawn such that, within each segment, the response to treatment A is substantially different from the response to control.
5. Considering each segment as the modeling unit, build a model that predicts whether a subject will respond positively to treatment A.
The challenge (and the novelty) is to recognize that the model cannot operate on individual cases: each subject gets either treatment A or control, but not both, so the “uplift” from treatment A compared to control cannot be observed at the individual level, only at the group level. Hence the need for the segments described in steps 3 and 4.
Note: Traditional A-B testing would stop at step 1, and apply the more successful treatment to all subjects.
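The group-level nature of uplift can be sketched as follows: within each segment, uplift is the response rate under treatment A minus the response rate under control. In this sketch the segments are assumed to be given by a hypothetical customer attribute ("new" or "loyal") rather than found by the tree-based methods mentioned in step 3, and all records and function names are invented for illustration.

```python
from collections import defaultdict

def uplift_by_segment(records):
    """records: iterable of (segment, group, responded), group is 'A' or 'control'."""
    # Per segment and group, keep [number of responders, number of subjects].
    counts = defaultdict(lambda: {"A": [0, 0], "control": [0, 0]})
    for segment, group, responded in records:
        counts[segment][group][0] += int(responded)
        counts[segment][group][1] += 1
    uplift = {}
    for segment, groups in counts.items():
        (resp_a, n_a), (resp_c, n_c) = groups["A"], groups["control"]
        if n_a == 0 or n_c == 0:
            continue  # uplift is undefined without both groups in the segment
        uplift[segment] = resp_a / n_a - resp_c / n_c
    return uplift

def make_records(segment, group, responders, total):
    """Helper that builds hypothetical records with a given number of responders."""
    return [(segment, group, i < responders) for i in range(total)]

records = (
    make_records("new", "A", 3, 4) + make_records("new", "control", 1, 4)
    + make_records("loyal", "A", 2, 4) + make_records("loyal", "control", 2, 4)
)

uplift = uplift_by_segment(records)
print(uplift)  # prints {'new': 0.5, 'loyal': 0.0}
```

In this invented data the treatment moves "new" customers but not "loyal" ones, so an uplift approach would target only the first segment, whereas applying the overall A-B winner would treat everyone.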
Reference: “Real World Uplift Modelling with Significance-Based Uplift Trees,” by N. J. Radcliffe and P. D. Surry, available as a white paper at stochasticsolutions.com/