Blog

BENFORD’S LAW

Benford’s law describes an expected distribution of the first digit in many naturally-occurring datasets.
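As a rough illustration, the expected first-digit proportions under Benford's law are log10(1 + 1/d) for d = 1 to 9. The sketch below compares those proportions with the leading digits of the powers of 2, a sequence commonly used to demonstrate the law. The function names are hypothetical, chosen for this example only.

```python
import math

def benford_expected():
    """Expected proportion of each leading digit 1-9 under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit(x):
    """First significant digit of a nonzero number."""
    # Scientific notation puts the first significant digit at position 0.
    return int(f"{abs(x):.15e}"[0])

def observed_proportions(values):
    """Proportion of nonzero values starting with each digit 1-9."""
    counts = {d: 0 for d in range(1, 10)}
    n = 0
    for v in values:
        if v == 0:
            continue  # zero has no leading significant digit
        counts[leading_digit(v)] += 1
        n += 1
    return {d: c / n for d, c in counts.items()}

# Illustrative data (an assumption of this sketch): powers of 2.
values = [2 ** k for k in range(1, 201)]
expected = benford_expected()
observed = observed_proportions(values)
for d in range(1, 10):
    print(d, round(expected[d], 3), round(observed[d], 3))
```

Large gaps between the two columns would suggest that a dataset does not follow the Benford pattern.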

CONTINGENCY TABLES

Contingency tables are tables of counts of events or things, cross-tabulated by row and column.
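A minimal sketch of building such a table in Python, assuming each record is a (row category, column category) pair. The data and the function name `contingency_table` are made up for illustration.

```python
from collections import Counter

def contingency_table(pairs):
    """Cross-tabulate (row, column) pairs into a nested dict of counts."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    # Counter returns 0 for combinations that never occur.
    return {r: {c: counts[(r, c)] for c in cols} for r in rows}

# Hypothetical records: (group, outcome)
records = [
    ("placebo", "improved"),
    ("placebo", "not improved"),
    ("placebo", "not improved"),
    ("treatment", "improved"),
    ("treatment", "improved"),
    ("treatment", "not improved"),
]

table = contingency_table(records)
for row, cells in table.items():
    print(row, cells)
```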

HYPERPARAMETER

The term hyperparameter is used in machine learning, where it refers, loosely speaking, to user-set parameters, and in Bayesian statistics, where it refers to parameters of the prior distribution.

Why sample? A while ago, sample would not have been a candidate for Word of the Week, its meaning being pretty obvious to anyone with a passing acquaintance with statistics. I select it today because of some output I saw from a decision tree in Python.

SPLINE

The easiest way to think of a spline is to first think of linear regression – a single linear relationship between an outcome variable and various predictor variables.

NLP

To some, NLP = natural language processing, a form of text analytics arising from the field of computational linguistics.

OVERFIT

As applied to statistical models, “overfit” means the model fits the data at hand too closely, capturing noise rather than signal. For example, the complex polynomial curve in the figure fits the data with no error, but you would not want to rely on it to predict accurately for new data.

Kidney Donations – Optimizing Them

Did you know that there is a country in which kidneys can be legally bought and sold? That country is Iran, and there are actually two models for payment. In one part of the country, donors receive a flat fee, and in the other part of the country prices are negotiated. This point was…

Quotes about Data Science

“The goal is to turn data into information, and information into insight.” – Carly Fiorina, former CEO, Hewlett-Packard Co. Speech given at Oracle OpenWorld

“Data is the new science. Big data holds the answers.” – Pat Gelsinger, CEO, EMC, Big Bets on Big Data, Forbes

“Hiding within those mounds of data is knowledge that could change the life…

Week #18 – n

In statistics, “n” denotes the size of a dataset, typically a sample, in terms of the number of observations or records.

Week #17 – Corpus

A corpus is a body of documents to be used in a text mining task. Some corpuses are standard public collections of documents that are commonly used to benchmark and tune new text mining algorithms. More typically, the corpus is a body of documents for a specific text mining task – e.g. a set of…

Historical Spotlight: Eugenics – journey to the dark side at the dawn of statistics

April 27 marks the 80th anniversary of the death of Karl Pearson, who contributed to statistics the correlation coefficient, principal components, the (increasingly-maligned) p-value, and much more. Pearson was one of a trio of founding fathers of modern statistics, the others being Francis Galton and Ronald Fisher. Galton, Pearson and Fisher were deeply involved with…

Week #2 – Causal Modeling

Causal modeling is aimed at advancing reasonable hypotheses about underlying causal relationships between the dependent and independent variables. Consider, for example, a simple linear model: y = a0 + a1 x1 + a2 x2 + e, where y is the dependent variable, x1 and x2 are independent variables, e is the contribution of all other…

Week #10 – Arm

In an experiment, an arm is a treatment protocol – for example, drug A, or placebo. In medical trials, an arm corresponds to a patient group receiving a specified therapy. The term is also relevant for bandit algorithms for web testing, where an arm consists of a specific web treatment or offer. Assigning a web…

Week #9 – Sparse Matrix

A sparse matrix typically refers to a very large matrix of variables (features) and records (cases) in which most cells are empty or 0-valued. An example might be a binary matrix used to power web searches – columns representing search terms and rows representing searches, and cells populated by 1’s or 0’s (presence or absence…

Week #8 – Homonyms department: Sample

We continue our effort to shed light on potentially confusing usage of terms in the different data science communities. In statistics, a sample is a collection of observations or records. It is often, but not always, randomly drawn. In matrix form, the rows are records (subjects), columns are variables, and cell values are the values…

Week #7 – Homonyms department: Normalization

With this entry, we inaugurate a new effort to shed light on potentially confusing usage of terms in the different data science communities. In statistics and machine learning, normalization of variables means to subtract the mean and divide by the standard deviation. When there are multiple variables in an analysis, normalization (also called standardization) removes…
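The "subtract the mean and divide by the standard deviation" definition can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the excerpt does not say which version is meant.

```python
import statistics

def standardize(values):
    """Subtract the mean and divide by the standard deviation."""
    mean = statistics.mean(values)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical variable measured in arbitrary units
heights = [150.0, 160.0, 170.0, 180.0, 190.0]
z_scores = standardize(heights)
print(z_scores)
```

After this transformation the variable has mean 0 and standard deviation 1, whatever its original units.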

Week #43 – HDFS

HDFS is the Hadoop Distributed File System. It is designed to accommodate parallel processing on clusters of commodity hardware, and to be fault tolerant.

Week #42 – Kruskal – Wallis Test

The Kruskal-Wallis test is a nonparametric test for finding if three or more independent samples come from populations having the same distribution. It is a nonparametric version of ANOVA.
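For readers who want to see the mechanics, here is a minimal sketch of the Kruskal-Wallis H statistic in plain Python. It pools the samples, ranks them (tied values share their average rank), and compares the rank sums across groups. The function name is hypothetical, and the sketch omits the tie correction factor and the p-value calculation that statistical packages normally include.

```python
def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic (no tie correction) for 2+ samples."""
    pooled = sorted(v for g in groups for v in g)
    n_total = len(pooled)

    # Assign each distinct value the average of the ranks it occupies.
    rank_of = {}
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        i = j

    # Sum over groups of (rank sum squared) / (group size).
    rank_term = sum(
        sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups
    )
    return 12 / (n_total * (n_total + 1)) * rank_term - 3 * (n_total + 1)

# Hypothetical measurements from three independent samples
a = [1, 2, 3]
b = [4, 5, 6]
c = [7, 8, 9]
print(kruskal_wallis_h(a, b, c))
```

Under the null hypothesis, H is approximately chi-square distributed with (number of groups - 1) degrees of freedom, so a large H suggests that the samples do not all come from the same distribution.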

Week #32 – False Discovery Rate

A “discovery” is a hypothesis test that yields a statistically significant result. The false discovery rate is the proportion of discoveries that are, in reality, not significant (a Type-I error). The true false discovery rate is not known, since the true state of nature is not known (if it were, there would be no need for statistical inference).
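Because the true state of nature is unknown in practice, the proportion of false discoveries can only be computed directly in a simulation or other setting where the truth is known. The sketch below assumes such a setting: each test has a flag for whether it was declared significant and a flag for whether the null hypothesis was actually true. The names and data are hypothetical.

```python
def false_discovery_proportion(rejected, null_is_true):
    """Share of discoveries (rejections) for which the null was true."""
    discoveries = sum(1 for r in rejected if r)
    if discoveries == 0:
        return 0.0  # convention: no discoveries means no false ones
    false_discoveries = sum(
        1 for r, h0 in zip(rejected, null_is_true) if r and h0
    )
    return false_discoveries / discoveries

# Hypothetical results of five tests where the truth is known
rejected = [True, True, True, False, True]
null_is_true = [False, True, False, True, False]
print(false_discovery_proportion(rejected, null_is_true))
```

Here four tests were declared significant and one of those was a Type-I error, so the proportion is one in four.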