Word of the Week Archives - Page 4 of 6 - Statistics.com: Data Science, Analytics & Statistics Courses

SAMPLE

Why sample? A while ago, sample would not have been a candidate for Word of the Week, its meaning being pretty obvious to anyone with a passing acquaintance with statistics. I select it today because of some output I saw from a decision tree in Python.

SPLINE

The easiest way to think of a spline is to first think of linear regression – a single linear relationship between an outcome variable and various predictor variables.

NLP

To some, NLP = natural language processing, a form of text analytics arising from the field of computational linguistics.

OVERFIT

As applied to statistical models, “overfit” means the model fits the training data too closely, capturing noise rather than signal. For example, a complex polynomial curve can pass through every data point with no error, but you would not want to rely on it to predict accurately for new data.
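A minimal Python sketch of this idea, under stated assumptions: the data are made up (a true signal y = 2x plus hand-picked noise), the "complex polynomial" is the degree-5 curve that interpolates all six points, and the function names are illustrative only. It compares that curve with a simple least-squares line on the training points and on one new point.

```python
# Hypothetical training data: true signal y = 2x plus small hand-picked noise.
xs = [0, 1, 2, 3, 4, 5]
noise = [0.5, -0.4, 0.6, -0.5, 0.4, -0.6]
ys = [2 * x + e for x, e in zip(xs, noise)]


def poly_predict(x):
    """Evaluate the degree-5 polynomial through all six points (Lagrange form)."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)
        total += term
    return total


def fit_line(xs, ys):
    """Ordinary least-squares fit of a straight line; returns (intercept, slope)."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


intercept, slope = fit_line(xs, ys)


def line_predict(x):
    return intercept + slope * x


# Largest error on the data used for fitting.
poly_train_error = max(abs(poly_predict(x) - y) for x, y in zip(xs, ys))
line_train_error = max(abs(line_predict(x) - y) for x, y in zip(xs, ys))

# Error at a new point, measured against the noise-free signal 2 * 6 = 12.
x_new, y_signal = 6, 12
poly_new_error = abs(poly_predict(x_new) - y_signal)
line_new_error = abs(line_predict(x_new) - y_signal)

print("training error:", poly_train_error, "(polynomial)", line_train_error, "(line)")
print("new-point error:", poly_new_error, "(polynomial)", line_new_error, "(line)")
```

The polynomial wins on the training data because it reproduces the noise exactly, and that same noise is what throws off its prediction at the new point.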

Week #18 – n

In statistics, “n” denotes the size of a dataset, typically a sample, in terms of the number of observations or records.

Week #17 – Corpus

A corpus is a body of documents to be used in a text mining task. Some corpora are standard public collections of documents that are commonly used to benchmark and tune new text mining algorithms. More typically, the corpus is a body of documents for a specific text mining task – e.g. a set of ...

Week #2 – Causal Modeling

Causal modeling is aimed at advancing reasonable hypotheses about underlying causal relationships between the dependent and independent variables. Consider for example a simple linear model: y = a0 + a1 x1 + a2 x2 + e where y is the dependent variable, x1 and x2 are independent variables, e is the contribution of all other ...

Week #10 – Arm

In an experiment, an arm is a treatment protocol – for example, drug A, or placebo. In medical trials, an arm corresponds to a patient group receiving a specified therapy. The term is also relevant for bandit algorithms for web testing, where an arm consists of a specific web treatment or offer. Assigning a web ...

Week #9 – Sparse Matrix

A sparse matrix typically refers to a very large matrix of variables (features) and records (cases) in which most cells are empty or 0-valued. An example might be a binary matrix used to power web searches – columns representing search terms and rows representing searches, and cells populated by 1’s or 0’s (presence or absence ...
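To make the search example concrete, here is a small Python sketch. The three searches and the `cell` helper are hypothetical, and storing only the coordinates of the 1-valued cells is just one simple sparse representation, not a claim about how any search engine works.

```python
# Hypothetical searches: each row is one search, given as its list of terms.
searches = [
    ["python", "decision", "tree"],
    ["spline", "regression"],
    ["python", "regression"],
]

# Columns are the distinct search terms, in a fixed order.
terms = sorted({term for search in searches for term in search})
column = {term: j for j, term in enumerate(terms)}

# Dense form: every cell is stored, including all the zeros.
dense = [[1 if term in search else 0 for term in terms] for search in searches]

# Sparse form: store only the (row, column) coordinates of the 1-valued cells.
nonzero = {(i, column[term]) for i, search in enumerate(searches) for term in search}


def cell(i, j):
    """Look up a cell; anything not stored is implicitly 0."""
    return 1 if (i, j) in nonzero else 0


total_cells = len(searches) * len(terms)
print("cells in dense form:", total_cells)
print("cells stored in sparse form:", len(nonzero))
```

With millions of searches and terms, nearly all cells would be 0, so the saving from storing only the nonzero coordinates becomes much larger than in this toy case.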

Week #8 – Homonyms department: Sample

We continue our effort to shed light on potentially confusing usage of terms in the different data science communities. In statistics, a sample is a collection of observations or records. It is often, but not always, randomly drawn. In matrix form, the rows are records (subjects), columns are variables, and cell values are the values ...

Week #7 – Homonyms department: Normalization

With this entry, we inaugurate a new effort to shed light on potentially confusing usage of terms in the different data science communities. In statistics and machine learning, normalization of variables means to subtract the mean and divide by the standard deviation. When there are multiple variables in an analysis, normalization (also called standardization) removes ...
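The subtract-the-mean, divide-by-the-standard-deviation step can be sketched in Python as follows. The `standardize` function and the two variables are made up for illustration, and the sample standard deviation is assumed (some tools use the population version instead).

```python
import statistics


def standardize(values):
    """Subtract the mean and divide by the (sample) standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


# Hypothetical variables measured on very different scales.
incomes = [30000, 45000, 52000, 61000, 120000]
ages = [23, 35, 41, 52, 64]

z_incomes = standardize(incomes)
z_ages = standardize(ages)

print([round(z, 2) for z in z_incomes])
print([round(z, 2) for z in z_ages])
```

After this step each variable has mean 0 and standard deviation 1, whatever units it started in.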

Week #43 – HDFS

HDFS is the Hadoop Distributed File System. It is designed to accommodate parallel processing on clusters of commodity hardware, and to be fault tolerant.

Week #42 – Kruskal-Wallis Test

The Kruskal-Wallis test is a nonparametric test of whether three or more independent samples come from populations having the same distribution. It is a nonparametric version of one-way ANOVA, based on the ranks of the observations rather than their raw values.
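As a rough illustration, the Kruskal-Wallis H statistic can be computed from ranks in a few lines of Python. This is a sketch under assumptions: the three groups are made-up data, the function names are illustrative, and the tie correction factor is omitted (the example data contain no ties).

```python
def average_ranks(values):
    """Rank values from 1..N, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # average 1-based rank of the tied run
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def kruskal_wallis_h(groups):
    """H = 12 / (N (N + 1)) * sum(R_j^2 / n_j) - 3 (N + 1), without tie correction."""
    pooled = [v for group in groups for v in group]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    total = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    return 12 / (n_total * (n_total + 1)) * total - 3 * (n_total + 1)


# Hypothetical measurements from three independent samples.
group_a = [6.1, 5.8, 7.2, 6.5]
group_b = [7.9, 8.4, 7.5, 8.8]
group_c = [5.1, 4.9, 5.6, 6.0]

h = kruskal_wallis_h([group_a, group_b, group_c])
print("H =", round(h, 3))
```

For samples that are not too small, H is usually compared with a chi-square distribution with (number of groups − 1) degrees of freedom; a large H suggests the samples do not share one distribution.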

Week #32 – False Discovery Rate

A “discovery” is a hypothesis test that yields a statistically significant result. The false discovery rate is the proportion of discoveries that are, in reality, false positives: cases where the null hypothesis is true but was rejected (a Type I error). The true false discovery rate is not known, since the true state of nature is not known (if it were, there would be no need for statistical inference).
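The definition can be illustrated with a toy Python example in which the true state of nature is known only because the data are made up. The p-values, the truth labels, and the 0.05 threshold are all assumptions chosen for illustration.

```python
# Hypothetical test results: (null_hypothesis_is_true, p_value).
# In real work the first element is unknown; it is known here only by construction.
tests = [
    (True, 0.03), (True, 0.40), (True, 0.72), (True, 0.04), (True, 0.55),
    (True, 0.91), (False, 0.001), (False, 0.02), (False, 0.20), (False, 0.008),
]

alpha = 0.05  # assumed significance threshold

# A discovery is any test declared statistically significant.
discoveries = [(null_true, p) for null_true, p in tests if p < alpha]

# A false discovery is a discovery for which the null hypothesis was actually true.
false_discoveries = [d for d in discoveries if d[0]]

proportion_false = len(false_discoveries) / len(discoveries)
print(len(discoveries), "discoveries, of which", len(false_discoveries), "are false")
print("proportion of false discoveries:", proportion_false)
```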

Category Archives: Word of the Week

Week #23 – Netflix Contest

Week #20 – R

Week #16 – Moving Average

Week #15 – Interaction term

Week #14 – Naive forecast

Week #9 – Overdispersion