Anyone with internet access these days has their eyes on two constellations of data – the spread of the coronavirus, and the resulting collapse of the financial markets. Following the 13% one-day drop of the stock market a week ago, The Wall Street Journal forecast a quarterly GDP drop of as much as 10%…
Coronavirus: To Test or Not to Test
In recent years, under the influence of statisticians, the medical profession has dialed back on screening tests. With relatively rare conditions, widespread testing yields many false positives and doctor visits, whose collective cost can outweigh the benefits. Coronavirus advice follows this line – testing is limited to the truly ill…
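The false-positive arithmetic behind that advice is easy to see with Bayes' rule. A quick sketch, using hypothetical numbers (1% prevalence, 95% sensitivity and specificity):

```python
# Positive predictive value (PPV) of a screening test for a rare condition.
# All figures below are hypothetical, chosen only for illustration.
prevalence = 0.01    # 1% of the tested population actually has the condition
sensitivity = 0.95   # P(test positive | has condition)
specificity = 0.95   # P(test negative | no condition)

true_pos = prevalence * sensitivity
false_pos = (1 - prevalence) * (1 - specificity)
ppv = true_pos / (true_pos + false_pos)

# Even with a 95%-accurate test, most positives come from the healthy majority.
print(f"Share of positives that are true positives: {ppv:.1%}")  # about 16%
```

With these numbers, roughly five in six positive results are false alarms – the "collective cost" the excerpt refers to.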
Mar 16: Statistics in Practice
In this week’s Brief, we look at combining models. Our course spotlight is April 17 – May 1: Maximum Likelihood Estimation (MLE) You’ve probably seen lots of references to MLE in other contexts – this quick 2-week course (only $299) is your chance to study it on its own. See you in class! – Peter…
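As a taste of what MLE involves, here is a minimal sketch on simulated Bernoulli data: the log-likelihood is evaluated over a grid of candidate values of p, and the maximizer is the MLE (which, for Bernoulli data, coincides with the sample proportion). All data here are simulated.

```python
import numpy as np

# Maximum likelihood estimation of a Bernoulli proportion p.
# Log-likelihood for k successes in n trials: k*log(p) + (n-k)*log(1-p).
rng = np.random.default_rng(0)
data = rng.binomial(1, 0.3, size=1000)  # simulated trials, true p = 0.3

grid = np.linspace(0.001, 0.999, 999)   # candidate values of p
k, n = data.sum(), data.size
log_lik = k * np.log(grid) + (n - k) * np.log(1 - grid)
p_hat = grid[np.argmax(log_lik)]        # the MLE

# For Bernoulli data the MLE equals the sample proportion.
print(f"MLE of p: {p_hat:.3f}, sample proportion: {data.mean():.3f}")
```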
Regularized Model
In building statistical and machine learning models, regularization is the addition of a penalty term to the model’s loss function that discourages large predictor coefficients, and thereby the complex models that would otherwise overfit the data. An example is ridge regression.
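The shrinkage effect of the penalty can be seen directly by comparing coefficients with and without it. A minimal sketch on simulated data, using scikit-learn's `Ridge` (the `alpha` parameter is the penalty weight):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Regularization in action: the penalized fit has smaller coefficients
# than ordinary least squares. All data here are simulated.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(size=50)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)  # alpha = penalty weight

# The overall coefficient magnitude shrinks under the penalty.
print("OLS norm:  ", np.linalg.norm(ols.coef_))
print("Ridge norm:", np.linalg.norm(ridge.coef_))
```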
Ensemble Learning
In his book, The Wisdom of Crowds, James Surowiecki recounts how Francis Galton, a prominent statistician from the 19th century, attended an event at a country fair in England where the object was to guess the weight of an ox. Individual contestants were relatively well informed on the subject (the audience was farmers), but their…
Mar 9: Statistics in Practice
In this week’s Brief, we look at ways to determine optimal sample size. Our course spotlight is April 10 – May 8: Sample Size and Power Determination See you in class! – Peter Bruce Founder, Author, and Senior Scientist Big Sample, Unreliable Result The 1948 Kinsey report on male sexual behavior in the U.S. yielded…
Ridge Regression
Ridge regression is a method of penalizing the size of the coefficients in a regression model, shrinking them toward zero to yield a simpler, less overfit model than would be produced by ordinary least squares. The term “ridge” was applied by Arthur Hoerl in 1970, who saw similarities to the ridges of quadratic response functions…
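The ridge estimate has a closed form, (XᵀX + λI)⁻¹Xᵀy, which makes the shrinkage easy to demonstrate: as the penalty λ grows, the coefficient vector shrinks toward zero. A sketch on simulated data (in practice you would standardize the predictors first):

```python
import numpy as np

# Closed-form ridge solution on simulated data, showing coefficients
# shrink as the penalty lambda grows.
rng = np.random.default_rng(7)
X = rng.normal(size=(40, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=40)

def ridge_coefs(X, y, lam):
    """Solve (X'X + lam*I) beta = X'y for the ridge coefficients."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

for lam in (0.0, 1.0, 10.0, 100.0):
    beta = ridge_coefs(X, y, lam)
    print(f"lambda={lam:6.1f}  ||beta|| = {np.linalg.norm(beta):.3f}")
```

At λ = 0 this reproduces ordinary least squares; unlike the lasso, ridge shrinks coefficients without setting any exactly to zero.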
Big Sample, Unreliable Result
Which would you rather have? A large sample that is biased, or a representative sample that is small? The American Statistical Association committee that reviewed the 1948 Kinsey report on male sexual behavior, based on interviews with over 5000 men, left no doubt of their preference for the latter. The statisticians – William Cochran, Frederick…
Mar 2: Statistics in Practice
In this week’s Brief, we look at hierarchical and mixed models. Our course spotlight is April 10 – May 8: Generalized Linear Models April 24 – May 22: Mixed and Hierarchical Linear Models See you in class! – Peter Bruce Founder, Author, and Senior Scientist Mixed Model – When to Use? In 1861, the British…
Problem of the Week: Notify or Don’t Notify?
Our problem of the week is an ethical dilemma, posed by the New England Journal of Medicine to its readers 10 days ago. Volunteers contributed DNA samples to investigators building a genetic database for study, on condition the data would be deidentified and kept confidential and that they themselves would not learn results. Should participants…
Factor
The term “factor” has different meanings in statistics that can be confusing because they conflict. In statistical programming languages like R, factor acts as an adjective, used synonymously with categorical – a factor variable is the same thing as a categorical variable. These factor variables have levels, which are the same thing as categories…
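The same idea exists outside R: pandas' `Categorical` dtype is the direct analog, with R's "levels" called "categories" and each value stored internally as an integer code. A small sketch with made-up data:

```python
import pandas as pd

# pandas' Categorical dtype is the analog of R's factor.
colors = pd.Series(["red", "blue", "red", "green"], dtype="category")

# The "levels" (pandas: categories), sorted alphabetically by default:
print(list(colors.cat.categories))  # ['blue', 'green', 'red']

# Under the hood, each value is an integer code into those categories:
print(list(colors.cat.codes))       # [2, 0, 2, 1]
```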
Mixed Models – When to Use
Companies now have a lot of data on their customers at an individual level. Suppose you are tasked with forecasting customer spending at a grocery chain, and you want to understand how customer attributes, local economic factors, and store issues affect customer spending. You could design your study with hierarchical and mixed linear modeling methods…
Feb 24: Statistics in Practice
In this week’s Brief, we look at social categories, and the role that statistics and data science have played in social engineering – 100 years ago and today. Our course spotlight is April 3 – May 1: Categorical Data Analysis See you in class! – Peter Bruce Founder, Author, and Senior Scientist The Normal Share…
The Normal Share of Paupers
In 2009, China began regional pilot programs that extended credit scoring to a broader purpose – scoring a person’s “social credit.” 100 years earlier, at the height of the eugenics craze, the famous statistician Francis Galton undertook to repurpose statistical concepts in service of social engineering. The starting point was a social survey of London…
Purity
In classification, purity measures the extent to which a group of records share the same class. It is also termed class purity or homogeneity, and sometimes impurity is measured instead. The measure Gini impurity, for example, is calculated for a two-class case as p(1-p), where p = the proportion of records belonging to class 1.
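The definition above translates to a one-line function. Note that p(1-p) is maximized at p = 0.5 (a 50/50 mix, the least pure group) and is zero when the group is entirely one class:

```python
# Gini impurity for the two-class case, as defined above: p * (1 - p),
# where p is the proportion of records belonging to class 1.
def gini_impurity(p: float) -> float:
    return p * (1 - p)

print(gini_impurity(0.5))  # 0.25, maximum impurity (evenly mixed group)
print(gini_impurity(0.9))  # 0.09, a fairly pure group
print(gini_impurity(1.0))  # 0.0, perfectly pure
```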
Predictor P-Values in Predictive Modeling
Not So Useful
Predictor p-values in linear models are a guide to the statistical significance of a predictor coefficient value – they measure the probability that a model fit to randomly shuffled data could have produced a coefficient as great as the fitted value. They are of limited utility in predictive modeling applications for various reasons: Software typically…
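The shuffling interpretation can be made concrete with a permutation version of the p-value: refit the model after shuffling the response, and count how often the shuffled coefficient matches or exceeds the fitted one in magnitude. A sketch on simulated data:

```python
import numpy as np

# Permutation p-value for a regression slope, on simulated data.
rng = np.random.default_rng(3)
x = rng.normal(size=100)
y = 0.4 * x + rng.normal(size=100)  # true slope 0.4, plus noise

def slope(x, y):
    """Fitted slope from a simple least-squares line."""
    return np.polyfit(x, y, 1)[0]

observed = slope(x, y)
# Shuffle y repeatedly to break any real x-y relationship.
shuffled = [slope(x, rng.permutation(y)) for _ in range(1000)]
p_value = np.mean([abs(s) >= abs(observed) for s in shuffled])
print(f"fitted slope: {observed:.3f}, permutation p-value: {p_value:.3f}")
```

In a predictive-modeling workflow, though, the point of the excerpt stands: a small p-value says little about whether the predictor improves out-of-sample predictions.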
Uplift and Persuasion
The goal of any direct mail campaign, or other messaging effort, is to persuade somebody to do something. In the business world, it is usually to buy something. In the political world, it is usually to vote for someone (or, if you think you know who they will vote for, to encourage them to actually vote)…
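In its simplest form, uplift is the difference in response rates between a treated (messaged) group and a held-out control group – the extra responses attributable to the message itself. A sketch with made-up campaign numbers:

```python
# Uplift as a difference in response rates. All counts are hypothetical.
treated_responses, treated_n = 180, 1000   # received the mailer
control_responses, control_n = 120, 1000   # randomly held out

treated_rate = treated_responses / treated_n   # 18%
control_rate = control_responses / control_n   # 12%
uplift = treated_rate - control_rate

print(f"uplift: {uplift:.1%}")  # 6.0% of recipients persuaded by the mailer
```

Uplift *modeling* goes a step further, predicting this difference at the individual level so the campaign can target only the persuadable.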
Feb 17: Statistics in Practice
Last week we looked at several metrics for assessing the performance of classification models – accuracy, receiver operating characteristics (ROC) curves, and lift (gains). In this week’s Brief we move beyond lift and cover uplift. Our course spotlight again is: Feb 28 – Mar 27: Persuasion Analytics and Targeting See you in class!…
ROC, Lift and Gains Curves
There are various metrics for assessing the performance of a classification model, and it matters which one you use. The simplest is accuracy – the proportion of cases correctly classified. In classification tasks where the outcome of interest (“1”) is rare, though, accuracy as a metric falls short – high accuracy can be achieved simply by classifying everything as “0”…
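The pitfall is easy to demonstrate with simulated labels: a "model" that predicts "0" for everyone scores about 95% accuracy on data with a 5% positive rate, yet its ROC AUC of 0.5 reveals it has no ability to rank the rare "1"s at all:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Why accuracy misleads with rare outcomes. Labels are simulated.
rng = np.random.default_rng(5)
y_true = (rng.random(1000) < 0.05).astype(int)  # ~5% positive class

naive_pred = np.zeros(1000, dtype=int)          # always predict "0"
naive_score = np.zeros(1000)                    # no ranking information

print("accuracy:", accuracy_score(y_true, naive_pred))   # ~0.95, looks great
print("ROC AUC: ", roc_auc_score(y_true, naive_score))   # 0.5, no discrimination
```

This is why rare-outcome problems lean on ROC, lift, and gains curves, which measure how well the model *ranks* cases, rather than raw accuracy.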
Feb 10: Statistics in Practice
Tomorrow is the New Hampshire political primary in the US, and this week’s Brief looks at the statistical concept of lift. Our spotlight is on: Feb 28 – Mar 27: Persuasion Analytics and Targeting See you in class! – Peter Bruce, Founder Lift and Persuasion What do you do with late-paying and defaulting customers?…