
Snowball Sampling

Snowball sampling is a form of sampling in which the selection of new sample subjects is suggested by prior subjects.  From a statistical perspective, the method is prone to high variance and bias compared to random sampling. The characteristics of the initial subject may propagate through the sample to some degree, and a sample derived by starting with subject 1 may differ from one produced by starting with subject 2, even if both resulting samples contain subjects 1 and 2.  However, …
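As a rough illustration (not from the post), here is a minimal Python sketch of snowball sampling on a toy friendship network; the network, seeds, and sample size are all made up.

```python
import random

# Toy friendship network: each subject lists the contacts they would refer.
network = {
    1: [2, 3], 2: [1, 4], 3: [1, 5], 4: [2, 6],
    5: [3, 7], 6: [4, 8], 7: [5, 8], 8: [6, 7],
}

def snowball_sample(seed, size):
    """Grow a sample by following referrals from prior subjects."""
    sample, frontier = [seed], [seed]
    while len(sample) < size and frontier:
        subject = frontier.pop(0)
        # Each subject "suggests" their contacts as new subjects.
        referrals = [p for p in network[subject] if p not in sample]
        random.shuffle(referrals)
        for p in referrals:
            if len(sample) < size:
                sample.append(p)
                frontier.append(p)
    return sample

# Samples grown from different seeds can differ, even when each ends up
# containing both starting subjects.
print(snowball_sample(seed=1, size=5))
print(snowball_sample(seed=2, size=5))
```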

The Statistics of Christmas Trees

A researcher shakes a sprig from a Christmas tree, and counts the number of needles that fall. He then repeats the process for countless other sprigs. The sprigs are from a variety of species, and the goal is to determine which species do the best job of retaining their needles. Falling needles are a definite… Continue reading “The Statistics of Christmas Trees”

The False Alarm Conundrum

False alarms are one of the most poorly understood problems in applied statistics and biostatistics. The fundamental problem is the wide application of a statistical or diagnostic test in search of something that is relatively rare. Consider the Apple Watch’s new feature that detects atrial fibrillation (afib). Among people with irregular heartbeats, Apple claims a… Continue reading “The False Alarm Conundrum”
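To see why rarity creates false alarms, consider the standard base-rate arithmetic with purely illustrative numbers (the post’s actual figures are cut off above): suppose 1% of wearers have afib, and the detector flags 99% of true cases while mistakenly flagging 5% of healthy wearers. Then

$$P(\text{afib} \mid \text{alarm}) = \frac{0.99 \times 0.01}{0.99 \times 0.01 + 0.05 \times 0.99} = \frac{0.0099}{0.0594} \approx 0.17,$$

so under these assumed numbers, roughly five out of six alarms would be false.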

Conditional Probability Word of the Week

QUESTION:  The rate of residential insurance fraud is 10% (one out of ten claims is fraudulent).  A consultant has proposed a machine learning system to review claims and classify them as fraud or no-fraud.  The system is 90% effective in detecting the fraudulent claims, but only 80% effective in correctly classifying the non-fraud claims (it mistakenly labels one in five as “fraud”).  If the system classifies a claim as fraudulent, what is the probability that it really is fraudulent?
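One way to check the answer is straightforward Bayes’ rule arithmetic; here is a quick Python sketch using only the numbers stated in the question.

```python
# Bayes' rule for the fraud-detection question (numbers from the problem).
p_fraud = 0.10               # base rate: 1 in 10 claims is fraudulent
p_flag_given_fraud = 0.90    # system detects 90% of fraudulent claims
p_flag_given_ok = 0.20       # system mislabels 1 in 5 legitimate claims

# Total probability that a claim gets flagged as fraud.
p_flag = p_flag_given_fraud * p_fraud + p_flag_given_ok * (1 - p_fraud)

# P(fraud | flagged) = P(flagged | fraud) * P(fraud) / P(flagged)
p_fraud_given_flag = p_flag_given_fraud * p_fraud / p_flag
print(round(p_fraud_given_flag, 3))  # 0.333: only 1 in 3 flagged claims is fraud
```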

Instructor Spotlight – David Kleinbaum

David Kleinbaum developed several courses for Statistics.com, including Survival Analysis, Epidemiologic Statistics, and Designing Valid Statistical Studies. David retired a little over a year ago from Emory University, where he was a popular and effective teacher with the ability to distill and explain difficult statistical concepts with clarity and concision. David had a flair for… Continue reading “Instructor Spotlight – David Kleinbaum”

Book Review: Active-Epi

ActivEpi Web, by David Kleinbaum, is the text used in two Statistics.com courses (Epidemiology Statistics and Designing Valid Studies), but it is really a rich multimedia web-based presentation of epidemiological statistics, serving as a unique textbook format for an introductory course in the subject. It is historically noteworthy – it dates back to… Continue reading “Book Review: Active-Epi”

Churn

Churn is a term used in marketing to refer to the departure, over time, of customers.  Subscribers to a service may remain for a long time (the ideal customer), or they may leave for a variety of reasons: switching to a competitor, dissatisfaction, an expired credit card, a move, etc.  A customer who leaves, for whatever reason, “churns.”

Eli Whitney and Google

This weekend (12/8/2018) marked the 253rd anniversary of the birth of Eli Whitney, inventor of the cotton gin. And 20 years ago, Google received its first big infusions of capital from, among others, Jeff Bezos, the founder of Amazon. Both Eli Whitney and the Google founders instigated economic revolutions, but also illustrate polar opposite approaches… Continue reading “Eli Whitney and Google”

How Google Determines Which Ads you See

A classic machine learning task is to predict something’s class, usually binary – pictures as dogs or cats, insurance claims as fraud or not, etc. Often the goal is not a final classification, but an estimate of the probability of belonging to a class (propensity), so the cases can be ranked. A good example of… Continue reading “How Google Determines Which Ads you See”
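As a generic illustration of propensity scoring (a sketch, not Google’s actual system), the following fits a classifier to made-up data and ranks cases by their estimated probability of belonging to class 1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up features and labels, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)

model = LogisticRegression().fit(X, y)
propensity = model.predict_proba(X)[:, 1]   # estimated P(class = 1) per case

# Rank cases from highest to lowest propensity (e.g., which ads to show).
ranked = np.argsort(propensity)[::-1]
print(ranked[:5], propensity[ranked[:5]].round(3))
```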

Job Spotlight: Data Scientist

Data science is one of a host of similar terms. Artificial intelligence has been around since the 1960s, and data mining for at least a couple of decades. Machine learning came out of the computer science community, and analytics, data analytics, and predictive analytics came out of the statistics and OR communities. Among all of… Continue reading “Job Spotlight: Data Scientist”

ROC Curve

The Receiver Operating Characteristic (ROC) curve is a measure of how well a statistical or machine learning model (or a medical diagnostic procedure) can distinguish between two classes, say 1’s and 0’s.  For example, fraudulent insurance claims (1’s) and non-fraudulent ones (0’s). It plots two quantities: the true positive rate (sensitivity) on the vertical axis against the false positive rate (1 − specificity) on the horizontal axis.
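A small sketch of how those two quantities are computed, using scikit-learn on made-up labels and scores:

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Made-up example: y_true marks fraudulent claims (1) vs legitimate (0);
# y_score is a model's estimated probability of fraud for each claim.
y_true  = [0, 0, 1, 1, 0, 1, 0, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.5, 0.9]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(list(zip(fpr.round(2), tpr.round(2))))  # (FPR, TPR) points on the curve
print(roc_auc_score(y_true, y_score))         # area under the ROC curve
```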

Triage and Artificial Intelligence

Predictim is a service that scans potential babysitters’ social media and other online activity and issues them a score that parents can use to select babysitters. Jeff Chester, the executive director of the Center for Digital Democracy, commented: “There’s a mad rush to seize the power of AI to make all kinds of decisions without…” Continue reading “Triage and Artificial Intelligence”

Deming’s Funnel Problem

W. Edwards Deming’s funnel problem is one of statistics’ greatest hits. Deming was a noted statistician who took the statistical process control methods of Shewhart and expanded them into a holistic approach to manufacturing quality. Initially, his ideas were coolly received in the US, and he ended up implementing them first in Japan. The success… Continue reading “Deming’s Funnel Problem”
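The funnel experiment’s punchline is that well-intentioned adjustment (“tampering”) makes results worse. Here is a minimal simulation sketch with illustrative parameters: Rule 1 leaves the funnel aimed at the target, while Rule 2 re-aims after each drop by the negative of the last error, roughly doubling the variance.

```python
import random

random.seed(1)

def simulate(adjust, n=10_000, target=0.0):
    """Drop n marbles; each lands at the aim point plus random noise."""
    aim, sq_errors = target, 0.0
    for _ in range(n):
        drop = aim + random.gauss(0, 1)   # where this drop lands
        sq_errors += (drop - target) ** 2
        if adjust:
            aim -= drop - target          # Rule 2: compensate for the miss
    return sq_errors / n                  # variance of the misses

print(simulate(adjust=False))  # Rule 1: variance ~ 1
print(simulate(adjust=True))   # Rule 2: variance ~ 2, worse than no adjustment
```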

Industry Spotlight: the Auto Industry

The auto industry serves as a perfect exemplar of three key eras of statistics and data science in service of industry. Total Quality Management (TQM): first in Japan, and later in the U.S., the auto industry became an enthusiastic adherent of the Total Quality Management philosophy. Fundamentally, TQM is all about using data to improve… Continue reading “Industry Spotlight: the Auto Industry”

Analytics Professionals – Must They Be Good Communicators?

Most job ads in the technical arena list communication among the sought-after skills; it consistently outranks many programming and analytical skills. Is it for real, or is it just thrown in there by the HR Department on general principle? The founder of a leading analytics services firm told me that good biostatisticians in the pharmaceutical… Continue reading “Analytics Professionals – Must They Be Good Communicators?”

Random Selection for Harvard Admission?

An ethical algorithm… Ethics in algorithms is a popular topic now. Usually the conversation centers on the possible unintentional bias or harm that a statistical or machine learning algorithm could do when it is used to select, score, rate, or rank people. For example, a credit scoring algorithm may include a predictor that is… Continue reading “Random Selection for Harvard Admission?”

GE Regresses to the Mean

Thirty years ago, GE became the brightest star in the firmament of statistical ideas in business when it adopted Six Sigma methods of quality improvement. Those methods had been introduced by Motorola, but Jack Welch’s embrace of the same methods at GE, a diverse manufacturing powerhouse, helped bring stardom to industrial statisticians. Last week, GE’s… Continue reading “GE Regresses to the Mean”