The 2006 Netflix Contest has come to convey the idea of crowdsourced predictive modeling, in which a dataset and a prediction challenge are made publicly available. Individuals and teams then compete to develop the best performing model.
Blog
Week #20 – R
This week’s word is actually a letter. R is a programming language and software environment for statistical computing. It is an open-source implementation of the S language from Bell Labs, which was also the basis for the commercial S-PLUS program.
Be Smarter Than Your Devices: Learn About Big Data
When Apple CEO Tim Cook finally unveiled his company’s new Apple Watch in a widely-publicized rollout earlier this month, most of the press coverage centered on its cost ($349 to start) and whether it would be as popular among consumers as the iPod or iMac. Nitin Indurkhya saw things differently. “I think the most significant…
Week #16 – Moving Average
In time series forecasting, a moving average is a smoothing method in which the forecast for time t is the average value for the w periods ending with time t-1.
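The definition above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `moving_average_forecast` and the sample `sales` values are made up for the example, and time is counted from 0.

```python
def moving_average_forecast(series, w):
    """Forecast for each time t as the mean of the w values ending at t-1."""
    if w < 1:
        raise ValueError("w must be at least 1")
    forecasts = []
    # The first forecast is possible at t = w, once w past values exist.
    # The last one (t = len(series)) is for the next, not yet observed, period.
    for t in range(w, len(series) + 1):
        window = series[t - w:t]  # the w periods ending at time t-1
        forecasts.append(sum(window) / w)
    return forecasts


# Hypothetical data, for illustration only
sales = [10, 12, 14, 16, 18]
forecasts = moving_average_forecast(sales, w=3)
print(forecasts)  # [12.0, 14.0, 16.0]
```

A larger w gives a smoother forecast that reacts more slowly to recent changes.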
Week #15 – Interaction term
In regression models, an interaction term captures the joint effect of two variables that is not captured in the modeling of the two terms individually.
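A small numeric sketch may make this concrete. It assumes a linear model of the form y = b0 + b1*x1 + b2*x2 + b3*x1*x2, where b3*x1*x2 is the interaction term. The coefficients are invented for the example and are not estimated from any data.

```python
def predict(x1, x2, b0=1.0, b1=2.0, b2=3.0, b3=0.5):
    """Linear model with an interaction term b3 * x1 * x2 (made-up coefficients)."""
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2


# Effect of raising x1 by one unit, at two different levels of x2.
# Without the interaction term both effects would equal b1.
effect_at_x2_0 = predict(1, 0) - predict(0, 0)
effect_at_x2_4 = predict(1, 4) - predict(0, 4)
print(effect_at_x2_0, effect_at_x2_4)  # 2.0 4.0
```

With the interaction term, the effect of x1 is b1 + b3*x2, so it depends on the level of x2.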
Week #14 – Naive forecast
A naive forecast or prediction is one that is extremely simple and does not rely on a statistical model (or can be expressed as a very basic form of a model).
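One common example, assumed here for illustration, is to use the most recent observed value as the forecast for the next period. The sketch below also computes the mean absolute error of that rule on a short made-up series, which is how a naive forecast is often used as a benchmark for more elaborate models.

```python
def naive_forecast(series):
    """Forecast the next value as the last observed value."""
    if not series:
        raise ValueError("series must not be empty")
    return series[-1]


def naive_mae(series):
    """Mean absolute error of one-step-ahead naive forecasts within the series."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    # The forecast for each period is simply the previous period's value.
    errors = [abs(actual - previous)
              for previous, actual in zip(series[:-1], series[1:])]
    return sum(errors) / len(errors)


# Hypothetical data, for illustration only
demand = [10, 12, 11, 15]
print(naive_forecast(demand))  # 15
print(naive_mae(demand))
```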
Week #9 – Overdispersion
In discrete response models, overdispersion occurs when the data show more variability than the model’s assumptions allow, often because observations are correlated.
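A quick, informal check for count data is to compare the sample variance with the sample mean: under a Poisson model the two should be roughly equal, so a ratio well above 1 hints at overdispersion. The sketch below assumes a Poisson model and uses invented counts. It is a rough diagnostic, not a formal test.

```python
from statistics import mean, variance


def dispersion_ratio(counts):
    """Sample variance divided by sample mean; about 1 under a Poisson model."""
    return variance(counts) / mean(counts)


# Hypothetical counts with many zeros and a few large values
counts = [0, 1, 0, 9, 2, 0, 12, 0]
ratio = dispersion_ratio(counts)
print(ratio > 1)  # True: more spread than a Poisson model would imply
```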
Week #8 – Confusion matrix
In a classification model, the confusion matrix shows the counts of correct and erroneous classifications. In a binary classification problem, the matrix consists of 4 cells.
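For the binary case, the four cells can be counted directly from paired actual and predicted labels. The sketch below is a minimal illustration with made-up labels, where 1 is the class of interest and 0 is the other class. The function name and dictionary keys are choices made for this example.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count the four cells of a binary confusion matrix (labels are 0 or 1)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    counts = Counter(zip(actual, predicted))  # keys are (actual, predicted) pairs
    return {
        "true_positive": counts[(1, 1)],
        "false_negative": counts[(1, 0)],
        "false_positive": counts[(0, 1)],
        "true_negative": counts[(0, 0)],
    }


# Hypothetical labels, for illustration only
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]
cells = confusion_matrix(actual, predicted)
print(cells)
```

The correct classifications are the true positives and true negatives. The other two cells are the two kinds of error.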
Week #5 – Features vs. Variables
The predictors in a predictive model are sometimes given different terms by different disciplines. Traditional statisticians think in terms of variables.
Course Spotlight: The Text Analytics Sequence
Text analytics or text mining is the natural extension of predictive analytics, and Statistics.com’s text analytics program starts Feb. 6. Text analytics is now ubiquitous and yields insight in marketing (voice of the customer, social media analysis, churn analysis, market research, survey analysis) and business (competitive intelligence, document categorization, human resources (voice of the employee), records…
Course Spotlight: Constrained Optimization
Say you operate a tank farm (to store and sell fuel). How much of each fuel grade should you buy? You have specified flow and storage capacities, constraints on what types of fuels can be stored in which tanks, prior contractual obligations about minimum monthly deliveries and incoming supplies, plus the opportunity to sell on…
College Credit Recommendation
Statistics.com Receives College Recommendation from the American Council on Education (ACE). College Credit Recommendation for Online Data Science Courses from The Institute for Statistics Education at Statistics.com LLC. The American Council on Education’s College Credit Recommendation Service (ACE CREDIT) has evaluated and recommended college credit for 5 more of The Institute for Statistics Education at…
Week #48 – Structured vs. unstructured data
Structured data is data that is in a form that can be used to develop statistical or machine learning models (typically a matrix where rows are records and columns are variables or features).
Big Data and Clinical Trials in Medicine
There was an interesting article a couple of weeks ago in the New York Times magazine section on the role that Big Data can play in treating patients — discovering things that clinical trials are too slow, too expensive, and too blunt to find. The story was about a very particular set of lupus symptoms,…
Week #39 – Censoring
Censoring in time-to-event data occurs when some event causes subjects to stop producing data for reasons beyond the control of the investigator, or for reasons external to the issue being studied.
Industry Spotlight: The brand premium for Chanel and Harvard
The classic illustration of the power of brand is perfume – expensive perfumes may cost just a few dollars to produce but can be sold for more than $500 due to the cachet afforded by the brand. David Malan’s Computer Science course at Harvard, CSCI E-50, provides an interesting parallel in the education world. It’s…
Twitter Sentiment vs. Survey Methods
Nobody expects Twitter feed sentiment analysis to give you unbiased results the way a well-designed survey will. A Pew Research study found that Twitter political opinion was, at times, much more liberal than that revealed by public opinion polls, while it was more conservative at other times. Two statisticians speaking at the Joint Statistical Meetings…
Internet of Things
Boston, August 3, 2014: Bill Ruh, GE Software Center, says that the Internet of Things, 30 billion machines talking to one another, will dwarf the impact of the consumer internet. Speaking at the Joint Statistical Meetings today, Ruh predicted that the marriage of the IoT and analytics will yield $1 trillion in savings or productivity…
Week #32 – Predictive modeling
Predictive modeling is the process of using a statistical or machine learning model to predict the value of a target variable (e.g. default or no-default) on the basis of a series of predictor variables (e.g. income, house value, outstanding debt, etc.).
Week #29 – Goodness-of-fit
Goodness-of-fit measures the difference between an observed frequency distribution and a theoretical probability distribution which