Nobody expects Twitter feed sentiment analysis to give you unbiased results the way a well-designed survey will. A Pew Research study found that political opinion on Twitter was at times much more liberal than public opinion polls suggested, and at other times more conservative. Two statisticians speaking at the Joint Statistical Meetings explored a different angle of the question – can Twitter data add value to traditional statistical methods?
Aron Culotta of the Illinois Institute of Technology examined 27 health statistics for the 100 most populous counties in the U.S., and investigated whether sentiment analysis of 4 million tweets added value to predictions made with traditional demographic variables. He found it could – Twitter data was a significant predictor when added to the models for several health outcomes. And, since he had county demographic data at hand, he also developed models to classify the race and gender of individual Twitter users, and was able to estimate correction factors to re-weight the Twitter data (which over-represents females and blacks).
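The re-weighting idea can be illustrated with a toy post-stratification calculation. This is only a sketch with invented numbers – Culotta's actual demographic classifiers and health-outcome models are not reproduced here – but it shows how a correction factor (population share divided by Twitter share) shifts a sentiment estimate toward the county's true demographic mix:

```python
# Hypothetical sketch of demographic re-weighting; all figures are invented.

# Share of each group in the county population (e.g. from Census data)
population_share = {"female": 0.51, "male": 0.49}

# Share of each group among sampled Twitter users (over-represents females here)
twitter_share = {"female": 0.62, "male": 0.38}

# Correction factor: how much to up- or down-weight each group's tweets
weights = {g: population_share[g] / twitter_share[g] for g in population_share}

# Mean sentiment score observed per group (again, invented values)
sentiment = {"female": 0.30, "male": 0.10}

# Unweighted Twitter-sample mean vs. the re-weighted estimate
unweighted = sum(twitter_share[g] * sentiment[g] for g in sentiment)
reweighted = sum(twitter_share[g] * weights[g] * sentiment[g] for g in sentiment)

print(unweighted, reweighted)
```

Because each group's Twitter share times its weight equals its population share, the re-weighted figure is simply the population-weighted mean – pulled down here, since the over-represented group had the higher sentiment score.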
Ashley Richards of RTI International investigated the possibility of using Twitter data to impute missing data in surveys where non-response is a factor. Her small sample size precluded sweeping conclusions, but in her study of masked survey responses, Twitter sentiment analysis predicted survey responses better than standard multiple imputation methods in some cases. Both machine-learning sentiment analysis and human review were applied to the Twitter data, with somewhat different results.
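The basic mechanics of sentiment-based imputation can be sketched with a toy example. Richards' study used real sentiment classifiers and human coders; here an ordinary least-squares fit on invented (sentiment, response) pairs stands in for that step, contrasted with the naive "fill in the mean" baseline:

```python
# Hypothetical sketch: filling in missing survey answers from tweet sentiment.
# Data and the linear model are invented for illustration only.

# (sentiment score, survey response) pairs for people who answered the survey
respondents = [(-0.8, 1), (-0.3, 2), (0.0, 3), (0.4, 4), (0.9, 5)]

# Sentiment scores for non-respondents whose answers are missing
nonrespondents = [-0.5, 0.6]

# Ordinary least-squares fit of response on sentiment (stdlib only)
n = len(respondents)
mean_x = sum(x for x, _ in respondents) / n
mean_y = sum(y for _, y in respondents) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in respondents)
         / sum((x - mean_x) ** 2 for x, _ in respondents))
intercept = mean_y - slope * mean_x

# Sentiment-based imputations vs. the mean-imputation baseline
imputed = [intercept + slope * x for x in nonrespondents]
mean_imputed = [mean_y for _ in nonrespondents]

print(imputed, mean_imputed)
```

Where tweet sentiment genuinely tracks the survey answer, the regression-based values spread realistically around the mean instead of collapsing every missing case to the same number – the intuition behind preferring auxiliary Twitter data over a one-size-fits-all fill-in.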