Prediction vs. Description or: Data Science vs. Market Research

“My market research indicates that 50% of your customers are above the median age. But the shocking discovery was that 50% were below the median age.”
(Dilbert; read it somewhere, cant remember the source)

It was funny to see everyone at O’Reilly’s Strata Conference talk about data science and hear just the dinosaurs like Microsoft, Intel or SAP still calling it “Big Data”. Now, for me, too, data science is the real change; and I tell you, why:

What always annoyed me when working with market researchers: you never get an answer. All you get is a description of the sample. Drawing samples was for sure a difficult task 50 years ago. You had to send interviews arround, using a kish grid (does anyone remember this – at least outside Germany?). The data had to be coded into punch cards and clumsy software was used to plot elementary descriptives from ascii-letters. If you still use SPSS, you might know what I am talking about. When I studied statistics in the early 90s, testing hypotheses was much more important than predictions, and visualisaton was not invented yet. The typical presentation of a market researcher would thus start with describing the sample (50% male, 25% from 20 to 39 years, etc.) and in the end, they would leave the client with some more or less trivialy aggregated Excel-Tables.

When I became in charge of pricing ad breaks of a large TV network, all this research was useless for my purposes. My job required predicting the measured audiences of each of the approximately 40 ad breaks for every of our four national stations six weeks in advance. I had to make the decission in real time, no matter how accurate the information I calculated the risks on would have been.

Market research is bad in supporting real time management decissions. So managers tend to decide on their “gut feelings”. But the framework has changed. The last decade brought to us the possibility to access huge data sets with low latency and run highly multivariate models. You cant do online advertising targeting based on gut feelings.

But most market researchers would still argue that the analytics behind ad targeting are not market research because they would just rely on probabilistic decissions, on predictions based on correlations rather than causality. Machine learning does not test a hypothesis that was derived from a theoretical construct of ideas. It identifies patterns and the prediction would be taken as accurate just if the effect on the ROI would be better then before.

I can very well live with the researchers keeping to their custom as long as I may use my data to do the predictions I need. When attending Strata Conference, I realized this deep paradigm shift from market research, describing data as its own end to data science, getting to predicitons.

Maybe it is thus a good thing to differentiate between market research and data science.

(This is the first in a row of posts on our impressions at Strata this year; the others will follow quickly …)

Algorithmic Glass Bead Games – Why predicting Twitter trends will not change the world

The last hours, I’ve seen a lot of tweets mentioning this great new algorithm by MIT professor Devavrat Shah. The UK Wired, The Verge, Gigaom, The Atlantic Wire and Forbes all posted stories on this fantastic discovery. And this has only been the weekend. Starting next week, there will be a lot more articles celebrating this breakthrough in machine learning.

At first, I was very enthusiastic as well and tweeted the MIT press release. A new algorithm – great stuff! But then slowly, I began to think about this whole thing. This new algorithm claims to predict trending topics on Twitter. But this is a lot different from an algorithm predicting e.g. the outcome of presidential elections or other external events. Trending topics are nothing more than the result of an algorithm themselves:

Trends are determined by an algorithm and are tailored for you based on who you follow and your location. This algorithm identifies topics that are immediately popular, rather than topics that have been popular for a while or on a daily basis, to help you discover the hottest emerging topics of discussion on Twitter that matter most to you.

So, what Shah et al developed is an algorithm that is predicting the outcome of an algorithm. A lot of the coverage suggests that this new algorithm could be very useful for Twitter – because then they would not have to wait for the results of their own algorithm that is defining trends but could use the much brand new algorithm that gives the results 1.5 hours in advance:

The algorithm could be of great interest to Twitter, which could charge a premium for ads linked to popular topics.

What’s next? A Stanford professor that develops an algorithm that can predict the outcome of the Shah algorithm some 1.5 hours in advance? Or what about Google? Maybe someone will invent an algorithm predicting the PageRank for web pages? Oh, wait, something like this has already been invented. Maybe you’ll better know this under its acronym “SEO” or “Search Engine Optimization”.

Wikipedia Attention and the US elections

One of the most interesting challenges of data science are predictions for important events such as national elections. With all those data streams of billions of posts, comments, likes, clicks etc. there should be a way to identify the most important correlations to make predictions about real-world behavior such as: going to the voting booth and chosing a candidate.

A very interesting data source in this respect is the Wikipedia. Why? Because Wikipedia is

  1. a) open (data on page-views, edits, discussions are freely available on daily or even hourly basis),
  2. b) huge (WP currently ranks as #6 of all web sites worldwide and reaches about a quarter of all online users),
  3. c) specific (people visit the Wikipedia because they want to know something about some topic)

The first step was comparing the candidates Barack Obama and Mitt Romney over time. The resulting graph clearly shows the pivoting points of Obama’s presidential career (click to zoom):

Obama vs. Romney 2009-2012 (Wikipedia data)
Obama vs. Romney 2009-2012 (Wikipedia data)

But it also shows how strong Mitt Romney has been since the Republican primaries in January 2012. His Wikipedia page had attracted a lot more visitors in August and September 2012 than his presidential rival’s. Of course, this measure only shows attention, not sentiment. So it cannot be inferred from this data whether the peaks were positive or negative peaks. In terms of Wikipedia attention, Romney’s infamous 47% comments in September 2012 were more than 1/3 as important as Obama’s inauguration in January 2009.

Now, let’s add some further curves to this graph: Obama’s and McCain’s Wikipedia attention during the last elections:

Obama vs. Romney 2012 compared to Obama vs. McCain 2008 (Wikipedia data)
Obama vs. Romney 2012 compared to Obama vs. McCain 2008 (Wikipedia data)

Here’s another version with weekly data:

Obama vs. Romney 2012 compared to Obama vs. McCain 2008 (Wikipedia data, weeks)
Obama vs. Romney 2012 compared to Obama vs. McCain 2008 (Wikipedia data, weeks)

It’s almost instantly clear how much more attention Obama’s 2008 campaign (in red) gathered in comparison with his 2012 campaign (in green). On the other hand, Mitt Romney is at least when it comes to Wikipedia attention more interesting than McCain had been.

Here’s a comparison of Obama’s 2008 campaign vs. his 2012 campaign:

Obama 2008 vs. Obama 2012 (Wikipedia data)
Obama 2008 vs. Obama 2012 (Wikipedia data)

The last question: Is Mitt Romney 2012 as strong as Obama had been in 2008? Here’s a direct comparison:

Obama 2008 vs. Romney 2012 (Wikipedia data, weekly)
Obama 2008 vs. Romney 2012 (Wikipedia data, weekly)

A side-remark: I also did a correlation of this data set with Google Correlate. And guess what: The strongest correlation of the data for Obama’s 2012 campaign is the Google search query for “barack obama wikipedia”. There still seem to be a huge number of people using Google as their Wikipedia search-engine.

Google Correlate result for the Wikipedia time series "Barack Obama"
Google Correlate result for the Wikipedia time series “Barack Obama”

But this result could also be interpreted the other way round: If there is a strong correlation between Wikipedia usage and Google search queries, this makes Wikipedia an even more important data source for analyses.