Just a few hours before the ballots open for the 57th presidential election, the key question for us data scientists is: which data set could really show some special information, that would not be easily available through a classic poll. We have already seen some interesting correlations of Wikipedia usage with the ongoing campaign – just looking on how many people would search for the page on the candidates would provide a time series with many fascinating details.
Today we focused on the question, if the Democrats had been successful to rally Hispanic voters for Obama’s support. We took the Spanish Wikipedia, checkt the daily views of the Obama’s Spanish language Wikipedia-page and compared this with 2008 and also with the time series of his Republican competitors.
This table shows the monthly average for the daily views in 2008 and 2012:
Obama clearly leads – not only in absolute numbers but in particular regarding the increase of this year’s views with four years ago. While Romney would have gained 66% in views compared with McCain, Obama’s views would have more than doubled.
However, if we look just on the last week prior election day, we can see something strange happen: the view’s of Obama’s es.wikipedia-page have droped from the daily average of 5065 in 2008 to merely 3752 in 2012. The same is true for Romney versus McCain: 1795 average views in this year’s 44th week compared to 1986 in 2008.
This decreasing interest in the candidates is not reflected in the numbers that we see on other election-related search-terms. If we e.g. take ‘US Presidential Election’, we count 2672 daily views during the week before election day in 2008 and 3812 views in 2012 – the same rise in interest that we found in the English Wikipedia, too. (See the last post “Why the 2012 US elections are more exciting than 2008”). While the general interest in the elections is huge, the candidates no longer draw that much attention of the Spanish speaking community.
Maybe “Sandy” would work as an explanation since the campaing was halted during the Hurican – nevertheless it would not be plausible why only the candidates but not the election in general would suffer in awareness from this.
So we cannot draw a clear conclusion from our findings. There is evidence that Obama would have succeded to some extend to activate the interest of Hispanic people but regarding the unexpected drop we will have to drill further down. The real work, though, will anyway start right after the vote: to learn what would have been a signal and how we can seperate these from the noise next time.
Here’s an addition to my last post on using Wikipedia data to analyse attention for the US presidential elections 2012. Here’s another look at the interest not for the candidates’ Wikipedia pages but the general pages for the elections 2008 and 2012. Compared to the candidates’ pages, the attention for the general election page is much lower than for the candidates. Here’s the average values for October 2012:
Mitt Romney (2012): 98,138 Views / day
Barack Obama (2012): 63,104 Views / day
United States presidential election, 2012 (2012): 38,770
United States presidential election, 2008 (2008): 27,907
This monthly average hints at the 2012 elections being very exciting as the general election pages on Wikipedia have seen a 39% traffic increase compared to last elections. This also hold for the following time-series:
While the attention for the election pages in 2012 did not reach the level it had during the 2008 primaries, from mid October the 2012 campaigns were much more interesting according to the Wikipedia numbers. In 2008 we have seen a drop in attention before election day, in 2012 the suspense seems to build up.
The last hours, I’ve seen a lot of tweets mentioning this great new algorithm by MIT professor Devavrat Shah. The UK Wired, The Verge, Gigaom, The Atlantic Wire and Forbes all posted stories on this fantastic discovery. And this has only been the weekend. Starting next week, there will be a lot more articles celebrating this breakthrough in machine learning.
At first, I was very enthusiastic as well and tweeted the MIT press release. A new algorithm – great stuff! But then slowly, I began to think about this whole thing. This new algorithm claims to predict trending topics on Twitter. But this is a lot different from an algorithm predicting e.g. the outcome of presidential elections or other external events. Trending topics are nothing more than the result of an algorithm themselves:
Trends are determined by an algorithm and are tailored for you based on who you follow and your location. This algorithm identifies topics that are immediately popular, rather than topics that have been popular for a while or on a daily basis, to help you discover the hottest emerging topics of discussion on Twitter that matter most to you.
So, what Shah et al developed is an algorithm that is predicting the outcome of an algorithm. A lot of the coverage suggests that this new algorithm could be very useful for Twitter – because then they would not have to wait for the results of their own algorithm that is defining trends but could use the much brand new algorithm that gives the results 1.5 hours in advance:
The algorithm could be of great interest to Twitter, which could charge a premium for ads linked to popular topics.
What’s next? A Stanford professor that develops an algorithm that can predict the outcome of the Shah algorithm some 1.5 hours in advance? Or what about Google? Maybe someone will invent an algorithm predicting the PageRank for web pages? Oh, wait, something like this has already been invented. Maybe you’ll better know this under its acronym “SEO” or “Search Engine Optimization”.
One of the most interesting challenges of data science are predictions for important events such as national elections. With all those data streams of billions of posts, comments, likes, clicks etc. there should be a way to identify the most important correlations to make predictions about real-world behavior such as: going to the voting booth and chosing a candidate.
A very interesting data source in this respect is the Wikipedia. Why? Because Wikipedia is
a) open (data on page-views, edits, discussions are freely available on daily or even hourly basis),
b) huge (WP currently ranks as #6 of all web sites worldwide and reaches about a quarter of all online users),
c) specific (people visit the Wikipedia because they want to know something about some topic)
The first step was comparing the candidates Barack Obama and Mitt Romney over time. The resulting graph clearly shows the pivoting points of Obama’s presidential career (click to zoom):
But it also shows how strong Mitt Romney has been since the Republican primaries in January 2012. His Wikipedia page had attracted a lot more visitors in August and September 2012 than his presidential rival’s. Of course, this measure only shows attention, not sentiment. So it cannot be inferred from this data whether the peaks were positive or negative peaks. In terms of Wikipedia attention, Romney’s infamous 47% comments in September 2012 were more than 1/3 as important as Obama’s inauguration in January 2009.
Now, let’s add some further curves to this graph: Obama’s and McCain’s Wikipedia attention during the last elections:
Here’s another version with weekly data:
It’s almost instantly clear how much more attention Obama’s 2008 campaign (in red) gathered in comparison with his 2012 campaign (in green). On the other hand, Mitt Romney is at least when it comes to Wikipedia attention more interesting than McCain had been.
Here’s a comparison of Obama’s 2008 campaign vs. his 2012 campaign:
The last question: Is Mitt Romney 2012 as strong as Obama had been in 2008? Here’s a direct comparison:
A side-remark: I also did a correlation of this data set with Google Correlate. And guess what: The strongest correlation of the data for Obama’s 2012 campaign is the Google search query for “barack obama wikipedia”. There still seem to be a huge number of people using Google as their Wikipedia search-engine.
But this result could also be interpreted the other way round: If there is a strong correlation between Wikipedia usage and Google search queries, this makes Wikipedia an even more important data source for analyses.
It took a while for the three Vs of Big data to take off as one of the most frequently quoted buzzwords of BigData. Doug Laney of Meta Group (now aquired by Gartner) had coined the in his paper on “3-D Data Management: Controlling Data Volume, Velocity and Variety” in 2001. But now, with everybody in the industry philosophing on how to characterize BigData, it is now wonder, that we start seeing many mutations in the viral spreading of Leney’s catchy definition.