R – Beautiful Data

What to expect from Strata Conference 2015? An empirical outlook.

In one week, the 2015 edition of Strata Conference (or rather: Strata + Hadoop World) will open its doors to data scientists and big data practitioners from all over the world. What will be the most important big data technology trends for this year? As last year, I ran an analysis on the Strata abstract for 2015 and compared them to the previous years.

One thing immediately strikes: 2015 will be probably known as the “Spark Strata”:

If you compare mentions of the major programming languages in data science, there’s another interesting find: R seems to have a comeback and Python may be losing some of its momentum:

R is also among the rising topics if you look at the word frequencies for 2015 and 2014:

Now, let’s take a look at bigrams that have been gaining a lot of traction since the last Strata conference. From the following table, we could expect a lot more case studies than in the previous years:

This analysis has been done with IPython and Pandas. See the approach in this notebook.

Looking forward to meeting you all at Strata Conference next week! I’ll be around all three days and always in for a chat on data science.

How to analyze smartphone sensor data with R and the BreakoutDetection package

Yesterday, Jörg has written a blog post on Data Storytelling with Smartphone sensor data. Here’s a practical approach on how to analyze smartphone sensor data with R. In this example I will be using the accelerometer smartphone data that Datarella provided in its Data Fiction competition. The dataset shows the acceleration along the three axes of the smartphone:

x – sideways acceleration of the device
y – forward and backward acceleration of the device
z – acceleration up and down

The interpretation of these values can be quite tricky because on the one hand there are manufacturer, device and sensor specific variations and artifacts. On the other hand, all acceleration is measured relative to the sensor orientation of the device. So, for example, the activity of taking the smartphone out of your pocket and reading a tweet can look the following way:

y acceleration – the smartphone had been in the pocket top down and is now taken out of the pocket
z and y acceleration – turning the smartphone so that is horizontal
x acceleration – moving the smartphone from the left to the middle of your body
z acceleration – lifting the smartphone so you can read the fine print of the tweet

And third, there is gravity influencing all the movements.

So, to find out what you are really doing with your smartphone can be quite challenging. In this blog post, I will show how to do one small task – identifying breakpoints in the dataset. As a nice side effect, I use this opportunity to introduce an application of the Twitter BreakoutDetection Open Source library (see Github) that can be used for Behavioral Change Point analysis.

First, I load the dataset and take a look at it:

setwd("~/Documents/Datarella")
accel <- read.csv("SensorAccelerometer.csv", stringsAsFactors=F)
head(accel)

  user_id           x          y        z                 updated_at                 type
1      88 -0.06703765 0.05746084 9.615114 2014-05-09 17:56:21.552521 Probe::Accelerometer
2      88 -0.05746084 0.10534488 9.576807 2014-05-09 17:56:22.139066 Probe::Accelerometer
3      88 -0.04788403 0.03830723 9.605537 2014-05-09 17:56:22.754616 Probe::Accelerometer
4      88 -0.01915361 0.04788403 9.567230 2014-05-09 17:56:23.372244 Probe::Accelerometer
5      88 -0.06703765 0.08619126 9.615114 2014-05-09 17:56:23.977817 Probe::Accelerometer
6      88 -0.04788403 0.07661445 9.595961  2014-05-09 17:56:24.53004 Probe::Accelerometer

This is the sensor data for one user on one day:

accel$day <- substr(accel$updated_at, 1, 10)
df <- accel[accel$day == '2014-05-12' & accel$user_id == 88,]
df$timestamp <- as.POSIXlt(df$updated_at) # Transform to POSIX datetime
library(ggplot2)
ggplot(df) + geom_line(aes(timestamp, x, color="x")) + 
             geom_line(aes(timestamp, y, color="y")) + 
             geom_line(aes(timestamp, z, color="z")) + 
             scale_x_datetime() + xlab("Time") + ylab("acceleration")

Let’s zoom in to the period between 12:32 and 13:00:

ggplot(df[df$timestamp >= '2014-05-12 12:32:00' & df$timestamp < '2014-05-12 13:00:00',]) +
  geom_line(aes(timestamp, x, color="x")) + 
  geom_line(aes(timestamp, y, color="y")) + 
  geom_line(aes(timestamp, z, color="z")) + 
  scale_x_datetime() + xlab("Time") + ylab("acceleration")

Then, I load the Breakoutdetection library:

install.packages("devtools")
devtools::install_github("twitter/BreakoutDetection")
library(BreakoutDetection)
bo <- breakout(df$x[df$timestamp >= '2014-05-12 12:32:00' & df$timestamp < '2014-05-12 12:35:00'], 
               min.size=10, method='multi', beta=.001, degree=1, plot=TRUE)
bo$plot

This quick analysis of the acceleration in the x direction gives us 4 change points, where the acceleration suddenly changes. In the beginning, the smartphone seems to lie flat on a horizontal surface – the sensor is reading a value of around 9.8 in positive direction – this means, the gravitational force only effects this axis and not the x and y axes. Ergo: the smartphone is lying flat. But then things change and after a few movements (our change points) the last observation has the smartphone on a position where the x axis has around -9.6 acceleration, i.e. the smartphone is being held in landscape orientation pointing to the right.

Anomaly Detection with Wikipedia Page View Data

Today, the Twitter engineering team released another very interesting Open Source R package for working with time series data: “AnomalyDetection“. This package uses the Seasonal Hybrid ESD (S-H-ESD) algorithm to identify local anomalies (= variations inside seasonal patterns) and global anomalies (= variations that cannot be explained with seasonal patterns).

As a kind of warm up and practical exploration of the new package, here’s a short example on how to download Wikipedia PageView statistics and mine them for anomalies (inspired by this blog post, where this package wasn’t available yet):

First, we install and load the necessary packages:

library(RJSONIO)
library(RCurl)
library(ggplot2)
install.packages("devtools")
devtools::install_github("twitter/AnomalyDetection")
library(AnomalyDetection)

Then we choose an interesting Wikipedia page and download the last 90 days of PageView statistics:

page <- "USA"
raw_data <- getURL(paste("http://stats.grok.se/json/en/latest90/", page, sep=""))
data <- fromJSON(raw_data)
views <- data.frame(timestamp=paste(names(data$daily_views), " 12:00:00", sep=""), stringsAsFactors=F)
views$count <- data$daily_views
views$timestamp <- as.POSIXlt(views$timestamp) # Transform to POSIX datetime
views <- views[order(views$timestamp),]

I also did some pre-processing and transformation of the dates in POSIX datetime format. A first plot shows this pattern:

ggplot(views, aes(timestamp, count)) + geom_line() + scale_x_datetime() + xlab("") + ylab("views")

Now, let’s look for anomalies. The usual way would be to feed a dataframe with a date-time and a value column into the AnomalyDetection function AnomalyDetectionTs(). But in this case, this doesn’t work because our data is much too coarse. It doesn’t seem to work with data on days. So, we use the more generic function AnomalyDetectionVec() that just needs the values and some definition of a period. In this case, the period is 7 (= 7 days for one week):

res = AnomalyDetectionVec(views$count, max_anoms=0.05, direction='both', plot=TRUE, period=7)
res$plot

In our case, the algorithm has discovered 4 anomalies. The first on October 30 2014 being an exceptionally high value overall, the second is a very high Sunday, the third a high value overall and the forth a high Saturday (normally, this day is also quite weak).

The Top 7 Beautiful Data Blog Posts in 2014

2014 was a great year in data science – and also an exciting year for me personally from a very inspirational Strata Conference in Santa Clara to a wonderful experience of speaking at PyData Berlin to founding the data visualization company DataLion. But it also was a great year blogging about data science. Here’s the Beautiful Data blog posts our readers seemed to like the most:

Datalicious Notebookmania – My personal list of the 7 IPython notebooks I like the most. Some of them are great for novices, some can even be challenging for advanced statisticians and datascientists
Trending Topics at Strata Conferences 2011-2014 – An analysis of the topics most frequently mentioned in Strata Conference abstracts that clearly shows the rising importance of Python, IPython and Pandas.
Big Data Investment Map 2014 – I’ve been tracking and analysing the developments in Big Data investments and IPOs for quite a long time. This was the 2014 update of the network mapping the investments of VCs in Big Data companies.
Analyzing VC investment strategies with Crunchbase data – This blog post explains the code used to create the network.
How to create a location graph from the Foursquare API – In this post, I explain a way to make sense out of the Foursquare API and to create geospatial network visualizations from the data showing how locations in a city are connected via Foursquare checkins.
Text-Mining the DLD Conference 2014 – A very similar approach as I used for the Strata conference has been applied to the Twitter corpus refering to Hubert Burda Media DLD conference showing the trending topics in tech and media.
Identifying trends in the German Google n-grams corpus – This tutorial shows how to analyze Big data-sets such as the Google Book ngram corpus with Hive on the Amazon Cloud.

Querying the Bitcoin blockchain with R

The crypto-currency Bitcoin and the way it generates “trustless trust” is one of the hottest topics when it comes to technological innovations right now. The way Bitcoin transactions always backtrace the whole transaction list since the first discovered block (the Genesis block) does not only work for finance. The first startups such as Blockstream already work on ways how to use this mechanism of “trustless trust” (i.e. you can trust the system without having to trust the participants) on related fields such as corporate equity.

So you could guess that Bitcoin and especially its components the Blockchain and various Sidechains should also be among the most exciting fields for data science and visualization. For the first time, the network of financial transactions many sociologists such as Georg Simmel theorized about becomes visible. Although there are already a lot of technical papers and even some books on the topic, there isn’t much material that allows for a more hands-on approach, especially on how to generate and visualize the transaction networks.

The paper on “Bitcoin Transaction Graph Analysis” by Fleder, Kester and Pillai is especially recommended. It traces the FBI seizure of $28.5M in Bitcoin through a network analysis.

So to get you started with R and the Blockchain, here’s a few lines of code. I used the package “Rbitcoin” by Jan Gorecki.

Here’s our first example, querying the Kraken exchange for the exchange value of Bitcoin vs. EUR:

library(Rbitcoin)

## Loading required package: data.table
## You are currently using Rbitcoin 0.9.2, be aware of the changes coming in the next releases (0.9.3 - github, 0.9.4 - cran). Do not auto update Rbitcoin to 0.9.3 (or later) without testing. For details see github.com/jangorecki/Rbitcoin. This message will be removed in 0.9.5 (or later).

wait <- antiddos(market = 'kraken', antispam_interval = 5, verbose = 1)
market.api.process('kraken',c('BTC','EUR'),'ticker')

##    market base quote           timestamp market_timestamp  last     vwap
## 1: kraken  BTC   EUR 2015-01-02 13:12:03             <NA&gt; 263.2 262.9169
##      volume    ask    bid
## 1: 458.3401 263.38 263.22

The function antiddos makes sure that you’re not overusing the Bitcoin API. A reasonable query interval should be one query every 10s.

Here’s a second example that gives you a time-series of the lastest exchange values:

trades <- market.api.process('kraken',c('BTC','EUR'),'trades')
Rbitcoin.plot(trades, col='blue')

The last two examples all were based on aggregated values. But the Blockchain API allows to read every single transaction in the history of Bitcoin. Here’s a slightly longer code example on how to query historical transactions for one address and then mapping the connections between all addresses in this strand of the Blockchain. The red dot is the address we were looking at (so you can change the value to one of your own Bitcoin addresses):

wallet <- blockchain.api.process('15Mb2QcgF3XDMeVn6M7oCG6CQLw4mkedDi')
seed <- '1NfRMkhm5vjizzqkp2Qb28N7geRQCa4XqC'
genesis <- '1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa'
singleaddress <- blockchain.api.query(method = 'Single Address', bitcoin_address = seed, limit=100)
txs <- singleaddress$txs

bc <- data.frame()
for (t in txs) {
  hash <- t$hash
  for (inputs in t$inputs) {
    from <- inputs$prev_out$addr
    for (out in t$out) {
      to <- out$addr
      va <- out$value
      bc <- rbind(bc, data.frame(from=from,to=to,value=va, stringsAsFactors=F))
    }
  }
}

After downloading and transforming the blockchain data, we’re now aggregating the resulting transaction table on address level:

library(plyr)
btc <- ddply(bc, c("from", "to"), summarize, value=sum(value))

Finally, we’re using igraph to calculate and draw the resulting network of transactions between addresses:

library(igraph)
btc.net <- graph.data.frame(btc, directed=T)
V(btc.net)$color <- "blue"
V(btc.net)$color[unlist(V(btc.net)$name) == seed] <- "red"
nodes <- unlist(V(btc.net)$name)
E(btc.net)$width <- log(E(btc.net)$value)/10
plot.igraph(btc.net, vertex.size=5, edge.arrow.size=0.1, vertex.label=NA, main=paste("BTC transaction network for\n", seed))

2014 highlight (1): Statistical Learning course by Hastie & Tibshirani

What I like most about the R and Python developer and user communities, is their incredible openness and generosity. One of the finest examples in the past year was the online course “Statistical Learning” taught by Stanford professors Trevor Hastie and Rob Tibshirani.

statlearning_screenshot

In this MOOC they explain very understandably (even for beginners) the basics of statistical modeling (or machine learning) techniques such as linear, polynomial and logistic regression, smoothing splines, Ridge regression, Lasso, Generalized Additive Models, various methods for classification (Classification Trees to random forests) and also unsupervised learning methods.

But the highlights of this course are the R labs between all units. In these sessions, the statistical theory is supplemented with many practical examples. It’s really fantastic to hear the authors explain and teach the (very essential) R packages they wrote themselves. For me, the course was also an impetus to learn even more about knitr. Especially if you’re used to IPython notebook, this combination of code and output can be very intuitive and useful. Even months after going through the course, I refer to my lab R code (see also here) when I need some quick templates for common statistical modelling tasks. I really liked the strong focus on cross validation methods – many basic courses on statistics focus only on the methods and not on how to estimate how well you’re predicting.

ISL Cover 2 The course is based on the textbook “Introduction to Statistical Learning” (or short: ISL, download here) Hastie and Tibshirani wrote together with Gareth James and Daniela Witten. If you want to dive even deeper into the subject, you can also work through the more advanced work “Elements of Statistical Learning” (ESL, download).

So, if one of your New Year resolutions for 2015 is, learning how to do more with R, you should definitively take a look at this course. The next free class is starting on January 19.

Continue with part 2 of the 2014 highlights.

Analyzing VC investment strategies with Crunchbase data

If you look at the investments in Big Data companies in the last few years, one thing is obvious: This is a very dynamic and fast growing market. I am producing regular updates of this network map of Big Data investments with a Python program (actually an IPython Notebook).

But what insights can be gained by directly analyzing the Crunchbase investment data? Today I revved up my RStudio to take a closer look at the data beneath the nodes and links.

Load the data and required packages:

library(reshape)
library(plyr)
library(ggplot2)
data <- read.csv('crunchbase_monthly_export_201403_investments.csv', sep=';', stringsAsFactors=F)
inv <- data[,c("investor_name", "company_name", "company_category_code", "raised_amount_usd", "investor_category_code")]
inv$raised_amount_usd[is.na(inv$raised_amount_usd)] <- 1

In the next step, we are selecting only the 100 top VC firms for our analysis:

inv <- inv[inv$investor_category_code %in% c("finance", ""),]
top <- ddply(inv, .(investor_name), summarize, sum(raised_amount_usd))
names(top) <- c("investor_name", "usd")
top <- top[order(top$usd, decreasing=T),][1:100,]
invtop <- inv[inv$investor_name %in% top$investor_name[1:100],]

Right now, each investment from a VC firm to a Big Data company is one row. But to analyze the similarities between the VC companies in term of their investment in the various markets, we have to transform the data into a matrix. Fortunately, this is exactly, what Hadley Wickham’s reshape package can do for us:

inv.mat <- cast(invtop[,1:4], investor_name~company_category_code, sum)
inv.names <- inv.mat$investor_name
inv.mat <- inv.mat[,3:40] # drop the name column and the V1 column (unknown market)

These are the most important market segments in the Crunchbase (Top 100 VCs only):

inv.seg <- ddply(invtop, .(company_category_code), summarize, sum(raised_amount_usd))
names(inv.seg) <- c("Market", "USD")
inv.seg <- inv.seg[inv.seg$Market != "",]
inv.seg$Market <- as.factor(inv.seg$Market)
inv.seg$Market <- reorder(inv.seg$Market, inv.seg$USD)
ggplot(inv.seg, aes(Market, USD/1000000))+geom_bar(stat="identity")+coord_flip()+ylab("$1M USD")

plot of chunk unnamed-chunk-4

What’s interesting now: Which branches are related to each other in terms of investments (e.g. VCs who invested in biotech also invested in cleantech and health …). This question can be answered by running the data through a K-means cluster analysis. In order to downplay the absolute differences between the categories, I am using the log values of the investments:

inv.market <- log(t(inv.mat))
inv.market[inv.market == -Inf] <- 0

fit <- kmeans(inv.market, 7, nstart=50)
pca <- prcomp(inv.market)
pca <- as.matrix(pca$x)
plot(pca[,2], pca[,1], type="n", xlab="Principal Component 1", ylab="Principal Component 2", main="Market Segments")
text(pca[,2], pca[,1], labels = names(inv.mat), cex=.7, col=fit$cluster)

plot of chunk unnamed-chunk-5

My 7 cluster solution has identified the following clusters:

Health
Cleantech / Semiconductors
Manufacturing
News, Search and Messaging
Social, Finance, Analytics, Advertising
Automotive & Sports
Entertainment

The same can of course be done for the investment firms. Here the question will be: Which clusters of investment strategies can be identified? The first variant has been calculated with the log values from above:

inv.log <- log(inv.mat)
inv.log[inv.log == -Inf] <- 0
inv.rel <- scale(inv.mat)

fit <- kmeans(inv.log, 6, nstart=15)
pca <- prcomp(inv.log)
pca <- as.matrix(pca$x)
plot(pca[,2], pca[,1], type="n", xlab="Principal Component 1", ylab="Principal Component 2", main="VC firms")
text(pca[,2], pca[,1], labels = inv.names, cex=.7, col=fit$cluster)

plot of chunk unnamed-chunk-6

The second variant uses scaled values:

inv.rel <- scale(inv.mat)

fit <- kmeans(inv.rel, 6, nstart=15)
pca <- prcomp(inv.rel)
pca <- as.matrix(pca$x)
plot(pca[,2], pca[,1], type="n", xlab="Principal Component 1", ylab="Principal Component 2", main="VC firms")
text(pca[,2], pca[,1], labels = inv.names, cex=.7, col=fit$cluster)

plot of chunk unnamed-chunk-7

Animated Twitter Networks

In this blogpost I presented a visualization made with R that shows how almost the whole world expresses its attention to political crises abroad. Here’s another visualization with Tweets in October 2013 that referred to the Lampedusa tragedy in the Mediterranean.

But this transnational public space isn’t quite as static as it seems on these images. To show how these geographical hashtag links develop over time, I analyzed the timestamps of the (geo-coded) Tweets mentioning the hashtag #lampedusa. This is the resulting animation showing the choreography of global solidarity:

The code is quite straightforward. After collecting the Tweets via the Twitter Streaming API e.g. with Pablo Barberá’s R package streamR, I quantized the dates to hourly values and then calculated the animation frame by frame inspired by Jeff Hemsley’s approach.

One trick that is very helpful when plotting geospatial connections with great circles is the following snippet that correctly assembles lines that cross the dateline:
library("geosphere") for (i in 1:length(l$long)) { inter <- gcIntermediate(c(l$long[i], l$lat[i]), c(12.6, 35.5), n=500, addStartEnd=TRUE, breakAtDateLine=TRUE) if (length(inter) > 2) { lines(inter, col=col, lwd=l$n[i]) } else { lines(inter[[1]], col=col, lwd=l$n[i]) lines(inter[[2]], col=col, lwd=l$n[i]) } }

Mining Research Interests – or: What Would Google Want to Know?

I am a regular visitor of Google’s research page where they post all of their latest and upcoming scientific papers. Lately I have thought whether it would be possible to statistically extract some of the meta-information from the papers. Here’s the result of the analysis of the papers’ titles produced with just a few lines of R code:

I clustered the data with a standard hierarchical cluster analysis to find out which terms tend to often go together in the paper titles. Then I took a deeper look at the abstracts – of all the papers that had abstracts that is. I processed the abstracts with the tm R package and draw the following heat-map that shows how often which of the most important keywords appear in each paper:

I did a similar heatmap but this time normalized by the term frequency – inverse document frequency measure. While the first heatmap shows the most frequently used terms, this weighted heatmap shows terms that are quite important in their respective research papers but normalizes this by the overall term frequency.

If you need input for playing buzzword bingo at the next Strata Conference in Santa Clara, you don’t have to look any further 😉

Cosmopolitan Public Spaces

Mentions of the Gezi Park protests on Twitter

In my PhD and post-doc research projects at the university, I did a lot of research on the new cosmopolitanism together with Ulrich Beck. Our main goal was to test the hypothesis of an “empirical cosmopolitanization”. Maybe the term is confusing and too abstract, but what we were looking for were quite simple examples for ties between humans that undermine national borders. We were trying to unveil the structures and processes of a real-existing cosmopolitanism.

I looked at a lot of statistics on transnational corporations and the evolution of transnational economic integration. But one of the most exciting dimensions of the theory of cosmopolitanism is the rise of a cosmopolitan public sphere. This is not the same as a global public that can be found in features such as world music, Hollywood blockbusters or global sports events. A cosmopolitan public sphere refers to solidarity with other human beings.

When I discovered the discussions on Twitter about the Gezi Park protests in Istanbul, this kind of cosmopolitan solidarity seems to assume a definite form: The lines that connect people all over Europe with the Turkish protesters are not the usual international relations, but they are ties that e.g. connect Turkish emigrants, political activists, “Wutbürger” or generally political aware citizens with the events in Istanbul. Because only about 1% of all tweets carry information about the geo-position of the user, you should imagine about 100 times more lines to see the true dimension of this phenomenon.

Mapping a Revolution

Twitter has become an important communications tool for political protests. While mass media are often censored during large-scale political protests, Social Media channels remain relatively open and can be used to tell the world what is happening and to mobilize support all over the world. From an analytic perspective tweets with geo information are especially interesting.

Here’s some maps I did on the basis of ~ 6,000 geotagged tweets from ~ 12 hours on 1 and 2 Jun 2013 referring to the “Gezi Park Protests” in Istanbul (i.e. mentioning the hashtags “occupygezi”, “direngeziparki”, “turkishspring”* etc.). The tweets were collected via the Twitter streaming API and saved to a CouchDB installation. The maps were produced by R (unfortunately the shapes from the map package are a bit outdated).

*”Turkish Spring” or “Turkish Summer” are misleading terms as the situation in Turkey cannot be compared to the events during the “Arab Spring”. Nonetheless I have included them in my analysis because they were used in the discussion (e.g. by mass media twitter channels) Thanks @Taksim for the hint.

International Attention for Gezi Park protests 1-2 Jun

On the next day, there even was one tweet mentioning the protests crossing the dateline:

International Attention for Gezi Park protests 1-3 Jun

First, I took a look at the international attention (or even cosmopolitan solidarity) of the events in Turkey. The following maps are showing geotagged tweets from all over the world and from Europe that are referring to the events. About 1% of all tweets containing the hashtags carry exact geographical coordinates. The fact, that there are so few tweets from Germany – a country with a significant population of Turkish immigrants – should not be overrated. It’s night-time in Germany and I would expect a lot more tweets tomorrow.

European Attention for Gezi Park protests 1-2 Jun

14,000 geo-tagged tweets later the map looks like this:

European Attention for Gezi Park protests 1-3 Jun

The next map is zooming in closer to the events: These are the locations in Turkey where tweets were sent with one of the hashtags mentioned above. The larger cities Istanbul, Ankara and Izmir are active, but tweets are coming from all over the country:

Turkish Tweets about the Gezi Park protests 1-2 Jun

On June 3rd, the activity has spread across the country:

Turkish Tweets about the Gezi Park protests 1-3 Jun

And finally, here’s a look at the tweet locations in Istanbul. The map is centered on Gezi Park – and the activity on Twitter as well:

Istanbul Tweets about Gezi Park protests 1-2 Jun

Here’s the same map a day later (I decreased the size of the dots a bit while the map is getting clearer):

Istanbul Tweets about Gezi Park protests 1-3 Jun

The R code to create the maps can be found on my GitHub.

Color analysis of Flickr images

Since I’ve seen this beautiful color wheel visualizing the colors of Flickr images, I’ve been fascinated with large scale automated image analysis. At the German Market Research association’s conference in late April, I presented some analyses that went in the same direction (click to enlarge):

Color values of Flickr images from Germany

On the image above you can see the color values ordered by their hue from images taken in Germany between August 2010 and April 2013. Each row represents the aggregation of 2.000 images downloaded from the Flickr API. I did this with the following R code:
bbox <- "5.866240,47.270210,15.042050,55.058140" pages <- 10 maxdate <- "2010-08-31" mindate <- "2010-08-01" for (i in 1:pages) { api <- paste("http://www.flickr.com/services/rest/?method=flickr.photos.search&format=json&api_key=YOUR_API_KEY_HERE &nojsoncallback=1&page=", i, "&per_page=500&bbox=", bbox, "&min_taken_date=", mindate, "&max_taken_date=", maxdate, sep="") raw_data <- getURL(api, ssl.verifypeer = FALSE) data <- fromJSON(raw_data, unexpected.escape="skip", method="R") # This gives a list of the photo URLs including the information # about id, farm, server, secret that is needed to download # them from staticflickr.com }
To aggregate the color values, I used Vijay Pandurangans Python script he wrote to analyze the color values of Indian movie posters. Fortunately, he open sourced the code and uploaded it on GitHub (thanks, Vijay!)

The monthly analysis of Flickr colors clearly hints at seasonal trends, e.g. the long and cold winter of 2012/2013 can be seen in the last few rows of the image. Also, the soft winter of 2011/2012 with only one very cold February appears in the image.

To take the analysis even further, I used weather data from the repository of the German weather service and plotted the temperatures for the same time frame:

Could this be the same seasonality? To find out how the image color values above and the temperature curve below are related, I calculated the correlation between the dominance of the colors and the average temperature. Each month can not only be represented as a hue band, but also as a distribution of colors, e.g. the August 2010 looks like this:

So there’s a percent value for each color and each month. When I correlated the temperature values and the color values, the colors with the highest correlations were green (positive) and grey (negative). So, the more green is in a color band, the higher the average temperature in this month. This is how the correlation looks like:

The model actually is pretty good:
> fit <- lm(temp~yellow, weather) > summary(fit) Call: lm(formula = temp ~ yellow, data = weather) Residuals: Min 1Q Median 3Q Max -5.3300 -1.7373 -0.3406 1.9602 6.1974 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -5.3832 1.2060 -4.464 0.000105 *** yellow 2.9310 0.2373 12.353 2.7e-13 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 2.802 on 30 degrees of freedom Multiple R-squared: 0.8357, Adjusted R-squared: 0.8302 F-statistic: 152.6 on 1 and 30 DF, p-value: 2.695e-13

Of course, it can even be improved a bit by calculating it with a polynomial formula. With second order polynomials lm(temp~poly(yellow,2), weather), we even get a R-squared value of 0.89. So, even when the pictures I analysed are not always taken outside, there seems to be a strong relationship between the colors in our Flickr photostreams and the temperature outside.

Why the 2012 US elections are more exciting than 2008

Here’s an addition to my last post on using Wikipedia data to analyse attention for the US presidential elections 2012. Here’s another look at the interest not for the candidates’ Wikipedia pages but the general pages for the elections 2008 and 2012. Compared to the candidates’ pages, the attention for the general election page is much lower than for the candidates. Here’s the average values for October 2012:

Mitt Romney (2012): 98,138 Views / day
Barack Obama (2012): 63,104 Views / day
United States presidential election, 2012 (2012): 38,770
United States presidential election, 2008 (2008): 27,907

This monthly average hints at the 2012 elections being very exciting as the general election pages on Wikipedia have seen a 39% traffic increase compared to last elections. This also hold for the following time-series:

US Presidential Elections 2012 vs. 2008 (Wikipedia, daily visits)

While the attention for the election pages in 2012 did not reach the level it had during the 2008 primaries, from mid October the 2012 campaigns were much more interesting according to the Wikipedia numbers. In 2008 we have seen a drop in attention before election day, in 2012 the suspense seems to build up.

Wikipedia Attention and the US elections

One of the most interesting challenges of data science are predictions for important events such as national elections. With all those data streams of billions of posts, comments, likes, clicks etc. there should be a way to identify the most important correlations to make predictions about real-world behavior such as: going to the voting booth and chosing a candidate.

A very interesting data source in this respect is the Wikipedia. Why? Because Wikipedia is

a) open (data on page-views, edits, discussions are freely available on daily or even hourly basis),
b) huge (WP currently ranks as #6 of all web sites worldwide and reaches about a quarter of all online users),
c) specific (people visit the Wikipedia because they want to know something about some topic)

The first step was comparing the candidates Barack Obama and Mitt Romney over time. The resulting graph clearly shows the pivoting points of Obama’s presidential career (click to zoom):

Obama vs. Romney 2009-2012 (Wikipedia data)

But it also shows how strong Mitt Romney has been since the Republican primaries in January 2012. His Wikipedia page had attracted a lot more visitors in August and September 2012 than his presidential rival’s. Of course, this measure only shows attention, not sentiment. So it cannot be inferred from this data whether the peaks were positive or negative peaks. In terms of Wikipedia attention, Romney’s infamous 47% comments in September 2012 were more than 1/3 as important as Obama’s inauguration in January 2009.

Now, let’s add some further curves to this graph: Obama’s and McCain’s Wikipedia attention during the last elections:

Obama vs. Romney 2012 compared to Obama vs. McCain 2008 (Wikipedia data)

Here’s another version with weekly data:

Obama vs. Romney 2012 compared to Obama vs. McCain 2008 (Wikipedia data, weeks)

It’s almost instantly clear how much more attention Obama’s 2008 campaign (in red) gathered in comparison with his 2012 campaign (in green). On the other hand, Mitt Romney is at least when it comes to Wikipedia attention more interesting than McCain had been.

Here’s a comparison of Obama’s 2008 campaign vs. his 2012 campaign:

Obama 2008 vs. Obama 2012 (Wikipedia data)

The last question: Is Mitt Romney 2012 as strong as Obama had been in 2008? Here’s a direct comparison:

Obama 2008 vs. Romney 2012 (Wikipedia data, weekly)

A side-remark: I also did a correlation of this data set with Google Correlate. And guess what: The strongest correlation of the data for Obama’s 2012 campaign is the Google search query for “barack obama wikipedia”. There still seem to be a huge number of people using Google as their Wikipedia search-engine.

Google Correlate result for the Wikipedia time series "Barack Obama" — Google Correlate result for the Wikipedia time series “Barack Obama”

But this result could also be interpreted the other way round: If there is a strong correlation between Wikipedia usage and Google search queries, this makes Wikipedia an even more important data source for analyses.

Twitter Germany will be based in Berlin – Taking a look at the numbers

What I really love about Twitter is that everything they do seems to be data-based. They’re so data-driven, they even analyze the ingredients of their lunch to ensure everyone at the company is living a healthy lifestyle. So, the decision for Berlin as their German headquarter cannot be a random or value-based decision. I bet, there’s been a lot of numbers crunching before announcing their new office. Let’s try and reverse-engineer this decision.

As a data basis I collected 4,377,832 tweets more or less randomly by connecting to the streaming API. Then I pulled all users mentioning one of the 30 leading German cities from Berlin to Aachen in their location field. Where there were Umlauts involved, I allowed for multiple variants, e.g. “Muenchen”, “Munchen” or “Munich” for “München”. Now I have 3,696 Twitter users from Germany that posted one or more tweets during the sample interval. That’s 0.08% of the original sample size. Although that’s not as much as I would have expected, let’s continue with the analysis.

The first interesting thing is the distribution of the Twitter users by their cities. Here’s the result:

One thing should immediately be clear from this chart: Only Berlin, Hamburg and Munich had a real chance of becoming Twitter’s German HQ. The other cities are just Twitter ghost towns. In the press, there had been some buzz about Cologne, but from these numbers, I’d say that could only have been desinformation or whishful thinking.

The next thing to look at is the influence of Twitter users in different German cities. Here’s a look at the follower data:

This does not help a lot. The distribution is heavily distorted by the outliers: Some Twitter users have a lot more followers than others. These Twitter users are marked by the black dots above the cities. But one thing is interesting: Berlin, Hamburg and Munich not only have the most Twitter users in our sample, but also the most and the highest outliers. With the outliers removed, the chart looks like this:

The chart not only shows the median number of followers, but also the distribution of the data. Berlin, that should be clear from this chart, is not the German city where the Twitter users with most followers hail from. This should be awarded to Bochum (355 followers), Nuremberg (258 followers) or Augsburg (243 followers). But these numbers are not very reliable as the number of cases is quite low for these cities. If we focus on the Big 3, then Berlin is leading with 223 followers, then Munich with 209 followers and finally Hamburg with 200 followers. But it’s a very close race.

Next up, the number of friends. Which German city is leading the average number of friends on Twitter?

This chart is also distorted by outliers, but here it’s different cities: The user in the sample who is following the largest number of friends is located in Bielefeld. Of all things! Now, let’s remove the outliers:

The cities with the larges average number of friends are: Bochum (again! 286 friends), Wiesbaden (224 friends) and Leipzig (208 friends). Our Big 3 are performing as follows: Berlin (183 friends), Hamburg (183 friends) and Munich (160 friends). Let’s take a look at the relation between followers and friends:

If we zoom in a bit on the data we can reproduce the “2000 phenomenon”:

There clearly is some kind of artificial barrier at 2,000 friends on Twitter. Accounts that have between 100 and 2,000 followers never follow more than 2,000 followers. Most frequently, they follow just a little below of 2,000 people. After they gathered 2,000 followers themselves, this barrier has been broken and the maximal number of friends seems to grow with the number of followers. There’s only speculation about this phenomenon, but one of the most convincing explanation is: We are looking at spam bots that are programmed to stay below 2,000 friends until they have gathered more than 2,000 followers. Maybe Twitter has some spam fighting algorithms that are focusing at the 2,000 line. Update: See explanation in the comments to this article: Behind this anomaly is Twitter’s spam-fighting barrier that only allows 2,000 friends up to 2,000 followers. Beyond this, the limit for the maximum number of friends is limited by the number of followers + 10%.

If those users are bots, then which city is bot capital? Let’s take a look at all Twitter users that have between 1,900 and 2,100 friends and segment them by city:

Again, Berlin is leading. But how do these numbers relate to the total numbers? Here’s the Bot Score for these cities: Berlin 2.3%, Hamburg 1.8% and Munich 1.2%. That’s one clear point for Munich.

Finally, let’s take a look at Twitter statuses in these cities. Where do the most active Twitter users tweet from? Here’s a look at the full picture including outliers:

The city with the most active Twitter user surprisingly is not Bochum or Berlin, but Düsseldorf. And also Stuttgart seems to be very hot in this regard. But to really learn about activity, we have to remove the outliers again:

Without outliers, the most active Twitter cities in Germany are: Bochum (again!! 5514 statuses), Karlsruhe (4973) and Augsburg (4254). The Big 3 are in the midfield: Berlin (2845), Munich (2717) and Hamburg (2638).

Finally, there’s always the content. What are the users in the Big 3 cities talking about? The most frequently twittered words do not differ very much. In all three cities, “RT” is the most important word followed by a lot of words like “in”, “the” or “ich” that don’t tell much about the topics. It is much more interesting to look at word pairs (and especially at the pairs with the highest point wise mutual information (PMI). In Berlin, people are talking about “neues Buch” (new book – it’s a city of literature), “gangbang erotik” (hmm) and “nasdaq dow” (financial information seem to be important). In Munich, it’s “reise reisen” (Munich seems to love traveling), “design products” (very design oriented city) and “prost bier” (it’s a cliche, but it seems to be true). Compare this with Hamburg’s “amazon preis” (people looking for low prices), “social media” (Hamburg has a lot of online agencies) and “dvd blueray” (people watching a lot of TV).

Wrapping up, here are the final results:

          Berlin Munich Hamburg
Users          3      1       2
Followers      3      2       1
Friends        2      1       2
Bots          -3     -1      -2
Statuses       3      2       1
TOTAL          8      5       4

Congrats to Berlin!

[The R code that generated all the charts above can be found on my github.]