What to expect from Strata Conference 2015? An empirical outlook.

In one week, the 2015 edition of Strata Conference (or rather: Strata + Hadoop World) will open its doors to data scientists and big data practitioners from all over the world. What will be the most important big data technology trends this year? As I did last year, I ran an analysis on the Strata abstracts for 2015 and compared them to those of previous years.

One thing stands out immediately: 2015 will probably be known as the “Spark Strata”:


If you compare mentions of the major programming languages in data science, there’s another interesting finding: R seems to be making a comeback, and Python may be losing some of its momentum:


R is also among the rising topics if you look at the word frequencies for 2015 and 2014:


Now, let’s take a look at bigrams that have been gaining a lot of traction since the last Strata conference. From the following table, we could expect a lot more case studies than in the previous years:


This analysis was done with IPython and Pandas. See the approach in this notebook.
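The notebook itself is in Python. To give a rough idea of the kind of counting involved, here is a minimal sketch in R (the language used in the rest of this blog). The `abstracts` data frame and the `mention_share()` helper are hypothetical stand-ins, not the real Strata data or the notebook's code:

```r
# Hypothetical toy corpus: one row per abstract (not the real Strata data)
abstracts <- data.frame(
  year = c(2014, 2014, 2014, 2015, 2015, 2015),
  text = c("Scaling Hadoop clusters with Python",
           "Machine learning in R",
           "Hadoop at scale",
           "Streaming analytics with Spark",
           "Spark and R for data science",
           "A Hadoop case study"),
  stringsAsFactors = FALSE
)

# Share of abstracts per year that mention a term as a whole word
mention_share <- function(df, term) {
  pattern <- paste0("\\b", term, "\\b")
  hits <- grepl(pattern, df$text, ignore.case = TRUE, perl = TRUE)
  tapply(hits, df$year, mean)
}

spark_share <- mention_share(abstracts, "spark")
r_share <- mention_share(abstracts, "r")
```

Comparing shares instead of raw counts keeps the years comparable even if the number of abstracts differs from one year to the next.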

Looking forward to meeting you all at Strata Conference next week! I’ll be around all three days and always up for a chat on data science.

How to analyze smartphone sensor data with R and the BreakoutDetection package

Yesterday, Jörg wrote a blog post on Data Storytelling with Smartphone sensor data. Here’s a practical approach to analyzing smartphone sensor data with R. In this example I will be using the smartphone accelerometer data that Datarella provided in its Data Fiction competition. The dataset shows the acceleration along the three axes of the smartphone:

  • x – sideways acceleration of the device
  • y – forward and backward acceleration of the device
  • z – acceleration up and down

The interpretation of these values can be quite tricky. First, there are manufacturer-, device- and sensor-specific variations and artifacts. Second, all acceleration is measured relative to the orientation of the sensor in the device. So, for example, the activity of taking the smartphone out of your pocket and reading a tweet can look like this:

  • y acceleration – the smartphone had been in the pocket top down and is now taken out of the pocket
  • z and y acceleration – turning the smartphone so that it is horizontal
  • x acceleration – moving the smartphone from the left to the middle of your body
  • z acceleration – lifting the smartphone so you can read the fine print of the tweet

And third, there is gravity influencing all the movements.
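One way to see the gravity component is to look at the magnitude of the acceleration vector, which does not depend on how the device is oriented. For a device at rest it should be close to 9.8 (m/s², assuming the usual accelerometer units). A minimal sketch, using the first three readings from the dataset excerpt shown further down in this post:

```r
# First three readings from the dataset excerpt shown further down in this post
x <- c(-0.06703765, -0.05746084, -0.04788403)
y <- c( 0.05746084,  0.10534488,  0.03830723)
z <- c( 9.615114,    9.576807,    9.605537)

# Length of the acceleration vector: orientation-independent
magnitude <- sqrt(x^2 + y^2 + z^2)
magnitude
```

Values that stay near this level suggest the phone is not being moved; larger deviations point to actual movement on top of gravity.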

So, finding out what you are really doing with your smartphone can be quite challenging. In this blog post, I will show how to do one small task: identifying breakpoints in the dataset. As a nice side effect, I use this opportunity to introduce an application of the Twitter BreakoutDetection Open Source library (see GitHub) that can be used for Behavioral Change Point analysis.
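To make the idea of a breakpoint concrete, here is a toy sketch in base R. It is a much simpler stand-in than what the BreakoutDetection package does: it only looks for a single shift in the mean by trying every split and keeping the one with the smallest squared error. The `signal` vector and the `best_split()` helper are hypothetical:

```r
# Hypothetical toy signal: level 0 for 20 samples, then level 5 for 20 samples,
# with a small alternating wiggle on top
signal <- c(rep(0, 20), rep(5, 20)) + rep(c(-0.1, 0.1), 20)

# Find the split that minimises the summed squared deviation
# from the mean of the left and the right segment
best_split <- function(v, min_size = 5) {
  n <- length(v)
  candidates <- min_size:(n - min_size)
  cost <- sapply(candidates, function(k) {
    left <- v[1:k]
    right <- v[(k + 1):n]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  candidates[which.min(cost)]
}

split_at <- best_split(signal)   # index of the last sample before the shift
```

The `min_size` argument plays a similar role to the minimum segment length you pass to a real change point library: it stops the search from declaring a breakpoint after just one or two samples.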

First, I load the dataset and take a look at it:

accel <- read.csv("SensorAccelerometer.csv", stringsAsFactors=FALSE)
head(accel)

  user_id           x          y        z                 updated_at                 type
1      88 -0.06703765 0.05746084 9.615114 2014-05-09 17:56:21.552521 Probe::Accelerometer
2      88 -0.05746084 0.10534488 9.576807 2014-05-09 17:56:22.139066 Probe::Accelerometer
3      88 -0.04788403 0.03830723 9.605537 2014-05-09 17:56:22.754616 Probe::Accelerometer
4      88 -0.01915361 0.04788403 9.567230 2014-05-09 17:56:23.372244 Probe::Accelerometer
5      88 -0.06703765 0.08619126 9.615114 2014-05-09 17:56:23.977817 Probe::Accelerometer
6      88 -0.04788403 0.07661445 9.595961  2014-05-09 17:56:24.53004 Probe::Accelerometer

This is the sensor data for one user on one day:

library(ggplot2)
accel$day <- substr(accel$updated_at, 1, 10) # Date part of the timestamp string
df <- accel[accel$day == '2014-05-12' & accel$user_id == 88,]
df$timestamp <- as.POSIXct(df$updated_at) # Transform to POSIX datetime (POSIXct works well in data frames)
ggplot(df) + geom_line(aes(timestamp, x, color="x")) + 
             geom_line(aes(timestamp, y, color="y")) + 
             geom_line(aes(timestamp, z, color="z")) + 
             scale_x_datetime() + xlab("Time") + ylab("acceleration")


Let’s zoom in to the period between 12:32 and 13:00:

ggplot(df[df$timestamp >= '2014-05-12 12:32:00' & df$timestamp < '2014-05-12 13:00:00',]) +
  geom_line(aes(timestamp, x, color="x")) + 
  geom_line(aes(timestamp, y, color="y")) + 
  geom_line(aes(timestamp, z, color="z")) + 
  scale_x_datetime() + xlab("Time") + ylab("acceleration")


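Then, I load the BreakoutDetection library and run it on the x acceleration for the first three minutes of this period:

library(BreakoutDetection)
bo <- breakout(df$x[df$timestamp >= '2014-05-12 12:32:00' & df$timestamp < '2014-05-12 12:35:00'], 
               min.size=10, method='multi', beta=.001, degree=1, plot=TRUE)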


This quick analysis of the acceleration in the x direction gives us 4 change points, where the acceleration suddenly changes. At the beginning, the smartphone seems to lie flat on a horizontal surface: the z axis reads around 9.8 in the positive direction, which means that gravity affects only this axis and not the x and y axes. But then things change, and after a few movements (our change points) the last observation has the smartphone in a position where the x axis reads around -9.6, i.e. the smartphone is being held in landscape orientation pointing to the right.
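The reasoning above (whichever axis carries roughly ±9.8 is the one pointing along gravity) can be sketched as a small helper. The function `dominant_axis()` is hypothetical and only meant for readings taken while the device is more or less at rest:

```r
# Hypothetical helper: name the axis that carries most of the gravity reading
dominant_axis <- function(x, y, z) {
  readings <- c(x = x, y = y, z = z)
  axis <- names(which.max(abs(readings)))
  sign_label <- if (readings[[axis]] >= 0) "+" else "-"
  paste0(sign_label, axis)
}

flat <- dominant_axis(0, 0, 9.8)        # flat on a table, as at the start of the window
landscape <- dominant_axis(-9.6, 0, 0)  # held in landscape, as at the end of the window
```

Applied to the mean readings of each segment between two change points, something like this would give a rough orientation label per segment.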

Anomaly Detection with Wikipedia Page View Data

Today, the Twitter engineering team released another very interesting Open Source R package for working with time series data: “AnomalyDetection”. This package uses the Seasonal Hybrid ESD (S-H-ESD) algorithm to identify local anomalies (= variations inside seasonal patterns) and global anomalies (= variations that cannot be explained with seasonal patterns).
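To get a feeling for the idea behind this, here is a heavily simplified sketch in base R. It is not the S-H-ESD algorithm: it estimates the seasonal pattern with a plain per-weekday median and flags residuals with a fixed robust z-score threshold (median and MAD) instead of a proper ESD test. The `counts` series and the `robust_anomalies()` helper are hypothetical:

```r
# Hypothetical toy series: 8 weeks of daily counts with a weekly pattern,
# a small week-to-week offset, and one injected spike
weekly <- c(100, 102, 101, 99, 98, 80, 78)            # weekdays vs. weekend
week_offset <- c(-1, 1, 0, 1, -1, 0, 1, -1)           # one offset per week
counts <- rep(weekly, times = 8) + rep(week_offset, each = 7)
counts[30] <- 160                                     # injected anomaly

robust_anomalies <- function(x, period, threshold = 3) {
  pos <- (seq_along(x) - 1) %% period          # position within the period
  seasonal <- ave(x, pos, FUN = median)        # per-position median as seasonal part
  resid <- x - seasonal                        # what the seasonal pattern cannot explain
  score <- abs(resid - median(resid)) / mad(resid)
  which(score > threshold)                     # positions of flagged values
}

flagged <- robust_anomalies(counts, period = 7)
```

Using the median and MAD instead of mean and standard deviation matters here: a single large spike would otherwise inflate the very statistics that are supposed to detect it.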

As a kind of warm-up and practical exploration of the new package, here’s a short example of how to download Wikipedia PageView statistics and mine them for anomalies (inspired by this blog post, which was written before this package was available):

First, we install and load the necessary packages:
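A plausible setup for the code that follows, as a sketch under assumptions: `getURL()` is taken from RCurl and `fromJSON()` from RJSONIO (the JSON package is a guess on my part; other JSON packages may return the data in a slightly different shape), and AnomalyDetection is installed from Twitter's GitHub repository via devtools:

```r
# Assumed package set; the JSON package in particular is a guess
# install.packages(c("RCurl", "RJSONIO", "ggplot2", "devtools"))
# devtools::install_github("twitter/AnomalyDetection")

library(RCurl)             # getURL()
library(RJSONIO)           # fromJSON()
library(ggplot2)           # plotting
library(AnomalyDetection)  # AnomalyDetectionTs(), AnomalyDetectionVec()
```

The install lines are commented out so that they only need to be run once.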


Then we choose an interesting Wikipedia page and download the last 90 days of PageView statistics:

page <- "USA"
raw_data <- getURL(paste("http://stats.grok.se/json/en/latest90/", page, sep=""))
data <- fromJSON(raw_data)
views <- data.frame(timestamp=paste(names(data$daily_views), " 12:00:00", sep=""), stringsAsFactors=FALSE)
views$count <- data$daily_views
views$timestamp <- as.POSIXct(views$timestamp) # Transform to POSIX datetime (POSIXct works well in data frames)
views <- views[order(views$timestamp),]

The code above also does some pre-processing and converts the dates to POSIX datetime format. A first plot shows this pattern:

ggplot(views, aes(timestamp, count)) + geom_line() + scale_x_datetime() + xlab("") + ylab("views")


Now, let’s look for anomalies. The usual way would be to feed a dataframe with a date-time and a value column into the AnomalyDetection function AnomalyDetectionTs(). But in this case, this doesn’t work because our data is too coarse: the function doesn’t seem to work with daily data. So, we use the more generic function AnomalyDetectionVec() that just needs the values and some definition of a period. In this case, the period is 7 (= 7 days for one week):

res <- AnomalyDetectionVec(views$count, max_anoms=0.05, direction='both', plot=TRUE, period=7)


In our case, the algorithm has discovered 4 anomalies. The first, on October 30 2014, is an exceptionally high value overall; the second is a very high Sunday; the third is a high value overall; and the fourth is a high Saturday (normally, this day is also quite weak).
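To check which weekday an anomaly falls on, the positions returned by the vector-based function have to be mapped back to dates. A minimal sketch with stand-in data: the `days` sequence and the `anomaly_index` positions are made up for illustration, and with the real result I would expect the positions to be available as `res$anoms$index` (an assumption about the return value, so please check it against the package documentation):

```r
# Hypothetical stand-ins: 90 consecutive days and three made-up anomaly positions
days <- seq(as.Date("2014-10-08"), by = "day", length.out = 90)
anomaly_index <- c(23, 26, 60)

# ISO weekday number (1 = Monday ... 7 = Sunday), independent of the locale
anomaly_days <- data.frame(
  date = days[anomaly_index],
  iso_weekday = as.integer(format(days[anomaly_index], "%u"))
)
anomaly_days
```

With the real data, `views$timestamp` would take the place of `days`, since the counts were passed to the function in that order.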