What I really love about Twitter is that everything they do seems to be data-based. They’re so data-driven, they even analyze the ingredients of their lunch to ensure everyone at the company is living a healthy lifestyle. So, choosing Berlin as their German headquarters cannot have been a random or value-based decision. I bet there was a lot of number crunching before they announced their new office. Let’s try and reverse-engineer this decision.

As a data basis, I collected 4,377,832 tweets more or less randomly by connecting to the streaming API. Then I pulled all users who mention one of the 30 leading German cities, from Berlin to Aachen, in their location field. Where umlauts were involved, I allowed for multiple variants, e.g. “Muenchen”, “Munchen” or “Munich” for “München”. That leaves 3,696 Twitter users from Germany who posted one or more tweets during the sample interval, or 0.08% of the original sample size. Although that’s not as many as I would have expected, let’s continue with the analysis.
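To make the matching step concrete, here is a minimal Python sketch of how a profile's location field could be mapped to a city while tolerating spelling variants. It is not the code behind the charts: the `CITY_VARIANTS` table and the `match_city` function are hypothetical names, and only the München variants are taken from the text.

```python
import re

# Hypothetical variant table. Only the München variants come from the text;
# a real table would cover all 30 cities.
CITY_VARIANTS = {
    "Berlin": ["berlin"],
    "Hamburg": ["hamburg"],
    "München": ["münchen", "muenchen", "munchen", "munich"],
}


def match_city(location):
    """Return the canonical city name found in a location string, or None."""
    if not location:
        return None
    text = location.casefold()
    for city, variants in CITY_VARIANTS.items():
        for variant in variants:
            # Whole-word match, so "Hamburger" does not count as "Hamburg".
            if re.search(r"\b" + re.escape(variant) + r"\b", text):
                return city
    return None
```

Matching on whole words is a design choice in this sketch. A plain substring test would be simpler, but it would also pick up unrelated words that merely contain a city name.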

The first interesting thing is the distribution of the Twitter users by their cities. Here’s the result:

Twitter users by city

One thing should immediately be clear from this chart: Only Berlin, Hamburg and Munich had a real chance of becoming Twitter’s German HQ. The other cities are just Twitter ghost towns. In the press, there had been some buzz about Cologne, but from these numbers, I’d say that could only have been disinformation or wishful thinking.

The next thing to look at is the influence of Twitter users in different German cities. Here’s a look at the follower data:

Average numbers of followers by city

This does not help a lot. The distribution is heavily distorted by the outliers: Some Twitter users have a lot more followers than others. These Twitter users are marked by the black dots above the cities. But one thing is interesting: Berlin, Hamburg and Munich not only have the most Twitter users in our sample, but also the most and the highest outliers. With the outliers removed, the chart looks like this:

Average number of followers by city

The chart shows not only the median number of followers, but also the distribution of the data. It should be clear from this chart that Berlin is not the German city whose Twitter users have the most followers. That title goes to Bochum (355 followers), Nuremberg (258 followers) or Augsburg (243 followers). But these numbers are not very reliable, as the number of cases is quite low for these cities. If we focus on the Big 3, Berlin leads with 223 followers, followed by Munich with 209 and Hamburg with 200. But it’s a very close race.
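For readers who want to reproduce the "outliers removed" view, here is a minimal Python sketch. It assumes the common box plot convention, where a value counts as an outlier if it lies more than 1.5 times the interquartile range outside the quartiles. The text does not say which rule was actually used, and the function names are hypothetical.

```python
from statistics import median, quantiles


def drop_outliers(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR] (box plot convention, assumed)."""
    if len(values) < 4:
        # Too few cases to estimate quartiles sensibly; keep everything.
        return list(values)
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


def median_by_city(followers_by_city):
    """Median follower count per city after dropping outliers."""
    return {
        city: median(drop_outliers(counts))
        for city, counts in followers_by_city.items()
    }
```

With only a handful of users per city, both the quartiles and the median are unstable, which is the reason the numbers for the smaller cities should be read with caution.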

Next up, the number of friends. Which German city leads in the average number of friends on Twitter?

Average number of friends by city

This chart is also distorted by outliers, but here it’s different cities: The user in the sample who is following the largest number of friends is located in Bielefeld. Of all things! Now, let’s remove the outliers:

Average number of friends by city

The cities with the largest average number of friends are: Bochum (again! 286 friends), Wiesbaden (224 friends) and Leipzig (208 friends). Our Big 3 are performing as follows: Berlin (183 friends), Hamburg (183 friends) and Munich (160 friends). Let’s take a look at the relation between followers and friends:

Followers x Friends

If we zoom in a bit on the data, we can reproduce the “2000 phenomenon”:

2000 phenomenon

There clearly is some kind of artificial barrier at 2,000 friends on Twitter. Accounts that have between 100 and 2,000 followers never follow more than 2,000 accounts. Most frequently, they follow just under 2,000 people. Once they have gathered 2,000 followers themselves, the barrier is broken and the maximum number of friends seems to grow with the number of followers. There’s only speculation about this phenomenon, but one of the most convincing explanations is: We are looking at spam bots that are programmed to stay below 2,000 friends until they have gathered more than 2,000 followers. Maybe Twitter has some spam-fighting algorithms that focus on the 2,000 line. Update: See the explanation in the comments to this article: Behind this anomaly is Twitter’s spam-fighting barrier, which only allows 2,000 friends up to 2,000 followers. Beyond this, the maximum number of friends is capped at the number of followers + 10%.
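The rule from the update can be written down as a small function. This is a sketch of the rule as described in the comments, not an official Twitter specification: the name `max_friends` is hypothetical, and the exact crossover point (around 1,820 followers, according to the comment) is an assumption.

```python
def max_friends(followers):
    """Upper bound on friends under the rule described in the comments (assumed).

    Everyone may follow up to 2,000 accounts. Beyond that, the cap is the
    follower count plus 10%, computed here with integer arithmetic.
    """
    return max(2000, followers * 11 // 10)


def is_allowed(followers, friends):
    """True if a followers/friends combination respects the assumed cap."""
    return friends <= max_friends(followers)
```

Under this rule, an account with 5,000 followers could follow up to 5,500 people, which matches the example given in the comments. It would also explain why so many accounts pile up just below the 2,000 line without any of them having to be bots.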

If those users are bots, then which city is bot capital? Let’s take a look at all Twitter users that have between 1,900 and 2,100 friends and segment them by city:

Twitter users by city

Again, Berlin is leading. But how do these numbers relate to the total numbers? Here’s the Bot Score for these cities: Berlin 2.3%, Hamburg 1.8% and Munich 1.2%. That’s one clear point for Munich.
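The Bot Score is simply the share of a city's users whose friend count falls between 1,900 and 2,100. Here is a minimal Python sketch, assuming the users are available as `(city, friends)` pairs. That input format and the name `bot_scores` are hypothetical.

```python
from collections import Counter


def bot_scores(users, low=1900, high=2100):
    """Share of users per city whose friend count lies in [low, high].

    `users` is an iterable of (city, friends) pairs (hypothetical format).
    """
    totals = Counter()
    suspects = Counter()
    for city, friends in users:
        totals[city] += 1
        if low <= friends <= high:
            suspects[city] += 1
    return {city: suspects[city] / totals[city] for city in totals}
```

Multiplying the result by 100 gives percentages like the ones quoted above. Note that the score only flags accounts near the 2,000 line. It says nothing about whether those accounts really are bots.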

Next, let’s take a look at Twitter statuses in these cities. Where do the most active Twitter users tweet from? Here’s a look at the full picture including outliers:

Average number of statuses by city

Surprisingly, the city with the most active Twitter user is not Bochum or Berlin, but Düsseldorf. Stuttgart also seems to be very active in this regard. But to really learn about activity, we have to remove the outliers again:

Average number of statuses by city

Without outliers, the most active Twitter cities in Germany are: Bochum (again!! 5514 statuses), Karlsruhe (4973) and Augsburg (4254). The Big 3 are in the midfield: Berlin (2845), Munich (2717) and Hamburg (2638).

Finally, there’s always the content. What are the users in the Big 3 cities talking about? The most frequently tweeted words do not differ very much. In all three cities, “RT” is the most important word, followed by a lot of words like “in”, “the” or “ich” that don’t tell much about the topics. It is much more interesting to look at word pairs, especially the pairs with the highest pointwise mutual information (PMI). In Berlin, people are talking about “neues Buch” (new book – it’s a city of literature), “gangbang erotik” (hmm) and “nasdaq dow” (financial information seems to be important). In Munich, it’s “reise reisen” (Munich seems to love traveling), “design products” (very design-oriented city) and “prost bier” (it’s a cliché, but it seems to be true). Compare this with Hamburg’s “amazon preis” (people looking for low prices), “social media” (Hamburg has a lot of online agencies) and “dvd blueray” (people watching a lot of TV).
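For reference, the PMI of a word pair compares how often the two words appear next to each other with how often that would happen if they were independent: PMI(a, b) = log2(p(a, b) / (p(a) · p(b))). Here is a minimal Python sketch. The whitespace tokenization, the `min_count` threshold and the name `bigram_pmi` are assumptions, not the method used for the results above.

```python
import math
from collections import Counter


def bigram_pmi(tweets, min_count=1):
    """PMI for adjacent word pairs: log2(p(a, b) / (p(a) * p(b)))."""
    unigrams = Counter()
    bigrams = Counter()
    for tweet in tweets:
        tokens = tweet.lower().split()  # naive whitespace tokenization (assumed)
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue  # skip rare pairs, whose PMI is unreliable
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores
```

PMI tends to favor rare pairs, since two words that each occur only once, and together, get a very high score. On real data, a higher `min_count` keeps such one-off pairs out of the ranking.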

Wrapping up, here are the final results:

          Berlin Munich Hamburg
Users          3      1       2
Followers      3      2       1
Friends        2      1       2
Bots          -3     -1      -2
Statuses       3      2       1
TOTAL          8      5       4
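
The totals are just the column sums of the table. As a quick check, here is a short Python sketch with the points copied from the table above. The names `SCORES` and `totals` are hypothetical.

```python
# Points per category, copied from the table above.
SCORES = {
    "Users": {"Berlin": 3, "Munich": 1, "Hamburg": 2},
    "Followers": {"Berlin": 3, "Munich": 2, "Hamburg": 1},
    "Friends": {"Berlin": 2, "Munich": 1, "Hamburg": 2},
    "Bots": {"Berlin": -3, "Munich": -1, "Hamburg": -2},
    "Statuses": {"Berlin": 3, "Munich": 2, "Hamburg": 1},
}


def totals(scores):
    """Sum the points per city across all categories."""
    result = {}
    for row in scores.values():
        for city, points in row.items():
            result[city] = result.get(city, 0) + points
    return result
```

Since the bot points are subtracted, Berlin's lead would be even larger without that category.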

Congrats to Berlin!

[The R code that generated all the charts above can be found on my github.]

 

17 Responses to Twitter Germany will be based in Berlin – Taking a look at the numbers

  1. drikkes says:

    If it were purely about the numbers, Twitter would have to give all of Germany a wide berth. http://gigaom.com/2012/03/27/what-does-twitter-want-with-germany/

  2. At first I thought that that data can’t be valid. But looking at the authors I better shut up and simply remain surprised about the outcome.

  3. Benedikt Koehler says:

    @Hendrik Actually, with the streaming API we do not know which selection of tweets we get. It should deliver ~1% of all public statuses but only displays statuses that have passed a “quality filter”. See here for the sampling procedure: https://dev.twitter.com/docs/streaming-api/concepts

  4. uelzer says:

    Well, it seems to me your basic argument is: Twitter is moving to Berlin because that’s where most Twitter users live. That’s not a surprise looking at the population of the cities.
    berlin 3.5 mio
    hamburg 1.6 mio
    munich 0.8 mio

    http://en.wikipedia.org/wiki/List_of_cities_in_Germany_by_population

  5. Mitch says:

    Wow… really superbly worked out… My respect :)
    Definitely deserves a trackback!

  6. Benedikt Koehler says:

    @uelzer I just calculated the regressions: There is no significant relation between the size of the German cities and the number of followers, friends or statuses of the Twitter users in the cities.

  7. Maybe I’m blind and missed it, but I can’t find a mention of you looking at actual geo data, only mentions of certain words. Is that correct?

    As a mention doesn’t have to correlate with actual location of the twitterer, did you look at where those tweets originated from (explicit geolocation) and where those twitter profiles are registered as at (accounting for national, international and local city name would be interesting too) and see if those numbers align?

  8. Mahnny says:

    @uelzer, you quoted the population numbers of 1950.

    I think many users give Berlin, Hamburg or Munich as their location even though they only live _near_ these cities.

    And finally: No statistics is valid that doesn’t include the city of Erlangen :-)

  9. Nice charts and a logical approach! But I’m missing the qualitative aspect here. Did you really categorize all users with between 1,900 and 2,100 friends as bots? I can follow the reasoning, but at work a colleague with 2,000 friends sits across from me, and she is real :)

    But I guess that’s just how it is with big data …

    I’m looking forward to more analyses!

  10. Benedikt Koehler says:

    @Vidar No, I just looked at the location and description fields of the Twitter profiles. Only 1% of the Twitter users that had one of the cities in their location fields have enabled geo-information in their tweets. The relation of people who have Berlin as their geo-location to the people who have it written in their profiles is 1:36.

  11. Just for fun: How would the #Pott perform if you added up Bochum, Essen, Duisburg and Dortmund? You may even add Duesseldorf, although neither people from Duesseldorf nor from the Pott would like to be lumped together. :-)

  12. … the 2,000 follower barrier:
    It is because of the many spammers that Twitter introduced a barrier at 2,000 following. Twitter only allows you to follow 2,000 people. After this, you need more people following you. If you reach 1,820 followers, you can follow 2,001 people; similarly, 5,000 followers = 5,500 following. It is a 10% rule.

    Thanks for the great analysis,
    Dietmar

    I found some more new information. Twitter allows you to follow at most 2,001 people. Only once you have 1,820 followers yourself are you allowed to follow more. A 10% limit applies here: 1,820 followers = 2,001 following, 5,000 followers = 5,500 following. So you should not fall below the 10% margin.

  13. Benedikt Koehler says:

    @Dietmar Thanks a lot! I updated the paragraph in the article.

  14. @Benedikt Thanks for the clarification! Interesting insights. That’s a very low number. It would be interesting to see how it compares internationally.

  15. OliverG says:

    Bielefeld is a Joke by veteran usenetters, right? Or did all CCC-Members claim they were in Iran… ehm Bielefeld?

    Something like this ;)

  16. Klaus Jan says:

    (late comment ;-)
    thx for this nice introduction to big data. But I don’t think Twitter’s decision is based on those data – it’s just the popularity of Berlin in the English-speaking world. Berlin has a place in pop culture, and David Bowie and Leonard Cohen did more for that than M. Wowereit. That plays a role in a business with such a high affinity for pop culture. Sure, the data are reflecting that….
    Cologne is only the funniest place in an area with tons of middle sized colorless cities + a geographical setting in the heart of Europe.
