conference – Beautiful Data

What to expect from Strata Conference 2015? An empirical outlook.

In one week, the 2015 edition of Strata Conference (or rather: Strata + Hadoop World) will open its doors to data scientists and big data practitioners from all over the world. What will be the most important big data technology trends for this year? As last year, I ran an analysis on the Strata abstract for 2015 and compared them to the previous years.

One thing immediately strikes: 2015 will be probably known as the “Spark Strata”:

If you compare mentions of the major programming languages in data science, there’s another interesting find: R seems to have a comeback and Python may be losing some of its momentum:

R is also among the rising topics if you look at the word frequencies for 2015 and 2014:

Now, let’s take a look at bigrams that have been gaining a lot of traction since the last Strata conference. From the following table, we could expect a lot more case studies than in the previous years:

This analysis has been done with IPython and Pandas. See the approach in this notebook.

Looking forward to meeting you all at Strata Conference next week! I’ll be around all three days and always in for a chat on data science.

Coolhunting like a Streetfighter

One of the most exciting applications of Social Media data is the automated identification, evaluation and prediction of trends. I already sketched some ideas in this blog post. Last year – and this was one of my personal highlights – I had the opportunity to speak at the PyData 2014 Berlin on the topic of Street Fighting Trend Research.

In my talk I presented some more general thoughts on trend research (or “coolhunting” as it is called nowadays) on the Internet. But at the core were three examples on how to identify research trends from the web (see this blogpost), how to mine conference proposals (see this analysis of Strata abstracts) and how to identify trending locations on Foursquare (see here). All three examples are also available as IPython Notebooks on my Github page. And here’s the recorded version of the talk.

The PyData conference was one of the best conferences I attended. Not only were the topics very diverse – ranging from GPU optimization to the representation of women in the PyData community – but also the people attending the conference were coming from different backgrounds: lawyers, engineers, physicists, computer scientists (of course) or statisticians. But still, with every talk and every conversation in the hallways, you could feel the wild euphoria connecting us all with the programming language and the incredible curiosity.

Data begs to be used

The Sword and Shield - this was the metaphor for intelligence agencies in the Soviet world. For me, this is much stronger than NSA's key-clutching eagle. We should rather shield things we care for, and fight for our beliefs than just lock pick into other peoples lives. — The Sword and Shield – the metaphor for intelligence in the Soviet world, much stronger than NSA’s key-clutching eagle. Let us fight for our cause with our fiery swords of spirit and blind them with our armour of bright data.

“The NSA is basically applied data science.”
Jason Cust

The European Court of Justice declared the EU directive on data retention void the same day #hartbleed caused the grosset panic about password security the Net had seen so far.

This tells one story:
Get out into the open! Stop hiding! There are no remote places. No privileges will keep your data from the public. No genius open source hack will protect your informational self determination. And spooks, listen: your voyerism is not accepted as OK, not even by the law.

We’re living in a world with more transparency, we need to learn to do intelligence with more transparency too.
Bruce Schneier, #unowned

I was fighting against Europe’s data retention directive, too, and of course I was feeling victorious when it was declared void by the ECJ. But, don’t we know how futile all efforts are, to keep data protected? No cypher can be unbreakable, except the one-time-pad (and that is just of academic interest). No program-code can be proven safe, either – these are mathematical facts. Haven’t you learned how wrong all promisses of security are by #heartbleed? Data protection lulls us into feeling save from data breeches, where we should rather care to make things robust, no matter if the data becomes public intentionally or not.

The far more important battle was won the same week, when the European Parliament decided to protect net neutrality by law. This is the real data protection for me: protecting the means of data production from being engrossed in private.

The intelligence agencies’ damaging the Net by undermining the trust of its users in the integrity of its technology, is a serious thing by inself. However burrying this scandal just under the clutter of civil rights or under constitutional law’s aspects will not do justice. It is of our living in societies in transition from the modern nation state to the after-modern … what ever, liquid community; into McLuhan’s Global Village which is much more serious and much more interesting. We will have to go through the changes of the world and it might not be nice all the time until we come out the other end.

Data begs to be used.
Bruce Schneier, #unowned

In “Snow Crash”, Neal Stephenson imagines a Central Intelligence Corporation, the CIA and NSA becoming a commercial service where everybody just purchases the information they’d need. This was, what came immediately to my mind when I read through the transcript of the talk on “Intelligence Gathering and the Unowned Internet” that was held by the Berkman Center for Internet and Society at Harvard, starring Bruce Schneier, who for every Mathematician in my generation is just the godfather of cryptography. Representatives of the intelligence community were also present. Bruce has been arguing for living “beyond fear” for more than a decade, advocating openness instead of digging trenches and winding up barbed wire. I am convinced, that information does not want to be free (as many of my comrades in arms tend to phrase). However I strongly belief Bruce is right: We can hear data’s call.

The Quantified Self, life-logging, self-tracking – many people making public even their most intimate data, like health, mood, or even our visual field of sight at any moment of our day. An increasingly strong urge to do this rises from social responsibility – health care, climate research, but also from quite profane uses like insurances offering discounts when you let your driving be tracked. So the data is there now, about everything there is to know. Not using it because of fear for privacy would be like having abandond the steam engine and abstinate from the industrialization in anticipation of the climate change.

Speak with us, don’t speak for us.
statement of autonomy of #OccupyWallStreet

Dinosaurs of modern warfare, like the NSA are already escamotating into dragons, fighting their death-match like Smaug at Lake Esgaroth. We have to deal with the dragon, kill it, but this is not, what we should set our political goal to. It will happen anyway.

We should take provision for what real changes we are going to face. We will find ourselves in a form of society that McLuhan would have called re-tribalized, or Tönnies would not have called society at all, but community. In a community, the concept of privacy is usually rather weak. But at the same time, there is no surveillance, no panoptic elevated watchmen who themself cannot be watched. Everyone is every other one’s keeper. Communal vigilance doesn’t sound like fun. However it will be the consequence of after-modern communal structures replacing the modern society. “In the electric world, we wear all mankind as our skin.”, so McLuhan speaketh.

Noi vogliamo cantare l’amor del pericolo.
M5S

Every aspect of life gets quantized, datarized, tangible by computers. Our aggregated models of human behavior get replaced by exact description of the single person’s life and thus the predictions of that person’s future actions. We witness the rise of non-representative democratic forms like occupy assemblies or liquid democracy. What room can privacy have in a world like this whatsoever? So maybe we should concentrate in shaping our data-laden future, rather then protecting a fantasy of data being contained in some “virtual reality” that could be kept separate from our lives.

The data calls us from the depth; let us hear to its voice!

Text-Mining the DLD Conference 2014

Once a year, the cosmopolitan digital avantgarde gathers in Munich to listen to keynotes on topics all the way from underground gardening to digital publishing at the DLD, hosted by Hubert Burda. In the last years, I did look at the event from a network analytical perspective. This year, I am analyzing the content, people were posting on Twitter in order to make comparisons to last years’ events and the most important trends right now.

To do this in the spirit of Street Fighting Trend Research, I limited myself to openly available free tools to do the analysis. The data-gathering part was done in the Google Drive platform with the help of Martin Hawksey’s wonderful TAGS script that is collecting all the tweets (or almost all) to a chosen hashtag or keyword such as “#DLD14” or “#DLD” in this case. Of course, there can be minor outages in the access to the search API, that appear as zero lines in the data – but that’s not different to data-collection e.g. in nanophysics and could be reframed as adding an extra challenge to the work of the data scientist 😉 The resulting timeline of Tweets during the 3 DLD days from Sunday to Tuesday looks like this:

You can clearly see three spikes for the conference days, the Monday spike being a bit higher than the first. Also, there is a slight decline during lunch time – so there doesn’t seem to be a lot food tweeting at the conference. To produce this chart (in IPython Notebook) I transformed the Twitter data to TimeSeries objects and carefully de-duplicated the data. In the next step, I time shifted the 2013 data to find out how the buzz levels differed between last years’ and this years’ event (unfortunately, I only have data for the first two days of DLD 2013.

The similarity of the two curves is fascinating, isn’t it? Although there still are minor differences: DLD14 began somewhat earlier, had a small spike at midnight (the blogger meeting perhaps) and the second day was somewhat busier than at DLD13. But still, not only the relative, but also the absolute numbers were almost identical.

Now, let’s take a look at the devices used for sending Tweets from the event. Especially interesting is the relation between this years’ and last years’ percentages to see which devices are trending right now:

The message is clear: mobile clients are on the rise. Twitter for Android has almost doubled its share between 2013 and 2014, but Twitter for iPad and iPhone have also gained a lot of traction. The biggest losers is the regular Twitter web site dropping from 39 per cent of all Tweets to only 22 per cent.

The most important trending word is “DLD14”, but this is not surprising. But the other trending words allow deeper insights into the discussions at the DLD: This event was about a lot of money (Jimmy Wales billion dollar donation), Google, Munich and of course the mobile internet:

Compare this with the top words for DLD 2013:

Wait – “sex” among the 25 most important words at this conference? To find out what’s behind this story, I analyzed the most frequently used bigrams or word combinations in 2013 and 2014:

With a little background knowledge, it clearly shows that 2013’s “sex” is originating from a DJ Patil quote comparing “Big Data” (the no. 1 bigram) with “Teenage Sex”. You can also find this quotation appearing in Spanish fragments. Other bigrams that were defining the 2013 DLD were New York (Times) and (Arthur) Sulzberger, while in 2014 the buzz focused on Jimmy Wales, Rovio and the new Xenon processor and its implications for Moore’s law. In both years, a significant number of Tweets are written in Spanish language.

UPDATE: Here’s the IPython Notebook with all the code, this analysis has been based on.

The Quantified Self

The most common words in the Tweets tagged #qseu13 posted over the weekend. Here you find another visualization: [Wordcloud]

Last weekend the 4th Conference on Quantified Self took place in Amsterdam. Quantified Self is a movement or direction of thought that summarizes many aspects of datarization of the live of people by themselves. The term “QS” was coined by Kevin Kelly and Gary Wolf, who hosted the conference. Thus it cannot be denied that some roots of QS lie in the Bay-Area techno-optimistic libertarianism best represented by Wired. However a second root stems from people who started quantifying themselves to better deal with manifest health problems – be it polar disorder, insomnia or even Parkinson and cancer. In both aspects the own self acts as object and subject to first analyze and then shape itself. Both have to do with self-empowerment and acting on our human condition.

“For Quantified Self, ‘big data’ is more ‘near data’, data that surrounds us.”
Gary Wolf

Quantified Self can be viewed as taking action to reclaim the collection of personal data, not because of privacy but because of curiosity. Why not take the same approach that made Google, Amazon and the like so successful and use big data on yourself?

Tweets per hour during the conference weekend. Of course our physical life finds its expression in data .... — Tweets per hour during the conference weekend. Of course our physical life finds its expression in data ….

Since many QS-people use off-the-shelf gadgets, it is not only important to get full access to the data collected but also transparency on the algorithms that are implemented within. Like Gary Wolf pointed out, if two step-counters vary in their results, it tells us one thing: there is no common concept of ‘What is a step?’. These questions of algorithm ethics become more pressing as our daily life becomes more and more dependent on algorithms but we would usually not have a chance to see into that “black box” and the implicit value judgements that are programmed into it. (I just gave a talk on that specific topic at re:publica last Monday which I will post here later). I think that in no field the problems of algorithms taking ethic decisions becomes more obvious than when data deals immediately with yourself.

What self is there to be quantified?

What is the “me”? What is left, when we deconstruct what we are used to regard as “our self” into quanta? Is there a ghost in the shell? The idea of self-quantification implies an objective self that can be measured. With QS, the rather abstract outcomes of neuroscience or human genetics become tangible. The more we have quantitatively deconstructed us, the less is left for mind/body-dualism.

On est obligé d’ailleurs de confesser que la Perception et ce qui en dépend, est inexplicable par des raisons mécaniques.
G. W. Leibniz

As a Catholic, I was never fond that our Conscious Mind would just be a Mechanical Turk. As a mathematician, I feel deep satisfaction in seeing our world including my very own self becoming datarizable – Pythagoras was right, after all! This dialectic deconstruction of suspicious dualism and materialistic reductionism was discussed in three sessions I attended – Whitney Boesel’s “The missing trackers”, Sarah Watson’s “The self in data” and Natasha Schüll’s “Algorithmic Selfhood”.

“Quantifying yourself is like art: constructing a kind of expression.”
Robin Barooah

Many projects I saw at #qseu13 can be classified as art projects in their effort to find the right language to express the usually unexpresseble. But compared to most “classic” artists I know, the QS-apologetes are far less self-centered (sounds more contradictory than it is) and much more directed to in changing things by using data to find the sweetspot to set their levers.

What starts with counting your steps ends consequently in shaping yourself with technological means. Enhancing your bodily life with technology is the definition of becoming a Cyborg, as my friend Enno Park points out. Enno got Cochlea-implants to overcome his deafness. He now advocates for Cyborg rights – starting with his right to hack into his implants. Enno demands his right to tweak the technology that became part of his head.

Self-hacking will become as common as taking Aspirin to cure a headache. Even more: we will have to get literate in the quantification techniques to keep up with others that would anyway do it for us: biometric security systems, medical imaging and auto-diagnosis. To express ourselves with our data will become part of our communication culture as Social Media have today. So there will be not much of an alternative left for those who have doubts about quantifying themself. “The cost of abstention will drive people to QS.” as Whitney Boesel mentioned.

Top Twitterers for #qseu13-conference: 1) Whitney Erin Boesel, 2) Maneesh Juneja 3) that's me ;) — Top Twitterers for #qseu13-conference: 1) Whitney Erin Boesel, 2) Maneesh Juneja 3) that’s me 😉

The immutability paradigm – or: how to add the “fourth dimension” to our data

Our brain is wired to experiencing the world as one consistent model of reality. New data we interpret either as confirmation of the model or as an update to replace one of its parameters with a new value. Our sensory organs also reduces the incoming stimuli, drop most of the impressions, preprocess what is identified as signals to simple patterns that are propagated to our mind. What we remember as the edge of our table – a straight line, limiting the surface – was in fact received as a fine grid of multicoloured pixels by our retina. For sake of saving computation and storage power, and to keep a stable, consistent view, we forsake the richness of information. And we use to build our data bases to work exactly that way.

One of the realy disruptive shifts in our business is imo to break this paradigm: “Make your source of truth immutable.” Nathan Marz (who has just yesterday left the Twitter team) tells us to have a base layer of incoming data. Nothing here gets updated or changed. New records are just attached. From such an immutable data source, we can reconstruct the state of our data set at any given point of time in the past; even if someone messes with the database, we could roll back without the need to reset everything. This rather unstructured worm is of course not fit to get access to information with low latency. In Marz’ paradigm it is the “source of truth”, is a repository to feed into a second level of more “classic” data bases that provides precalculated, prepopulated tables that can be accessed at real time.

What Nathan Marz advocates as a way to make data bases more tolerant against human fault entails in fact a deep, even philosophical perspective. With the classic database we would keep master data and transaction data in different tables. We would regard a master record as something that should provide one consistent view on the object recorded. Take a clients data base of some retailer: Address or payment information we would expect to be a static property of the client, to be kept “up to date” – if the person moves, we would update the record. Other information we would even regard as unchangeable: Name, gender or birthday for example. This is exactly how we would be looking at the world if we had remained at the state of the naive phenomenology of the early modern ages. Concepts like “identity” of a human being reflect this integral perspective of an object with master properties – ideas like “character” (individual or even bound to ethnicity or nation) stem from this taking an object as in reality being independent from the temporal state of data that we could comprehend. (Please excuse my getting rather abstract now.)

Temporal logic was developed not in philosophy but rather in computer science. The idea is, that those apodictical clauses of “true” or “false” – tertium non datur” – that we are used to deal with in propositional calculus since the time of the ancient Greeks, would not be correctly applicable to real world systems like people interacting with other in time. – The “classic” example would be a sentence like “I am hungry” that would never necessaryly be true or false because it would depend on the specific circumstances at that point in time when I would have stated it; nevertheless it should be regarded as a valid property to describe me at that time.

In such way, the immutable database might not reflect our gut feeling about reality, but it certainly is a far more accurate “source of truth”, and not only because it is more tolerant against human operators tampering with the data.

With the concept of one immutable source of truth, this “master record” is just a view on the data at one given point in time. We would finally have “the forth dimension” in our data.

Strata Conference – the first day

Here’s a network visualization of all tweets referring to the hashtag “#strataconf” (click to enlarge). The node size is representing the number of incoming links, i.e. the number of times this person has been mentioned in other people’s tweets:

This network map has been generated in three steps:

1) Data collection: I collected the twitter data with the open source application YourTwapperKeeper. This is the DIY version of the TwapperKeeper platform that had been very popular in the scientific community. Unfortunately after the acquisition by HootSuite it is no longer available to the general public, but John O’Brien has made the scripts available via his githup. I am running yTK on a Amazon EC2 instance. What it does is connecting to the Twitter Streaming API and fetching all tweets with “#strataconf” in realtime and additionally doing regular searches via the Search API to find tweets that had been overlooked by the Streaming API.

2) Data processing: What is so great about yTK: It offers different methods to fetch the tweets you collected. I am using the JSON API to get the tweets downloaded to my computer. This is done with a Python script. The script opens the JSON file and then scans the tweets for mentions or retweets with the following regular expressions I borrowed from Matthew Russell’s book Mining the Social Web:

rt_patterns = re.compile(r"(RT|via)((?:\b\W*@\w+)+)", re.IGNORECASE)
at_pattern = re.compile(r"@(\w+)", re.IGNORECASE)

Then I am using the very convenient library igraph to write the results in the generic graphml file format that can be processed by various network analysis tools. Basically I am just using this snipped of code on all tweets I found to generate the list of nodes:

if not tweet['from_user'].lower() in nodes:
    nodes.append(tweet['from_user'].lower())

… and edges:

for mention in at_pattern.findall(tweet['text']):
    mentioned.append(mention.lower())
    if not mention.lower() in nodes:
        nodes.append(mention.lower())
    edges.append((nodes.index(tweet['from_user'].lower()),nodes.index(mention.lower())))

The graph is generated with the following lines of code:

g = Graph(len(nodes), directed=True)
g.vs["label"] = nodes
g.add_edges(edges)

This is almost the whole code for processing the network data.

3) Visualization: The visualization of the graph is done with the wonderful cross-platform tool Gephi. To generate the graph above, I reduced the network to all nodes that have at least one other node referring to it. Then I sized the nodes according to their total number of degrees, that is how often they were mentioned in other people’s tweets or how often they were mentioning other users. The color is determined by the modularity clustering algorithm. Then I used the Force Atlas layout algorithm and voilà – here’s the network map.

Big Data investment map

While getting ready for the Strata Conference in Santa Clara, I prepared a network map showing the most important companies represented at the conference and their connections via venture capital or investment firms (click to enlarge). See you at the conference!

10 Points Why Market Research has to Change

_{(This is the transcript of a key-note speech by Benedikt and Joerg 2010 on the Tag der Marktforschung, the summit of the German Market Researchers’ Professional Association BVM – [1])}

Market research as an offspring of industrial society is legitimized by the Grand Narrative of modernism. But this narrative does no longer describe reality in the 21st century – and particularly not for market research. The theatre of market research has left the Euclidian space of modernism and has moved on into the databases, networks and social communities. It is time for institutionalized market research to strike tents and follow reality.

“Official culture still strives to force the new media to do the work of the old media. But the horseless carriage did not do the work of the horse; it abolished the horse and did what the horse could never do.” H. Marshall McLuhan

1. Universally available knowledge

In facing the unimaginable abundance of structured information available anytime and everywhere via the Internet, the idea of genuine knowledge progress appears naïve. When literally the whole knowledge of the world is only one click away, research tends to become database search, meta analysis, aggregating existing research or data mining, i.e. algorithm-based analysis of large datasets.

2. Perpetual beta

What can be found in Wikipedia today is not necessarily the same that could be found a few weeks ago. Knowledge is in permanent flow. But this opposes the classic procedure in market research: raising a question, starting fieldwork, finding answers and finally publishing a paper or report. In software development, final versions have been long given way to releasing versions; what gets published is still a beta version, an intermediate result that gets completed and perfected in the process of being used. Market research will have to publish its studies likewise to be further evolving while being used.

3. Users replacing institutes

The ideal market researcher of yore, like an ethnologist, would enter the strange world of the consumers and would come back with plenty of information that could be spread like a treasure in front of the employer. Preconditions had been large panels, expensive technologies, and enormous amount of special knowledge. Only institutes were able to conduct the costly research. Today, however “common” Internet users can conduct online surveys with the usual number of observed cases.

4. Companies losing their clear boundaries

“Force and violence are justified in this [oiconomic] sphere because they are the only means to master necessity.” Hannah Arendt

In her main opus “The Human Condition” Hannah Arendt describes the Oikos, the realm of economy, as the place where the struggle for life takes place, where no moral is known, except survival. The Polis in opposition to this represents the principle of purposeless cohabitation in dignity: the public space where the Oikos with its necessities has no business.

When large corporations become relevant if not the only effective communication channels, they tend to take the role of public infrastructure. At the same time, by being responsible only to their shareholders, they withdraw from social or ethical discussions. As a result, their decisions could hardly be based on ethic principles. With research ethics, the crucial question is: Can our traditional ways of democratic control, based on values that we regard important, still be asserted – and if not, how we could change this.

5. From target groups to communities

The traditional concept of target groups implies that there are criteria, objective and externally observable, which map observed human behavior sufficiently plausible to future behavior or other behavioral observations. Psychological motives however are largely not interesting.

Contemporary styles of living are strongly aligning, making the people appear increasingly similar one to each other (a worker, a craftsman or a teacher all buy their furniture at the same large retailer); however, with the Internet there are more possibilities than ever to find like-minded people even for most remote niche interests.

Often these new communities are regarded as substitute for target groups. But communities are something completely different from target groups, characterized by their members sharing something real and subjective in common: be it common interest or common fate. Thus, objective criteria also become questionable in market research.

6. The end of the survey

Google, Facebook and similar platforms collect incredible bulks of data. By growth of magnitudes quantity reverts to quality: they don’t just become bigger, but become completely new kinds of objects (e.g. petabyte memory and teraflop databases).

Seen from the users’ perspective Google is a useful search engine. From a different perspective, Google is a database of desire, meaning, context, human ideas and concepts. Given these databases, data collection is no longer the problem: rather to get the meaning behind the numbers.

7. Correlations are the new causalities

“Effects are perceived, whereas causes are conceived. Effects always precede causes in the actual development order.” H. Marshall McLuhan

In marketing it is often not important if there is really some causality to be conceived. Some correlation suffices to imply a certain probability. When performance marketing watches how people get from one website to the other, without being able to explain why: then it is good enough to know, that it is just the case.

8. The end of models

“Only for the dilettante, man and job match.” Egon Friedell

There is, as sketched before, an option for market research without theory. To take profit from new data sources and tools, the future market researcher has to act like a hacker, utilizing interfaces, data and IT-infrastructures in a creative way to achieve something that was not originally intended by this technology. The era of the professional expert is over.

9. Objectivity is only a nostalgic remembrance

Objectivity is a historical construct, formed in the 19th century and being kept in the constellation of mass society, mass market and mass media for a remarkably long time. With social networks in the Internet you do not encounter objects or specimens, but human beings that are able to self-confidently answer the gaze.

10. We need new research ethics

The category of the “consumer” shows the connection between aiming to explore our fellows and the wish to manipulate them. Reciprocity, the idea to give the researched population something back, has not been part of traditional market research’s view of the world.

In the Internet on the other hand, reciprocity and participation are expected. This has pivotal implications for our research ethics, if we want to secure the future cooperation of the women and men, which would not want to get mass-surveyed and mass-informed either. Likewise, some existing ethical concepts have become obsolete, such as the anonymity of the participants: who takes part in a project as a partner, does not need and not want to stay anonymous.

“Handle so, dass du die Menschheit sowohl in deiner Person, als in der Person eines jeden anderen jederzeit zugleich als Zweck, niemals bloß als Mittel brauchst.” Immanuel Kant

Category: conference