The Scientific Method: Testing and Significance in the Age of Big Data

“One hundred and sixty eight (68 men and 100 women) undergraduates from a small, private college in Pennsylvania participated in this study.”
(L. McDermott, T. Pettijohn II: “The Influence of Clothing Fashion and Race on the Perceived Socioeconomic Status and Person Perception of College Students.” Psychology & Society, 2011, Vol. 4 (2), 64-75)

Draper: “What do women want?”
Stirling: “Who cares!”

One of my colleagues at the Max-Planck-Institut once came to me with a draft paper. It dealt with sexual dimorphism and claimed to present evidence that most differences between the sexes could be explained by genetic heritage. The method that was mandatory practice at this institute was sociobiology: any behavior should only occur in humans (and likewise in animals) if a clear evolutionary advantage could be derived from it. Since it was the early 90s, the fight of science against postmodernism was still at its peak. Postmodernist thinking, like “it could just be that we impose our social conventions on our methods, to learn what we already knew”, was brusquely brushed away, because “we use the scientific method, don’t we?”.
In the paper, my colleague presented results of some surveys he had conducted. They showed a correlation between the perceived “beauty” of people in images and the beauty of the subject’s spouse (I forget how he had quantified it, and I also forget the thesis he derived from it). The correlation was very weak, like R²~0.6 or so. But because he had surveyed several hundred people, it became significant; it proved his absurd postulate.

My example might sound bizarre to the layperson, but have a look at publications like the Journal of Personality and Social Psychology (to which my colleague had submitted his paper).

The scientific method in general, and in the quantitative social sciences in particular, consists of four steps:
1. Formulate the hypothesis
2. Draw a representative sample of observations
3. Test the hypothesis and prove it significant
4. Publish the results for review.

For now I do not want to focus on the strange reviewing practices that do not really publish results but rather keep them within the closely confined boundaries of scientific journals, inaccessible to the public and available only to a small academic elite, so that a sound review hardly takes place.

I want to discuss the first three steps, because during the last 20 years my professional field has undergone a dramatic paradigm shift regarding these, while the fourth is still holding for the time being.

The quantitative methods of social science originate in the age of mass society within the nation state. They were developed as tools to help management and politics make decisions. The alternatives to be tested were usually simple. The industrial production process did not allow for subtle variations in the product, so it was sufficient to present very few varieties, usually two, to the survey’s subjects. People’s lives were likewise simple: a teacher’s wife would show a distinctive consumption pattern, as would a coal miner. It was good enough to know people’s age, gender, and profession to generalize from one specimen to the whole group. Representativeness means that one element of a set is used to represent the whole set, and not just in the properties that characterize the set itself (like male/female, Caucasian/Asian, etc.); the whole set inherits all properties of its representative. It is counting the set as one, as Alain Badiou puts it.

We are so used to this aggregation of people into homogeneous sets that we hardly notice it anymore. The concept of “target groups” in advertising rests on it, too. Brands buy advertising by briefing the agency on the gender, age, and education of the people the campaign should reach. A prominent example is the ABC-Audience in the UK, a rough segmentation of the populace just by buying power and cultural capital.

In mass society, up to the 1970s, this more or less seemed to make sense. People within their class or milieu behaved predictably enough. Television consumption in particular still mirrors this aspect of mass society: ratings and advertising effect could be calculated and even predicted from the TV measurement panels with scientific precision. In 2006 I took charge of one of the largest and longest-running social surveys, the “Typologie der Wünsche”. The topics covered were consumption, brand preferences, and many aspects of people’s opinions and daily routines, surveyed in personal interviews with 10,000 participants per year. Preparing a joint study with Roland Berger Strategy Consultants, I examined the buyers of car brands regarding every aspect that would define them as a “target group”. The fascinating result was this: while from the 1980s to the mid 1990s buyers of a given car brand had indeed been quite homogeneous in their political opinions, ecological preferences, consumption of other brands, etc., this homogeneity seemed to fade over the following decade. The variance increased dramatically, so that it became questionable to speak of “the buyer of a car brand” at all. This was even more true for fast-moving consumer goods. Superficially, this could be explained by daily consumption becoming cheaper in proportion to average income, so that poorer consumers were no longer as restricted to certain goods as in earlier times. But the observation held even when only people of comparable wealth were taken into account.

Our conclusion: the end of mass media (from which my employer was suffering, like most traditional publishers) might come along with the end of mass society, too. The concept of aggregating people by objective criteria, by properties observable from the outside like gender, income, or education, was coming under pressure.

For the Israeli military strategist Martin van Creveld, this is also the underlying condition for what he calls the “Transformation of War”. In military philosophy, the corresponding paradigm is the distinction between soldiers and civilians, developed by von Clausewitz. Van Creveld argues that the constructs of Clausewitz’s theory, like ‘peoples’ (‘Völker’), never existed in the first place. They were just stories told to organize war at industrial scale. And van Creveld explicitly deconstructs the gender gap in battle. His book is full of quantitative proof that men, regarded purely from the physical perspective, make no better soldiers than women. The distributions of women and men in height, weight, physical strength, etc. largely overlap. Of course, men are on average a few centimeters taller than women. This mean difference is significant if you run a t-test. But as always, a significant mean difference says nothing about the individual. Most women are as tall as most men. Only some men are taller, and some women are shorter.

The fallacy of significance testing should be obvious. It presumes that the subjects originate from different universes, from disjoint subsets of the population. Testing takes it for granted that hypothesis and alternative are truly distinct, that only one can hold. This is hardly ever the case where humans are concerned. For most properties that we study in social research, the variance within a set is much bigger than the variance between two sets, be it gender, age, education, hair color, or whatever criterion we choose to form the subsets. In most respects, women differ more from one another on average than the mean of women differs from the mean of men.
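To make the point about variance within and between sets concrete, here is a minimal sketch in Python. All numbers are illustrative assumptions and not survey data: two groups of 5,000 people each, a trait with a standard deviation of 15, and group means that differ by 3 points. The function names are my own.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Illustrative assumption: two groups of 5,000 people each, scored on some
# trait with standard deviation 15; the group means differ by 3 points.
N = 5000
group_a = [random.gauss(100.0, 15.0) for _ in range(N)]
group_b = [random.gauss(103.0, 15.0) for _ in range(N)]


def welch_t(a, b):
    """Welch's t statistic for the difference between two sample means."""
    standard_error = math.sqrt(
        statistics.variance(a) / len(a) + statistics.variance(b) / len(b)
    )
    return (statistics.fmean(b) - statistics.fmean(a)) / standard_error


def share_between_groups(a, b):
    """Fraction of the total variance that lies between the group means."""
    pooled = a + b
    grand_mean = statistics.fmean(pooled)
    between = sum(
        len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in (a, b)
    )
    total = sum((x - grand_mean) ** 2 for x in pooled)
    return between / total


def chance_b_exceeds_a(a, b, pairs=20000):
    """Estimate how often a random member of b scores above a random member of a."""
    wins = sum(random.choice(b) > random.choice(a) for _ in range(pairs))
    return wins / pairs


t_stat = welch_t(group_a, group_b)
between_share = share_between_groups(group_a, group_b)
p_superior = chance_b_exceeds_a(group_a, group_b)

print(f"t statistic:                 {t_stat:.1f}")
print(f"variance between the groups: {between_share:.1%}")
print(f"random B above random A:     {p_superior:.1%}")
```

Under these assumptions the t statistic lands far beyond any usual threshold, while group membership accounts for something on the order of one percent of the total variance, and a random member of the higher-scoring group beats a random member of the other group only slightly more often than a coin flip would.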

Given this, the logical next question is whether the method was ever correct at all. Under the conditions of the industrial age, products could only be served to aggregates of people. Representative democracy likewise offers only the choice between a handful of party programmes. And mass media could, as a matter of principle, not match individual preferences. So it seemed logical to place people in categories, too, without bothering that dichotomous variables like sex might not be appropriate to map people’s gender. Quantitative social research just reproduced the ideological restrictions of mass society.

With the Web, people suddenly had a choice, not only regarding media, but also regarding consumption. And (surprise!) people do act individually, and their actions are so random that no correlation holds for more than a couple of weeks. “Multi-optional consumer” is a helpless way to express that the silos of segmentation no longer make sense. Of course nobody has ever encountered a multi-optional person; on the individual level, people’s behavior is mostly continuous and perfectly consistent; it is just no longer about “what women want”.

The Web also presented, for the first time, a tool to collect data describing (nearly) everyone on the individual level. However, with trillions of data points on billions of users, every difference between subgroups becomes significant anyway. As with the results in my colleague’s ethological study mentioned above, the problem comes from taking significance as absolute. No matter how small an effect is, as long as it is significant, it will be considered proven. But statistical inference was designed to suit sample sizes from some tens to the low thousands of people. It is ill-suited to deal with big data.
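How quickly a negligible effect turns “significant” as the sample grows can be sketched in a few lines of Python. This is a rough illustration under stated assumptions: the correlation of r = 0.02 is made up, the helper name approx_p_value is my own, and the p-value uses a normal approximation to the t distribution, which is crude for small samples and very close for large ones.

```python
import math


def approx_p_value(r, n):
    """Two-sided p-value for a Pearson correlation r observed on n pairs.

    Uses the usual t statistic for a correlation and, as a simplifying
    assumption, a normal approximation to the t distribution. That is
    rough for small n and very close for large n.
    """
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    return math.erfc(abs(t) / math.sqrt(2.0))


R = 0.02  # illustrative effect: r squared is 0.0004, i.e. 0.04 % of variance
SAMPLE_SIZES = [100, 10_000, 1_000_000]

p_values = {n: approx_p_value(R, n) for n in SAMPLE_SIZES}

for n, p in p_values.items():
    print(f"n = {n:>9,}   r = {R}   p = {p:.3g}")
```

The effect is the same in every row; only the sample size changes. With a hundred observations nobody would report it, with a million it passes any conventional threshold, although it still explains a vanishing share of the variance.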

The jokes about silly correlations with Google Trends are thus totally correct. And they demonstrate another aspect as well: significance and hypothesis testing are treated as static, while data remains dynamic. At some point in time, a correlation of Google Trends with some other time series might randomly become significant, but it is highly unlikely that this bogus correlation will survive. Data science, unlike classic quantitative research, tends to deal with data in an agile way, which means that nothing is regarded as fixed. But if we see our data as ephemeral, there is no need to come up with models that we restrict according to fixed, proven hypotheses.
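The bogus correlations can be reproduced with nothing but random numbers. The following Python sketch makes strong simplifying assumptions: the “time series” are plain white noise rather than real search trends, the window length and the number of candidate series are arbitrary, and 0.361 is the approximate two-sided 5 % critical value of |r| for 30 pairs.

```python
import math
import random

random.seed(7)  # fixed seed so the sketch is reproducible

WINDOW = 30        # observations per period (arbitrary assumption)
N_SERIES = 1000    # number of unrelated candidate series (arbitrary assumption)
CRITICAL_R = 0.361  # approx. two-sided 5 % critical value of |r| for 30 pairs


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def noise(length):
    """A series of independent standard normal draws."""
    return [random.gauss(0.0, 1.0) for _ in range(length)]


# One target series and many candidates, all mutually independent noise,
# each covering two consecutive periods of WINDOW observations.
target = noise(2 * WINDOW)
candidates = [noise(2 * WINDOW) for _ in range(N_SERIES)]

# Period 1: search for the candidate that correlates best with the target.
first_period = [pearson(c[:WINDOW], target[:WINDOW]) for c in candidates]
nominally_significant = sum(abs(r) > CRITICAL_R for r in first_period)
best = max(range(N_SERIES), key=lambda i: abs(first_period[i]))
r_first = first_period[best]

# Period 2: check whether the "discovery" survives on fresh data.
r_second = pearson(candidates[best][WINDOW:], target[WINDOW:])

print(f"candidates 'significant' at 5 % in period 1: {nominally_significant}")
print(f"best candidate, period 1: r = {r_first:+.2f}")
print(f"same candidate, period 2: r = {r_second:+.2f}")
```

None of the series has anything to do with the target, yet a search across enough of them reliably turns up a respectable correlation. Followed into the next period, the winner usually collapses, which is exactly why a test result frozen at one point in time tells us so little about a stream of data.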

So the role of statistics in social science changes. Statistics is now the tool to deal with distributions as phenomena in their own right, rather than just generalizing from small samples to an unknown population. We should treat the stream of data as the living conditions in which our models have to struggle to survive. As with biological evolution, we would not expect the assumptions to remain stable. We would rather expect the boundary conditions to change, and our models would have to adjust; survival of the fittest model means: the fittest for now.

The philosophical justification for inference is the idea of the general comprehensibility of reality. Like St. Augustine, we postulate that it is possible to extrapolate from perception (= measurement, data) to the world of things. But just as our sensory organs have evolved, driven by environmental change (and mutations in our genome), we should regard the knowledge we derive from data as a “shadow on the cave wall” at best.

This is far better than it sounds: it gives us the freedom to explore data rather than just test our made-up hypotheses, which would only perpetuate our presumptions.

Let’s leave statistical testing and significance where they belong: quality assurance, materials testing, physical measurements; in short, engineering.
Let’s be honest and drop them in the humanities.
