Statistical inference, bias and big data

These days big data is all the rage.  As you might expect, the term describes large data sets, and more specifically, data sets so big that traditional analytical tools are often not sufficient to make sense of them.  Much of the big data revolution originates with the internet, mainly because the internet makes the dissemination of data cheaper and easier, but also because the internet generates big data itself.

On the surface, it might seem that bigger is always better.  Often this is true; for example, in experimental research designs more data usually means greater generalizability, so the research has broader impact and the findings are less likely to be due to chance.  But the flip side is that big data can obscure mistakes, and make things that are unimportant seem important.  In short, everything that makes big data good at finding new and interesting discoveries also makes it good at finding things that are unimportant or not true.

Consider the following model:

outcome = true determinant + bias + random error

The outcome is something of interest; for this example, life expectancy.  The true determinant is something that has a direct and real influence on the outcome, like exercise.  The bias is a factor that has no real influence on the outcome (say, hours spent reading the National Post) but has an apparent effect because of systematic error in the collection or analysis of data.  The random error is the sum of all other unmeasured determinants that might explain the outcome.

Random error is not really a problem provided it is independent of the other determinants of the process.  Indeed, big data has an indirect way of reducing the impact of random error; generally speaking, the larger the data set, the easier it is to measure effects that are small in relation to the size of the random error.  So if exercise has a very small influence on life expectancy (and most of the variation in life expectancy is contained in the random error), with a large enough data set I will still be able to measure the effect of exercise on life expectancy.
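The link between data set size and detectability can be made concrete with a little arithmetic.  The sketch below is a rough illustration, not part of the original analysis: it assumes a simple linear model in which the test statistic for an effect grows with the square root of the number of observations, and it uses a normal approximation for the p value.  The function names and the effect size of 0.05 (measured in units of the random error's standard deviation) are hypothetical.

```python
import math


def expected_p_value(effect, n, noise_sd=1.0, x_sd=1.0):
    """Approximate two-sided p value if the estimate equals the true effect.

    Assumes a simple linear regression, where the standard error of the
    slope is roughly noise_sd / (x_sd * sqrt(n)).
    """
    z = abs(effect) * x_sd * math.sqrt(n) / noise_sd
    return math.erfc(z / math.sqrt(2))  # two-sided tail area of a normal


def n_needed(effect, z_crit=1.96, noise_sd=1.0, x_sd=1.0):
    """Approximate number of observations at which the expected p value
    reaches the threshold implied by z_crit (1.96 is roughly p = 0.05)."""
    return math.ceil((z_crit * noise_sd / (abs(effect) * x_sd)) ** 2)


# A small effect (hypothetical: 5% of the random error's standard deviation)
small_effect = 0.05
for n in (100, 1_000, 10_000, 100_000):
    print(f"n = {n:>7}: expected p = {expected_p_value(small_effect, n):.3g}")

print("observations needed for p < 0.05:", n_needed(small_effect))
```

Under these assumptions the same small effect is invisible in a data set of 100 observations and unmistakable in a data set of 100,000; nothing about the effect changes, only the amount of data.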

Bias is the most worrisome part of the model I’ve described here.  Bias is systematic error, which means that it doesn’t just obscure effects (as random error does) but misleads us by making an effect appear to have a different relationship with the outcome than it actually has.  Specifically, it makes either an important effect look unimportant or an unimportant effect look important.  In small data sets where the true effects are strong (such as the association between smoking and lung cancer), small biases are probably of little concern because small data sets are unlikely to detect them.  Large biases are always a concern, but careful review of methods will often catch them in advance, so they can be filtered out of research by rigorous scientific review.  Small biases can become a problem in big data, however, because big data can detect small effects, particularly if researchers use ‘statistical significance’ as the benchmark for accepting or rejecting an effect as important.

To illustrate the problem, I offer the results of a computer simulation based on the model above:

[Figure: N, bias and inference]

Each line shows the change in ‘p value’ with increasing size of the data set.  P values are often used to make inferences about effects: the smaller the p value, the more likely a researcher is to accept an effect as interesting (‘statistically significant’).  The x-axis is the natural log of data set size, which makes the graph easier to read; a value of 5 is a data set of about 150 observations, and a value of 9 is a data set of about 8,100 observations.  The colours correspond to the amount of bias, a rough reference for the magnitude of systematic error.  A value of 10% is a bias with a magnitude 10% that of a true effect of interest, a value of 50% is a bias with a magnitude half that of a true effect, and the blue circles correspond to a bias roughly the size of the true effect.

In this simulation small errors (10% or less) are unlikely to regularly lead to spurious conclusions in data sets with fewer than 5 to 10 million records.  But a bias half the size of a true effect becomes a problem once data sets reach 5000 observations or more; it is at this point that p values fall well below the 0.05 threshold that some researchers use to determine the importance of an effect.
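For readers who want to experiment, here is a minimal sketch of how a simulation of this kind might be set up.  It is not the code behind the results described above, and it will not reproduce those numbers: the true effect of 0.1, the standard normal variables, the way the bias is added to the outcome, and the function names are all assumptions made for illustration.  The p value uses a normal approximation, which is reasonable for the larger data sets.

```python
import math
import random


def slope_p_value(x, y):
    """Two-sided p value for the slope of a simple regression of y on x
    (normal approximation to the t statistic)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    return math.erfc(abs(slope / std_err) / math.sqrt(2))


def bias_p_value(n, bias_fraction, true_effect=0.1, seed=0):
    """Simulate outcome = true determinant + bias + random error and
    return the p value for the factor that has no real influence."""
    rng = random.Random(seed)
    exercise = [rng.gauss(0, 1) for _ in range(n)]  # true determinant
    reading = [rng.gauss(0, 1) for _ in range(n)]   # no real influence
    bias = bias_fraction * true_effect              # systematic error (assumed additive)
    outcome = [
        true_effect * e + bias * r + rng.gauss(0, 1)
        for e, r in zip(exercise, reading)
    ]
    return slope_p_value(reading, outcome)


for bias_fraction in (0.1, 0.5, 1.0):
    for log_n in (5, 7, 9, 11):
        n = round(math.exp(log_n))
        p = bias_p_value(n, bias_fraction)
        print(f"bias {bias_fraction:.0%}, log(n) = {log_n}, n = {n}: p = {p:.3g}")
```

The exact data set size at which each level of bias becomes ‘significant’ depends heavily on the assumed size of the true effect relative to the random error, so the thresholds from a sketch like this should be read as illustrative only.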

This simulation isn’t terribly generalizable, but it clearly illustrates the (very predictable) point that big data is better than small data at finding effects, both real ones and systematic errors.  It is hard to comment on the scope of the problem in practice, but it is enough of a problem that some applied statisticians have questioned the value of ‘statistical significance’ in the era of big data.  There are various solutions to the problem, but my preferred option is to focus on effect size; for large data sets assume that all measures are ‘significant’, but then describe the importance of the effects with respect to magnitude.  I will discuss some examples of this in future posts.
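As a rough illustration of reporting magnitude alongside significance, the sketch below compares two simulated groups and prints both a p value and a standardized effect size (Cohen's d, the difference in means divided by the pooled standard deviation).  Everything in it is hypothetical: the group labels, the 0.2-year difference in life expectancy, the standard deviation of 10 years, and the group size of 200,000 are assumptions chosen only to show the contrast.

```python
import math
import random


def mean_and_variance(values):
    """Sample mean and unbiased sample variance."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, variance


def compare_groups(a, b):
    """Return (Cohen's d, two-sided p value) for the difference in means,
    using a large-sample z test for the p value."""
    mean_a, var_a = mean_and_variance(a)
    mean_b, var_b = mean_and_variance(b)
    pooled_sd = math.sqrt(
        ((len(a) - 1) * var_a + (len(b) - 1) * var_b) / (len(a) + len(b) - 2)
    )
    d = (mean_a - mean_b) / pooled_sd
    std_err = math.sqrt(var_a / len(a) + var_b / len(b))
    z = (mean_a - mean_b) / std_err
    return d, math.erfc(abs(z) / math.sqrt(2))


rng = random.Random(42)
group_size = 200_000  # hypothetical 'big data' sample
readers = [rng.gauss(80.2, 10) for _ in range(group_size)]      # hypothetical
non_readers = [rng.gauss(80.0, 10) for _ in range(group_size)]  # hypothetical

d, p = compare_groups(readers, non_readers)
print(f"p value   = {p:.3g}")
print(f"Cohen's d = {d:.3f}")
```

With groups this large the p value should fall far below 0.05, yet the standardized difference stays around 0.02, well under the 0.2 that Cohen's conventional benchmarks label a ‘small’ effect.  Describing the result by its magnitude makes it clear how little is at stake, whatever the p value says.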