I recently wrote a snippet of R code to show some students how easy it is to manipulate data so that uncorrelated variables appear correlated. A decent data analyst might detect this kind of fraudulent data mining if they are careful, but it is easy for a non-expert to overlook, and even experts can miss it on an off day.
Let’s start with two uncorrelated variables: x and y. Here’s a scatter plot:
These data are clearly uncorrelated (Pearson's r = -0.02).
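For readers who want to follow along, here is a minimal sketch of how such a starting point could be produced. The original data are not shown in this post, so the sample size, the seed, and the choice of standard normal variables are all assumptions made for illustration.

```r
# Hypothetical stand-in for the data above: two independent normal variables.
# The seed and sample size are arbitrary choices, not the original values.
set.seed(42)
n <- 1000
x <- rnorm(n)
y <- rnorm(n)

# Pearson correlation at the level of individual observations.
r_individual <- cor(x, y)
cat("individual-level r:", round(r_individual, 3), "\n")
```

Because `x` and `y` are drawn independently, the printed correlation should sit close to zero; calling `plot(x, y)` afterwards draws the corresponding scatter plot.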
However, if we aggregate these data by something (say locations, age strata, pretty well anything), we may see a different correlation than the one observed in the original data. The reasons for this are well studied (there are dozens if not hundreds of papers on it), and are partly related to the reduction in variability we see across the aggregated values when compared to the original data. Academics have used the term ‘ecological fallacy’ for decades to describe the consequences of this effect. The main concern is that since correlations between aggregate data are often not the same as correlations between disaggregate data, one should be very careful about using ecological data to draw conclusions about associations between variables measured at the individual level.
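To get a feel for how much the aggregate correlation can wander, the sketch below assigns simulated observations to arbitrary groups many times over and records the correlation between group means each time. Everything in it is an assumption for illustration: the data are simulated, the groups are random and of equal size, and `group_mean_cor` is a helper name I made up.

```r
# Simulated, independent x and y (hypothetical data, arbitrary seed).
set.seed(1)
n <- 1000
n_groups <- 10
x <- rnorm(n)
y <- rnorm(n)

# Hypothetical helper: Pearson correlation between the group means of x and y.
group_mean_cor <- function(x, y, groups) {
  cor(as.numeric(tapply(x, groups, mean)),
      as.numeric(tapply(y, groups, mean)))
}

# Try 200 arbitrary groupings, each splitting the data into equal-sized groups.
agg_cors <- replicate(200, {
  groups <- sample(rep(seq_len(n_groups), length.out = n))
  group_mean_cor(x, y, groups)
})

cat("individual-level r:", round(cor(x, y), 3), "\n")
cat("range of group-mean r:", round(range(agg_cors), 3), "\n")
```

With only ten group means going into each correlation, the aggregate values scatter widely around the individual-level figure, which is what leaves room for a motivated analyst to pick a grouping that suits them.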
Using some R code I have now posted on GitHub, you too can create groupings that inflate the apparent correlation between group-level summaries of data that are uncorrelated at the level of individual observations. Using this code and the same data that generated the graph above, I can adjust the groupings to show the following association:
These data now appear fairly strongly correlated; however, this is entirely due to the aggregation process, not any true underlying correlation between the variables.
The algorithm I used to do this is very simple, and involves shuffling around group membership before calculating the correlation between group means. For illustration, imagine you had data that looked like this,
and then calculated the Pearson correlation for the group means (say 0.15). If we then swap around some of the groupings a little (see highlighted rows),
we may find that it increases the resulting correlation. The algorithm keeps only the changes that increase the apparent correlation, so the correlation between group means can never go down and, over many iterations, tends to climb well above its starting value.
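The idea can be sketched in a few lines of R. This is not the code posted on GitHub, just a minimal illustration under my own assumptions: the data are simulated, each proposed change swaps the group labels of two randomly chosen observations (which keeps group sizes fixed), and the names `group_mean_cor`, `start_cor` and `best_cor` are invented for the example.

```r
# Simulated, independent x and y with an arbitrary starting grouping.
set.seed(123)
n <- 1000
n_groups <- 20
x <- rnorm(n)
y <- rnorm(n)
groups <- sample(rep(seq_len(n_groups), length.out = n))

# Hypothetical helper: Pearson correlation between the group means of x and y.
group_mean_cor <- function(x, y, groups) {
  cor(as.numeric(tapply(x, groups, mean)),
      as.numeric(tapply(y, groups, mean)))
}

start_cor <- group_mean_cor(x, y, groups)
best_cor <- start_cor

for (iter in seq_len(2000)) {
  # Propose swapping the group labels of two randomly chosen observations.
  pair <- sample.int(n, 2)
  candidate <- groups
  candidate[pair] <- groups[rev(pair)]
  cand_cor <- group_mean_cor(x, y, candidate)

  # Keep the swap only if the correlation between group means goes up.
  if (cand_cor > best_cor) {
    groups <- candidate
    best_cor <- cand_cor
  }
}

cat("individual-level r:       ", round(cor(x, y), 3), "\n")
cat("group-mean r before swaps:", round(start_cor, 3), "\n")
cat("group-mean r after swaps: ", round(best_cor, 3), "\n")
```

The individual observations are never altered; only their group labels move. Lowering the number of iterations, or stopping after a handful of accepted swaps, gives a subtler version of the same manipulation.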
In the real world, examination of these artificial groupings would reveal some quantitative trickery. But the cute thing about the algorithm is that one could start with sensible groups (say, based on geography or time periods) and let the algorithm make a small number of changes to increase the correlation a modest amount: changes small enough that one could perhaps evade detection, but still produce the desired effect.
What is the meaning of this?
Using this method, you can see that it is fairly easy to manipulate data to show pretty well any association you want. As I mentioned at the top of the post, experienced data analysts can usually sniff out this kind of stuff fairly easily, but a careful data fraudster could probably escape all but the most careful scrutiny. It’s a good reason to never aggregate data unnecessarily, and when you do aggregate data, aggregate them into groups that make sense and are widely accepted.