Perils of user-generated epidemiology

I recently worked with some students on a short analysis of Lyme disease content on YouTube.  I thought the results of the work were worth writing up for publication.  The paper will be published in the journal Social Science & Medicine in the fall, and the online manuscript version is available for download until October 29th:

The findings are pretty intuitive.  There is considerable YouTube content on Lyme disease, but most of it is neither scientific nor focused on infection prevention.  Personal stories and videos about celebrities are popular and are among the most common content available.  Public health agencies and academics produce very little content, and the content they do produce doesn’t receive much viewer interest. This may explain why the public health content that is most available on YouTube is often inconsistent with best practices recommended by experts in the field.

Given the large number of video views on Lyme disease, and the absence of content published by experts, it seems that public health agencies should create more content for YouTube.  This is not simply a scholarly exercise, but a scientific and even moral imperative–many people get information from YouTube and other forms of online user-generated media, and need to be provided with better information.  To maximize efficacy, these agencies should generate content that combines evidence-based public health information with personal stories about people who have experience living in areas where Lyme disease risk is high.

So bad it’s good

When I was a kid, watching the Littlest Hobo was little more than an unwanted reminder that my family didn’t have cable TV.

Thanks to YouTube, I can now experience the show as an adult, and pretend that its campy low-quality production value was intentional.

Here is a link to episode 1:

I think the lessons in this episode are obvious:

  • Don’t start forest fires
  • Don’t leave poisoned meat out for toddlers to consume
  • Dogs are great parachutists

Kindergarten Encryption

My wife and I have discovered the perfect (though time-limited) encryption technology–our five-year-old daughter!

Here is the process.  One of us relays a message to our daughter.  She writes it down.  The message is now encrypted.  Nobody other than my wife and I can actually understand what she has written.  In fact, not even my daughter can figure it out.  Here’s an example:

There are a few words in there that you can probably make out, but I’m willing to bet that you can’t translate it in its entirety.  For me?  No problem:

“Thanks for bringing me to the fair. And then Lisa’s cat sped for the loop-the-loops.  Oh, I guess he wants to go on the loop-the-loop. Oh I don’t want to call him ‘my cat’ any more.  Maybe we could call him ‘loopy’!”

Here’s a more challenging example:

Can you figure it out?

So if I had to send a secret message to my wife while she was away from home, I would simply get my daughter to write it down and then give it to a courier.  I wouldn’t have to worry about the message getting intercepted, since anyone who read it would be clueless as to its meaning.

We’re lucky the Germans never figured this out during World War II.  All they needed was to put a kindergartner on every U-boat, and we’d have never won the war.

The only problem is that in a year or so my daughter’s writing will improve to the point where pretty well anyone will be able to decipher it.  Yet another reason to wish she wouldn’t grow up so fast…

On super villains and economic productivity


Steven Pinker and Nassim Taleb have been locked in a fight over violence more or less ever since Pinker published his book The Better Angels of Our Nature.  One of Pinker’s claims is that war is less common (and less calamitous) now than it was in the past, and that this trend could predict a decline in the probability of a catastrophic World War III.  Taleb co-authored a piece calling Pinker’s analysis into question.  A debate emerged between them, and continues to rage between their intellectual surrogates online.

The conflict between these positions was originally due to a disagreement on the true probability distribution of war deaths over time.  If the frequency of war deaths follows a normal (or other thin-tailed) distribution, then massive war death is very unlikely, since the probability of it occurring would be way out in the tail of the distribution where things almost never happen.  On the other hand, if the frequency of war deaths is ‘fat-tailed’ then World War III might be comparatively more probable.  Taleb and his co-author disagree with Pinker mainly because they argue that the probability distribution of war deaths is fat-tailed, which makes finding a predictive trend using historical data on prior human conflict very difficult.

A counterargument, put nicely by economist Michael Spagat, is that there are other kinds of data that suggest the probability of World War III is very low.  There has been a striking decline in between-state conflict over the last half century, and between-state conflicts are more likely to result in massive war casualties than other types of conflict.  Furthermore, the emergence of geopolitical institutions like the UN is helping to bring potential conflicts under control.  Many of those in favour of the EU have argued that these multilateral political institutions have been key to reducing conflict in Europe; once countries are bound to each other economically, they have less incentive to go to war.  These kinds of factors may provide a reason to believe the frequency of war and war death will decline, independent of the historical trend.

I agree that there do seem to be some successful peace-making institutions that may have reduced the likelihood of between-state conflicts, and that between-state conflicts seem much less common now than in the past.  However, it is more important to figure out whether the balance of these exogenous factors will lead to an upward or downward trend in war death.  In this regard, I am decidedly pessimistic.  I will argue here that there is one overwhelmingly powerful exogenous process that virtually guarantees the abrupt end of humanity.

The problem

The looming threat to humanity is not, in my view, solely related to conflict between states or pseudo-state actors (like guerrillas or terrorists). Specifically, my argument is that over time it will become progressively easier for one megalomaniacal super villain to destroy the world.

The kill efficiency ratio

My argument begins with the assumption that there will always be a few megalomaniacs that want to destroy the world.  This means that the motivation for global annihilation is ever present.  The reason the world has not ended is not the absence of motivated lunatics, but that so far no one guy—or even small group of guys—with the motivation to destroy the world has had the practical capacity to do it. Theoretically, one person could have destroyed the world 30,000 years ago with just fire, if he had enough time, experienced no resistance from other humans, and had the right weather conditions.  But practically speaking, the tools he had were just not efficient enough.  These killing technologies had a low kill efficiency ratio (KER).

The KER is a ratio of the number of people that could be killed with a given technology divided by the number of people required to employ the technology.  The larger the KER, the more efficient the technology is at killing.  I’d guess that prehistoric technology probably had a KER of roughly 1:1 plus or minus an order of magnitude.  So one guy could probably kill one other guy, on average, but that’s about it. Wars involved lots of killing, but the Dr. No of antiquity still had to recruit a number of men to do his killing equal to, more or less, the number of men he wanted to kill.

The global KER has probably risen considerably over the last 200 years or so, with a big bump in the 1940s and 1950s with the invention of the atomic and hydrogen bombs.  Still, while the practical KER is higher than it once was, it is nowhere near enough for one man to kill everyone on earth (it’s far below a value of 7 billion).  It would still take a fairly large number of well-coordinated lunatics to design, build and launch enough bombs to destroy the world.  Much like fire, existing technologies could destroy the world in theory, but would require a large number of participants to get the scheme to work in practice. Indeed, if isolated (and apolitical) mass killings by humans are any indication, the KER is probably still less than 1000:1 or so today.

The problem with productivity

The future of KER will depend very much on the trend in general human productive efficiency—how much effort it takes to do something useful.  Advances in productive efficiency provide us with more leisure time and more stuff, and are the basis of how the world measures economic success.  Productivity and the expectation of growth in productivity are also at the heart of the modern economy.  In fact, if the investment world anticipated a flattening or decline in future productivity, the global economy would probably collapse–money would stop flowing into businesses, banks would stop lending, and the world would sink into a deep economic abyss.  The anticipation of future improvements in productivity is what keeps modern capitalism working, perhaps more than anything else.

As such, there is a powerful institutional force pushing for increased productivity over time–doing more and more with less and less work (i.e., fewer and fewer people).  Many of these advancements are general purpose–like computers.  Computers are major productivity enhancers for a wide range of human activities; soon computers will be driving cars, arguing our cases in court, diagnosing our diseases and caring for the elderly (something I very much look forward to when I am an old(er) man in need of care!).  As long as the financial incentives of increasing productivity remain, we can expect a continuous creation of general purpose productivity enhancers.

Like science generally, productivity enhancers are useful, but not inherently moral or immoral.  As Dawkins puts it in his description of the scientific process:

If you want to do evil, science provides the most powerful weapons to do evil; but equally, if you want to do good, science puts into your hands the most powerful tools to do so. The trick is to want the right things, then science will provide you with the most effective methods of achieving them.

As our productivity increases, we become more efficient at most things–both good and bad.  On the whole, wealth has probably been good for humanity–wealth improves health and happiness–however, while our gains in productivity increase our wealth and leisure time, they also enable ne’er-do-wells.  At some point in the future, general productivity could reach a level that enables evildoers to do a great evil to the earth.

We are already seeing signs of very efficient mischief in this world.  One computer hacker can make a pretty big nuisance of himself.  A small group of them can get international headlines.  Comparatively small groups of non-state actors–like ISIS and drug cartels–are able to create considerable havoc, and force the hand of governments to intervene. Still, these groups remain pretty limited in their power beyond their region of direct control, and still need large numbers of recruits to maintain their positions of power. They do terrible things, but they are still fairly inefficient.

It’s hard to know if improvements in global general purpose productivity will accelerate or decelerate over the coming years.  However, it is clear that there remains a strong global and structural push towards increased productivity over time–getting more output with less work–and it’s hard to imagine that changing outside some zany Marxist revolution. So it seems reasonable to assert that the opportunities for super villains to destroy the world will only increase over time, and that the end of the world at the hands of an evil megalomaniac seems more a question of ‘when’ than ‘if’.

My conclusion

In many ways we humans are probably more peaceful now than we have ever been in the past.  So perhaps Pinker is onto something–in spirit, humans are trending away from violence, on average.  However, the social and political institutions that have contributed to this period of peace and peacefulness do not address the march towards ever greater productivity, and in turn, greater kill efficiency.  In fact, even if every country in the world were an orderly and prosperous democracy populated by a citizenry that is, on average, more kind and generous and agreeable than we are today, it seems very likely that there will always be a few megalomaniacs hell-bent on mass destruction—just like there are always jerks who thumb-down videos of babies on YouTube.  What’s worrisome is that as we become more productive, the destructive potential of these megalomaniacs becomes ever greater.

The Fallicizer: a simple trick to create false correlation

I recently wrote a snippet of R code to show some students how easy it is to mess around with data to make uncorrelated variables appear correlated.  This kind of fraudulent data mining is something a decent data analyst might detect if they are careful, but it is easy for a non-expert to overlook, and can be missed by experts having an off day.

Let’s start with two uncorrelated variables: x and y.  Here’s a scatter plot:

These data are clearly uncorrelated (R = -0.02).

However, if we aggregate these data by something (say locations, age strata, or pretty well anything), we may see a different correlation than the one observed in the original data.  The reasons for this are well studied (there are dozens if not hundreds of papers on it), and are partly related to the reduction in variability we see across the aggregated values when compared to the original data.  Academics have used the term ‘ecological fallacy’ for decades to describe the consequences of this effect.  The main concern is that since correlations between aggregate data are often not the same as correlations between disaggregate data, one should be very careful about using ecological data to draw conclusions about associations between variables measured at the individual level.
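To make the aggregation effect concrete, here is a minimal sketch in Python (not the R code mentioned in this post). The sample size, number of groups, seed, and the rule for assigning observations to groups are all assumptions picked for illustration. It draws two independent variables, averages them within arbitrary equal-sized groups, and prints the correlation at both levels along with the drop in variability.

```python
import random
from statistics import mean, pvariance

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5

def group_means(values, groups, n_groups):
    """Mean of `values` within each group label 0..n_groups-1."""
    totals = [0.0] * n_groups
    counts = [0] * n_groups
    for v, g in zip(values, groups):
        totals[g] += v
        counts[g] += 1
    return [t / c for t, c in zip(totals, counts)]

rng = random.Random(1)      # fixed seed so the run is repeatable
n, n_groups = 1000, 10      # hypothetical sizes, chosen for illustration

x = [rng.gauss(0, 1) for _ in range(n)]
y = [rng.gauss(0, 1) for _ in range(n)]   # drawn independently of x

# Arbitrary equal-sized groups: observation i goes to group i % n_groups.
groups = [i % n_groups for i in range(n)]

gx = group_means(x, groups, n_groups)
gy = group_means(y, groups, n_groups)

print("individual-level r:", round(pearson(x, y), 3))
print("group-level r:     ", round(pearson(gx, gy), 3))
print("variance of x:", round(pvariance(x), 3))
print("variance of group means of x:", round(pvariance(gx), 4))
```

With only ten group means, the group-level correlation is a much noisier quantity than the individual-level one, so it can land well away from zero by chance alone, even before anyone tampers with the groups.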

Using some R code I have now posted on GitHub, you too can now create aggregate groups that increase apparent correlation in aggregate data that are uncorrelated at the level of individual observations.  Using this code and the same data that generated the graph above, I can adjust the groupings to show the following association:

These data now appear fairly strongly correlated; however, this is entirely due to the aggregation process, not any true underlying correlation between the variables.

The algorithm I used to do this is very simple, and involves shuffling around group membership before calculating the correlation between group means.  For illustration, imagine you had data that looked like this,

and then calculated the Pearson correlation for the group means (say 0.15).  If we then swap around some of the groupings a little (see highlighted rows),

we may find that it increases the resulting correlation.  The algorithm keeps only the changes that increase the apparent correlation, so the correlation between group means can never fall and, given enough swaps, tends to climb.
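The swap-and-keep idea described above can be sketched as a simple greedy search. This is a Python illustration written for this explanation, not the R code posted on GitHub. The function names (`inflate`, `group_mean_r`), the choice to swap the labels of two randomly chosen observations, the data sizes, and the number of attempts are all assumptions.

```python
import random
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5

def group_means(values, groups, n_groups):
    """Mean of `values` within each group label 0..n_groups-1."""
    totals = [0.0] * n_groups
    counts = [0] * n_groups
    for v, g in zip(values, groups):
        totals[g] += v
        counts[g] += 1
    return [t / c for t, c in zip(totals, counts)]

def group_mean_r(x, y, groups, n_groups):
    """Correlation between the group means of x and y."""
    return pearson(group_means(x, groups, n_groups),
                   group_means(y, groups, n_groups))

def inflate(x, y, groups, n_groups, n_tries, rng):
    """Hypothetical helper: swap the group labels of two random
    observations and keep the swap only if the correlation between
    group means goes up. Swapping keeps every group the same size."""
    groups = list(groups)                      # leave the caller's list alone
    best = group_mean_r(x, y, groups, n_groups)
    for _ in range(n_tries):
        i, j = rng.sample(range(len(groups)), 2)
        if groups[i] == groups[j]:
            continue                           # swap would change nothing
        groups[i], groups[j] = groups[j], groups[i]
        r = group_mean_r(x, y, groups, n_groups)
        if r > best:
            best = r                           # keep the improvement
        else:
            groups[i], groups[j] = groups[j], groups[i]   # undo the swap
    return groups, best

rng = random.Random(1)      # fixed seed so the run is repeatable
n, n_groups = 200, 10       # hypothetical sizes, chosen for illustration
x = [rng.gauss(0, 1) for _ in range(n)]
y = [rng.gauss(0, 1) for _ in range(n)]   # drawn independently of x

start = [i % n_groups for i in range(n)]
r_start = group_mean_r(x, y, start, n_groups)
new_groups, r_end = inflate(x, y, start, n_groups, 2000, rng)

print("individual-level r:    ", round(pearson(x, y), 3))
print("group-level r (before):", round(r_start, 3))
print("group-level r (after): ", round(r_end, 3))
```

The individual observations are never altered; only their group labels move. Lowering `n_tries` gives the "small number of changes" version, where the groups still look close to the sensible starting point.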

In the real world, examination of these artificial groupings would reveal some quantitative trickery.  But the cute thing about the algorithm is that one could start with sensible groups (say, based on geography or time periods) and let the algorithm make a small number of changes to increase the correlation a modest amount–small enough changes that one could perhaps evade detection, but still produce the desired effect.

What is the meaning of this?

Using this method, you can see that it is fairly easy to manipulate data to show pretty well any association you want.  As I mentioned at the top of the post, experienced data analysts can usually sniff out this kind of stuff fairly easily, but a careful data fraudster could probably escape all but the most careful scrutiny.  It’s a good reason to never aggregate data unnecessarily, and when you do aggregate data, aggregate them into groups that make sense and are widely accepted.