Fishtomatoes and the ‘backfire effect’

Early in 2015 a story started circulating on social media about a young man dying from an allergic reaction to eating a genetically modified tomato.  The source of the story,  World News Daily Report, claimed that a young Spanish man named Juan Pedro Ramos died after eating a tomato ‘containing fish genes’.  He was ‘known to have allergies’ but did not recover following the injection of epinephrine, and died later in hospital.

[Image: fish tomato]

The idea of the ‘fishtomato’ has origins in fact; over a decade ago a company did try to put a fish gene into a tomato (Schmidt 2005).  However, the demise of Juan Pedro Ramos as told here is almost certainly not true.  No reputable news source reported the story, and I can’t find any evidence of fish-tomato hybrids on the market in Europe or anywhere else.

Nevertheless, it is not easy to prove that the story is untrue; I can’t find evidence for the absence of this event (or of most events), I can merely fail to find evidence and assume that it must not have happened.  This is a familiar epistemological challenge: it is hard (and perhaps impossible) to disprove a specific claim of an event after the fact.  I could go to all the hospitals, survey all the death records and interview every family member of every Juan Pedro Ramos in Spain, but even after finding no evidence that the event took place, I still couldn’t know with certainty that the story is a fabrication.

Normally this isn’t a big deal.  There are few certainties in life; more often than not we  decide what is ‘true’ and ‘false’ by weighing the balance of evidence, not by knowing anything with 100% certainty.  Unfortunately, as we all know (and as I’ve pointed out in my own example), the internet makes the spreading of false information very easy, and the sheer volume of falsehoods makes decision making based on evidence increasingly difficult.

A multiplier of concern is that once a person forms a belief based on false evidence, corrections may not only fail to change their mind, but may actually reinforce their belief in the falsehood.  In 2010 Brendan Nyhan and Jason Reifler published a paper called When Corrections Fail: The Persistence of Political Misperceptions, in which they study the relationship between belief, information and ideology.  Basically, they wanted to know whether presenting a correction of a currently held misperception of a political or economic fact would fix that misperception.  Perhaps unsurprisingly, they found that people with the strongest beliefs most resisted correction of those beliefs and, even worse, would develop a stronger attachment to their misperceptions when presented with facts pointing to the contrary.  This is the ‘backfire effect’, where attempts to change minds with good evidence accomplish the precise opposite.

So these fishtomato stories are not only problematic because they are plentiful and false, but because they nurture beliefs that are hard to reverse even in the face of clear facts to the contrary.  There are almost certainly thousands if not millions of people around the world who have come across this story, and some of them now believe that there are fish tomatoes in their grocery stores.  And if any of those people read this blog post, they are almost certainly not going to be convinced to change their minds…

Statistical inference, bias and big data

These days Big Data is all the rage.  As you might expect, big data describes large data sets, and more specifically, data sets that are so big that traditional analytical tools are often not sufficient to make sense of them.  Much of the big data revolution originates with the internet, mainly because the internet makes the dissemination of data cheaper and easier, but also because the internet generates big data itself.

On the surface, it might seem that bigger is always better.  Often this is probably true; in experimental research designs, for example, more data means more generalizability, which means the research impact is broader and the findings are less likely to be due to chance.  But the flip side is that big data can obscure mistakes and make unimportant things seem important.  In short, everything that makes big data good at finding new and interesting discoveries also makes it good at finding things that are unimportant and untrue.

Consider the following model:

outcome = true determinant + bias + random error

The outcome is something of interest, for this example, life expectancy.  The true determinant is something that has a direct and real influence on the outcome–like exercise. The bias is a factor that has no real influence on the outcome–say hours spent reading the National Post–but has an apparent effect because of systematic error in the collection or analysis of data.  The random error is the sum of all other unmeasured determinants that might explain the outcome.

Random error is not really a problem provided it is independent of the other determinants of the process. Indeed, big data has an indirect way of reducing the impact of random error; generally speaking, the larger the data set, the easier it is to measure effects that are small in relation to the size of the random error.  So if exercise has a very small influence on life expectancy (and most of the variation in life expectancy is contained in the random error), with large enough data I will still be able to measure the effect of exercise on life expectancy.

Bias is the most worrisome part in the process I’ve described here.  Bias is systematic error, which means that it doesn’t just obscure effects (like random error) but misleads us by making an effect look to have a different relationship with an outcome than what is actually the case.  Specifically, it either makes an important effect look unimportant, or an unimportant effect look important.  In small data sets where the true effects are strong (such as the association between smoking and lung cancer), small biases are probably of little concern because small data sets are less likely to reveal them.  Large biases are always a concern, but careful review of methods will often catch these a priori, so they can be filtered out of research by rigorous scientific review. Small biases can become a problem in big data, however, because big data can detect small effects, particularly if researchers use ‘statistical significance’ as a benchmark for accepting or rejecting an effect as important.

To illustrate the problem, I offer the results of a computer simulation based on the model above:

[Figure: p values by data set size and magnitude of bias]

Each line shows the change in ‘p value’ with increasing size of the data set.  P values are often used to make inferences about effects–the smaller the p value, the more likely a researcher will accept an effect as interesting (‘statistically significant’).  The x-axis is the log of data set size to help with the interpretation of the graph; a value of 5 is a data set of about 150 observations, and a value of 9 is a data set of about 9000 observations.  The colours correspond to the amount of bias–basically, a rough reference for the magnitude of systematic error.  A value of 10% is a bias with a magnitude 10% that of a true effect of interest, a value of 50% is a bias with a magnitude half that of a true effect, and the blue circles correspond to bias roughly the size of the true effect.

In this simulation, small biases (10% or less) are unlikely to regularly produce spurious conclusions in data sets with fewer than 5 to 10 million records.  But a bias half the size of a true effect becomes a problem once data sets reach 5000 observations or more; at that point p values fall well below the 0.05 threshold that some researchers use to determine the importance of an effect.
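The idea can be sketched in a few lines of Python (this is not my original simulation code; the bias size, the sample sizes, and the normal-approximation p value are all illustrative choices): a systematic error one tenth the size of the noise is invisible in a small sample but becomes ‘significant’ as the data grow.

```python
import math
import random

def p_value_of_mean(samples):
    """Two-sided p value for the null that the true mean is zero,
    using a normal approximation to the t statistic."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)
    z = mean / math.sqrt(var / n)
    return math.erfc(abs(z) / math.sqrt(2))

def simulate(bias, n, seed=1):
    """Observations whose only systematic component is the bias."""
    rng = random.Random(seed)
    return [bias + rng.gauss(0, 1) for _ in range(n)]

# The same small bias, measured with more and more data: the p value
# shrinks even though nothing real is being detected.
for n in (100, 1_000, 100_000):
    print(n, round(p_value_of_mean(simulate(bias=0.1, n=n)), 4))
```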

This simulation isn’t terribly generalizable, but it clearly illustrates the (very predictable) point that big data is better at finding effects–both real ones and systematic errors–than small data.  It is hard to comment about the scope of the problem in practice, but it is enough of a problem that some applied statisticians have questioned the value of ‘statistical significance’ in the era of big data.  There are various solutions to the problem, but my preferred option is to focus on effect size: for large data sets, assume that all measures are ‘significant’, and instead describe the importance of the effects by their magnitude.  I will discuss some examples of this in future posts.
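To sketch what I mean (the numbers here are hypothetical, and Cohen’s d is standing in for ‘magnitude’): with a million observations, a difference of two hundredths of a standard deviation is overwhelmingly ‘significant’ but trivially small.

```python
import math
import random

def mean_sd(xs):
    n = len(xs)
    m = sum(xs) / n
    sd = math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))
    return m, sd

def two_sample_p_and_d(a, b):
    """Two-sided p value (normal approximation) for a difference in
    means, reported alongside Cohen's d so magnitude isn't ignored."""
    ma, sa = mean_sd(a)
    mb, sb = mean_sd(b)
    se = math.sqrt(sa ** 2 / len(a) + sb ** 2 / len(b))
    z = (ma - mb) / se
    p = math.erfc(abs(z) / math.sqrt(2))
    d = (ma - mb) / math.sqrt((sa ** 2 + sb ** 2) / 2)
    return p, d

# Two groups whose means differ by a trivial 0.02 standard deviations.
rng = random.Random(7)
a = [rng.gauss(0.02, 1) for _ in range(500_000)]
b = [rng.gauss(0.00, 1) for _ in range(500_000)]

p, d = two_sample_p_and_d(a, b)
print(f"p = {p:.3g}, Cohen's d = {d:.3f}")
```

Reporting both numbers makes it obvious that the effect, however ‘significant’, is too small to matter.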

Canadian election 2015: the social sciences are hard

The 2015 Canadian federal election is over, and the results came as a surprise to many.  In the days just before the election, a few sources (like Mainstreet polling) predicted the Liberal majority, and some polling firms (like Nanos) came very close on the national popular vote.

[Figure: 2015 Canadian Popular Vote Predictions]

In 2011 many federal pollsters were wrong, but seat projections assuming correct polling numbers were pretty good.  In other words, if the regional polls had been accurate in 2011, then the seat count prediction would have been very close to the election result.  This time around the polls predicted popular support (even regionally), but few analysts predicted the Liberal majority, and those who did waited to make their predictions until the last minute.

Consider Eric Grenier’s prediction:

[Figure: Seat projections from the CBC]

To his credit, Grenier did predict a Liberal victory, but considered 184 seats a long shot; his maximum projection for the Liberals was 185 seats, but his best estimate was 146.

Predicting an election outcome, like many predictions about human systems, is very hard, particularly in the weeks or months before an election.  It’s hard for many reasons.  For one, the data (stated preferences in regional samples) are not a perfect representation of behaviour on election day.  Many people change their minds moment to moment, influenced by new information, personal experiences, and gut feelings that can emerge spontaneously throughout a campaign.  In addition, it’s hard to get representative samples of voters from polls.  Different polling methods can produce different results, and we are far from understanding the full effect of new technologies on voting behaviour.

Perhaps the greatest challenge to predicting election outcomes is that the very publicizing of the information used to predict elections may itself affect the results.  In this election it seemed that once a few polls in early October started to favour the Liberals over the NDP, support for the NDP collapsed.  The electorate wanted change, but may have waited until very late in the campaign to decide where to put the ‘change’ vote.  For many people, the final voting choice seems to have depended on polling information late in the campaign.  One can’t help but wonder: had the NDP been reported to have polling momentum a few weeks earlier, might the snowballing of anti-Conservative support have rolled in their favour?

If predictions of results can influence future results, then predictions of election results may be inaccurate the instant they are made (particularly when made long before voting day), but they could still be an important tool in strategic voting.  It makes for a complex dynamic of information dependence–where current polling (and/or voting) may depend on the public dissemination of prior polling information.  Predicting final vote results in the face of such dependence is difficult, and perhaps even impossible.  Yet more evidence that the social sciences are hard.  That’s what makes them interesting.

Predicting Olympic 100m sprint race times

I have long wondered about the future of the Olympic sprint.  Specifically, at what point will athletes reach the limits of human ability, where we simply can’t run much faster?  If that day ever comes, race finish times will be more or less identical, and winners will have to be identified from photographs at the finish line.  If that ever happens, I suspect the event may lose some of its fascination.

I went to an online repository of Olympic data (http://www.databaseolympics.com) from the first Olympics until 2008 and did a little analysis, basically modelling the average race result times of medal winners by Olympic year.  The data and R code can be downloaded for your own use.  My intent was to model the shape of the historical trajectory of race results, stratified by sex, to see whether or not there is an inflection point.

The model suggests that race times improved greatly after the first Olympics, but that the rate of improvement has slowed over the last 30 years, especially for men.  The figure below shows the model-predicted race times by year for women and men.  I use the predicted values because the model has an R-squared value over 0.90, suggesting that most of the variation over time is ‘explained’ by the model, and because predicted values are easier to interpret than the observed data.  Predictions are based on a model that only includes statistically significant regression coefficients.

[Figure: Predicted race times by year]

If this pattern is predictive of the future, we can expect men’s race times to change little going forward; it seems that for men the natural speed limit lies in the 9.6 to 9.8 second range.  It could be that few people will ever be capable of running times in that range, so the races will always be dramatic contests between the uniquely speedy.  However, world records probably won’t change as much as they did in the past unless officials keep adding digits to the time clock.
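I can’t rerun the full analysis here, but the flavour of the exercise can be sketched with synthetic data: assume a hypothetical exponential decay toward a floor (the parameter values below are invented so the curve flattens near that 9.6 to 9.8 second range; these are not real Olympic results), then try to recover the floor from the noisy series.

```python
import math
import random

def race_time(year, floor, drop, rate):
    """Hypothetical model: times decay exponentially toward a floor."""
    return floor + drop * math.exp(-rate * (year - 1896))

# Synthetic medal-average times generated from assumed parameters
# (floor 9.7 s, initial surplus 2.3 s) plus noise -- not real data.
rng = random.Random(42)
years = list(range(1896, 2012, 4))
times = [race_time(y, 9.7, 2.3, 0.025) + rng.gauss(0, 0.02) for y in years]

# Recover the floor (the 'natural speed limit') with a coarse
# least-squares grid search over plausible floors and decay rates.
best = None
for floor in [9.5 + 0.01 * i for i in range(51)]:        # 9.50 .. 10.00 s
    for rate in [0.005 + 0.001 * j for j in range(46)]:  # 0.005 .. 0.050
        drop = times[0] - floor  # anchor the curve at the first Games
        sse = sum((t - race_time(y, floor, drop, rate)) ** 2
                  for y, t in zip(years, times))
        if best is None or sse < best[0]:
            best = (sse, floor, rate)

sse, floor, rate = best
print(f"estimated speed limit: {floor:.2f} s")
```

With real data the fit is messier, but the logic is the same: if the series has flattened, the asymptote is well identified; if it hasn’t (as for the women’s times), the floor estimate remains uncertain.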

For women, there seems to still be room for improvement, as the race results up to 2008 have not flattened out.  Perhaps women could still shave off a quarter of a second from race times, or perhaps even more than that.  Could this mean that women sprinters may one day catch men (literally and figuratively)?  It’s possible, but assuming that the rate of improvement remains the same, it would probably take a couple hundred years.

Distance from work

This past weekend I did a little experimenting with Google Sheets.  I used a function called REGREPLACE to return the number of Google search results for the phrase:

“I live n miles from work”

where n takes values between 1 and 100 in 5 unit intervals (except for the first interval, which was only 4 units: so 1, 5, 10, and so on up to 100).  This gave me a table of the number of web pages in the Google index containing the above phrase, for different values of distance.  I used both the numeral (e.g., 5) and the written value (e.g., five) to be reasonably complete about it.

Here is a graph summarizing the results:

[Figure: number of web page results by distance]

Each dot on the graph is the number of web page results for the search phrase (“I live n miles from work”).

What does this tell us?  Well, the phrases “I live 1 mile from work” and “I live 5 miles from work” seem to be the most common, but that doesn’t say much about how far people actually live from work.  This is not a random sample, after all.

The more interesting thing to me is the zig-zag pattern, where most multiples of ten (20, 30, etc.) are higher than their neighbouring multiples of five (25, 35, 45, etc.).  This pattern is almost certainly not because people are actually more likely to live 30 rather than 25 miles from work, or 40 rather than 35 miles from work.  So what’s going on?

It seems these data are telling us something about rounding behaviour; when thinking about the distance between where we live and where we work, we seem more likely to estimate that distance to the nearest ten than to the nearest five.  This is worth keeping in mind when we ask people questions about distance, particularly if people who round to the nearest ten are somehow different from people who round to the nearest five.  Newer residents of a city may round to the less precise ten compared to more established residents, for example.  Understanding this rounding behaviour may be useful for improving how we understand perceptions of distance, but it’s also a good reminder to interpret these estimates with some caution, particularly for short distances; rounding from 44 to 40 is only about a 10% error, but rounding from 24 to 20 is nearly double that.
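A toy simulation makes the mechanism concrete (the 50/50 split between ‘ten-rounders’ and ‘five-rounders’ is purely an assumption for illustration): give everyone a smooth spread of true distances and the zig-zag appears, even though nobody actually lives at the round numbers more often.

```python
import random
from collections import Counter

def reported_distance(true_miles, rounds_to_ten):
    """Report a commute distance rounded to the nearest ten or five."""
    base = 10 if rounds_to_ten else 5
    return base * round(true_miles / base)

rng = random.Random(0)
counts = Counter()
for _ in range(100_000):
    true_miles = rng.uniform(1, 100)  # smooth spread of true distances
    to_ten = rng.random() < 0.5       # half the people round to tens
    counts[reported_distance(true_miles, to_ten)] += 1

# Multiples of ten collect reports from both groups; 25, 35, ... come
# only from the five-rounders, producing the zig-zag.
for d in (20, 25, 30, 35, 40):
    print(d, counts[d])
```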

It remains to be seen if the error is the same with respect to travel time…