Using Ngram to measure trends in spelling mistakes

Ngram is a database of words published in books.  It is a convenient and fascinating resource for all sorts of crazy stuff interesting to linguists, English professors and other word lovers.  You can search the database (or even download it) to see the patterns of word frequency in books published as far back as the 16th century.  Here is a reference to a scholarly article on the subject.

To show what Ngram can do, here is a trivial example.  When I was a kid, my friends and I often debated the correct colloquialism for underwear: ginch, gotch or gitch.  Using Ngram, I can see the frequency of usage in books published in English, and resolve the debate once and for all.  Gotch wins!

[Figure: Ngram chart comparing gotch, ginch and gitch]

(Slightly) more seriously, I used Ngram to look at the changes in frequency of misspelled words in published books between 1800 and 2000.  In particular, I was interested to see if spelling improved between 1980 and 2000 — a period which covers the introduction of personal computing and the computerized spell-checkers in word processing software. You can view the data I compiled from Ngram here.

I focus on three common misspellings: occassionaly, recieve and beleive.

Using the Ngram data, I calculated the ratio of the fractional use of the misspelled word to the fractional use of the correctly spelled word.  This ratio is an attempt to control for the secular changes in word use.  For example, maybe the word “believe” was more commonly used in books in the past than it is today.
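The ratio described above can be sketched in a few lines of Python.  This is only an illustration: the function name and the frequencies are made up for the example and are not the values in my compiled data.

```python
def misspelling_ratio(misspelled_freq, correct_freq):
    """Year-by-year ratio of misspelled to correctly spelled word frequency.

    Both arguments map a year to the fraction of all words in that year's
    books accounted for by the given spelling.
    """
    ratios = {}
    for year, wrong in misspelled_freq.items():
        right = correct_freq.get(year)
        if right:  # skip years where the correct spelling is missing or zero
            ratios[year] = wrong / right
    return ratios


# Hypothetical frequencies, for illustration only (not real Ngram values).
recieve = {1980: 2.0e-8, 2000: 1.0e-8}
receive = {1980: 4.0e-5, 2000: 5.0e-5}

ratios = misspelling_ratio(recieve, receive)
for year in sorted(ratios):
    print(year, ratios[year])
```

Because the misspelled count is divided by the correct count for the same year, a word that simply falls out of fashion moves both numbers together and leaves the ratio roughly unchanged.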

I then graphed the result:

[Figure: Spelling2017]

We can see that up until about 1980, the publication of these misspelled words is trending upwards, after which there is a rapid decline.  Here is a close-up of the last few decades of data:

[Figure: SpellingRecent]

Over this period of time, misspellings of all three of these words declined in a way consistent with the (utterly unsurprising) hypothesis that computerized spellcheckers improve spelling.  However, it is worth noting (and is somewhat surprising) that the misspelled variants have not entirely disappeared from published books, nor have they returned to the relatively lower spelling error rates seen in the early 19th century.

 

Interpreting weak effects in large studies: is dementia associated with proximity to roads?

An Ontario study investigating the risk of dementia associated with living near major highways has been getting press attention from around the world recently.  The results report that living near a major highway is responsible for a 7% increase in risk of dementia, a fairly small effect compared to many epidemiological studies.  As an academic with a healthy dose of natural skepticism (bordering on an unhealthy dose of cynicism) I was immediately doubtful of the authors’ research findings, so I read the abstract and skimmed the paper.  I saw some study design weaknesses that, when combined with the small effect size, suggest that the results should be interpreted with great caution, and reported with considerable qualification.  I will briefly comment on one specific problem I see with the research, though there are several worth considering.

What’s the problem?

The study design is a retrospective cohort, and very large; for this particular part of the study, there were over 2 million persons involved.  Large study sizes have some important advantages; for example, they mean results are likely to be more generalizable to the population as a whole. They also have more power to detect weak (though ‘statistically significant’) effects.  As I’ve noted other times on my blog, large data are better at detecting all effects–true and false.  For this reason, big data research has to be as rigorous as possible–one small systematic error can be enough to greatly affect the interpretation of the data, particularly when effects are small.

In this study, one important methodological shortcoming should cast some doubt on the observations the researchers make.  Specifically, the authors have not properly controlled for the confounding effect of income.

There is evidence that dementia has an association with income [1,2], and evidence that lower income is associated with living closer to major highways [3].  If the effect of income were not taken into account in this research, it could bias any association between dementia and living near a road–‘confounding’ our interpretation of the effect of interest.  In this particular case, the likely confounding is to produce a positive bias in the effect of interest, making the relationship between living near a busy highway and dementia seem stronger than it actually is.

To control for this, the authors did not use subject income, but rather, used neighbourhood income from the Canadian 2001 census.  Neighbourhood income is an imprecise measure of individual income, and therefore does not fully resolve the confounding problem.  Indeed, what remains is residual confounding, the effect of which is (in this situation) a probable bias in the estimated association between dementia and living near a busy highway.  This is true even if the error in neighbourhood income is random.  How much bias remains is unclear, but given the small size of the detected effect, it could easily undermine the main conclusion of the paper.  You can see a simple example of this effect in this Google Sheet I prepared.
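For readers who prefer code to spreadsheets, here is a rough simulation of the same idea.  It is a separate sketch, not a copy of the Google Sheet, and every number in it is an assumption: income, road exposure and the dementia score are treated as continuous variables, the effect sizes are invented, and road exposure is given no true effect on dementia at all.  The only thing linking the two is individual income.

```python
import random


def cov(a, b):
    """Population covariance of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n


def adjusted_slope(x, y, c):
    """Slope on x from a least-squares fit of y on x and one covariate c."""
    sxx, scc, sxc = cov(x, x), cov(c, c), cov(x, c)
    sxy, scy = cov(x, y), cov(c, y)
    return (scc * sxy - sxc * scy) / (sxx * scc - sxc ** 2)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 100_000

# Assumed data-generating process (all coefficients are made up):
income = [rng.gauss(0, 1) for _ in range(n)]               # true individual income
road = [-0.5 * z + rng.gauss(0, 1) for z in income]        # lower income, closer to road
dementia = [-0.5 * z + rng.gauss(0, 1) for z in income]    # lower income, more dementia
neighbourhood = [z + rng.gauss(0, 1) for z in income]      # income plus random error

unadjusted = cov(road, dementia) / cov(road, road)
adj_proxy = adjusted_slope(road, dementia, neighbourhood)  # what the study could do
adj_true = adjusted_slope(road, dementia, income)          # what we would like to do

print("unadjusted:", round(unadjusted, 3))
print("adjusted for neighbourhood income:", round(adj_proxy, 3))
print("adjusted for individual income:", round(adj_true, 3))
```

Under these assumptions the expected slopes are about 0.2 with no adjustment, about 0.11 after adjusting for the noisy neighbourhood measure, and about 0 after adjusting for true income.  The error in the neighbourhood measure is purely random, yet roughly half of the spurious association survives the adjustment.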

The media’s role

In spite of this, the research is probably still publication-worthy.  The fundamental science is not unreasonable–which is to say, there is a plausible biological explanation for how exposure to air pollution could result in some effect on human systems–including the brain.  Furthermore, this study is building on other research in this area [4].  However, the modest observed effect combined with the methodological shortcomings (specifically residual confounding of income) requires a high degree of qualification on the part of the authors, as well as the media writing about it.

Unfortunately, once research like this gets reported in the media (and pumped by the PR staff at the journal and the affiliated universities), qualifications are often lost–especially in newspaper headlines.  As of today (January 6, 2017), here are some headline examples:

[Images: six screenshots of news headlines]

Perhaps most readers will be thoughtfully suspicious about the results of this research, or follow up with a critical analysis of the original article, but I doubt it.  It is quite likely that many people will file these headlines into their memories as evidence of something substantive–perhaps that highways are causing dementia, or that academics have a nefarious agenda to attack motor-vehicle culture.  In either case, promoting this particular study as an important contribution to our understanding of the environmental risk factors for dementia is problematic since it lacks the rigour to justify influencing public assessments of risk or our understanding of the world.

My conclusion

The study's apparent biggest strength (the size of the cohort) is part of the problem, since it is the study size that makes the result seem important. Ceteris paribus, a large study with small biases is more likely to produce small but ‘statistically significant’ false effects than a small study with small biases.  For this reason, I think it is often good practice to interpret effect size and study size together, and that one should be especially suspicious of large studies with small apparent effects.  Large studies with methodological flaws are becoming more common in this era of big data, which means that researchers, policy makers and the public need to be more vigilant than ever, and take great care in their interpretation of findings.
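To put rough numbers on the point about study size, the sketch below computes the z-statistic you would expect from a two-group comparison when the entire difference between the groups is due to bias.  The 5% baseline risk and the group sizes are assumptions chosen for illustration; only the 7% relative increase echoes the figure reported for the study.

```python
import math


def expected_z(p_unexposed, p_exposed, n_per_group):
    """Expected z-statistic for a difference between two proportions."""
    se = math.sqrt(
        p_unexposed * (1 - p_unexposed) / n_per_group
        + p_exposed * (1 - p_exposed) / n_per_group
    )
    return (p_exposed - p_unexposed) / se


baseline = 0.05           # assumed risk in the unexposed group
biased = baseline * 1.07  # same true risk, inflated 7% purely by bias

z_small = expected_z(baseline, biased, 1_000)      # a small study
z_large = expected_z(baseline, biased, 1_000_000)  # a big-data study

print("n = 1,000 per group:     z =", round(z_small, 2))
print("n = 1,000,000 per group: z =", round(z_large, 2))
```

With 1,000 people per group the spurious difference sits far below the usual 1.96 threshold, while with a million per group the very same bias clears it many times over.  The bias has not changed; only the study's ability to detect it has.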