Using Ngram to measure trends in spelling mistakes

Ngram is a database of words published in books.  It is a convenient and fascinating resource for all sorts of crazy stuff interesting to linguists, English professors and other word lovers.  You can search the database (or even download it) to see the patterns of word frequency in books published as far back as the 16th century.  Here is a reference to a scholarly article on the subject.

To show what Ngram can do, here is a trivial example.  When I was a kid, my friends and I often debated the correct colloquialism for underwear: ginch, gotch or gitch.  Using Ngram, I can see how often each appears in books published in English, and resolve the debate once and for all.  Gotch wins!

[Figure: Ngram usage frequency of gotch, ginch and gitch]

(Slightly) more seriously, I used Ngram to look at the changes in frequency of misspelled words in published books between 1800 and 2000.  In particular, I was interested to see if spelling improved between 1980 and 2000, a period that covers the introduction of personal computing and the computerized spell-checkers in word processing software. You can view the data I compiled from Ngram here.

I focus on three common misspellings: occassionaly, recieve and beleive.

Using the Ngram data, I calculated the ratio of the fractional use of the misspelled word to the fractional use of the correctly spelled word.  This ratio is an attempt to control for the secular changes in word use.  For example, maybe the word “believe” was more commonly used in books in the past than it is today.
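A minimal sketch of this kind of ratio calculation in Python might look like the following.  The function name and the frequency values are illustrative assumptions only; the numbers are made up and are not the data compiled from Ngram.

```python
def misspelling_ratio(misspelled_freq, correct_freq):
    """Return {year: misspelled / correct} for years present in both series.

    Each argument maps a year to the fraction of all words in that year's
    books accounted for by the given spelling.
    """
    ratios = {}
    for year in sorted(misspelled_freq.keys() & correct_freq.keys()):
        correct = correct_freq[year]
        if correct > 0:  # skip years where the correct spelling never appears
            ratios[year] = misspelled_freq[year] / correct
    return ratios


# Made-up illustrative values, NOT real Ngram data.
recieve = {1980: 2.0e-8, 2000: 1.0e-8}
receive = {1980: 4.0e-5, 2000: 5.0e-5}

ratios = misspelling_ratio(recieve, receive)
for year, ratio in ratios.items():
    print(year, ratio)
```

Dividing by the frequency of the correct spelling means that a word falling out of fashion overall does not, by itself, look like an improvement in spelling.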

I then graphed the result:

[Figure: Spelling2017]

We can see that until about 1980 the publication of these misspelled words trends upward, after which there is a rapid decline.  Here is a close-up of the last few decades of data:


Over this period, misspellings of all three words declined in a way consistent with the (utterly unsurprising) hypothesis that computerized spell-checkers improve spelling.  However, it is worth noting, and somewhat surprising, that the misspelled variants have not entirely disappeared from published books, nor have they fallen back to the lower spelling error rates seen in the early 19th century.