On names and citations: part I

Have you ever wondered if the name of a researcher has a systematic influence on the citations of their work?  Me neither, until recently, when a student told me that he read somewhere that people with surnames starting with letters near the end of the alphabet are cited less than people with names starting with letters near the beginning of the alphabet.  I had a quick look for literature on this subject, but didn’t find anything.  So naturally, I set out to answer the question myself.  Specifically:

Does having a long name that starts with a letter late in the alphabet influence the number of citations a researcher’s work receives?


PubMed is a search engine for medical research; if you go to the PubMed site, you can search for articles (and retrieve abstracts) on any medical subject that you are interested in.  It has an API (application programming interface) that allows one to search and retrieve cited material within the PubMed system programmatically.  This allows for fast access and searching on a scale that can’t be matched by searching manually.  I used the R library rentrez to access this API and gather PubMed data to answer my research question.

I didn’t see it as wise (or possible) to try to pull all the PubMed data offline, so I took a random sample of articles based on the PMID, the identifier that PubMed uses to uniquely identify articles.  I did this by generating 30000 random numbers between 1000000 and 9999999, and pulling down information on all PubMed articles with PMIDs in this list of numbers.  Some randomly generated PMIDs did not have a corresponding PubMed record, but most in this interval did, and this left me with about 28000 articles.  Based on the PMID numbers I generated, publication years of the retrieved articles were between 1963 and 1999.  The list of articles excludes all literature that is not indexed in PubMed, so it’s not a perfectly representative sample of all academic research, though it is pretty representative of health/medical research over the time period.
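As a rough illustration of the sampling step only, here is a minimal Python sketch (the actual work was done in R with rentrez, and this does not call PubMed at all).  The function name `sample_pmids` and the seed are my own inventions, and drawing without replacement is an assumption; the post doesn’t say whether duplicate numbers were possible.

```python
import random

def sample_pmids(n=30000, low=1_000_000, high=9_999_999, seed=None):
    """Draw n distinct candidate PMIDs uniformly from [low, high].

    Assumption: sampling is done without replacement, so no PMID repeats.
    """
    rng = random.Random(seed)
    return rng.sample(range(low, high + 1), n)

# Hypothetical seed, used only to make the sketch reproducible.
pmids = sample_pmids(seed=42)
print(len(pmids), min(pmids), max(pmids))
```

Each candidate number would then be looked up through the API, and the ones with no matching record simply dropped.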

For each article, I summed the number of PubMed citations of that article by using the API’s link function.  This is restricted to links within PubMed sources, so it is an undercount of total citations.  I also identified the number of authors per article, the surname of the first author, the length of the first author’s surname and the first letter of the first author’s surname.
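The surname variables are simple to derive.  Here is a small Python sketch of what that step might look like; the helper `surname_features` is hypothetical, and the choices to strip whitespace, count every character, and upper-case the first letter are assumptions rather than a description of the original R code.

```python
def surname_features(surname):
    """Return the length and (upper-cased) first letter of a surname.

    Assumptions: surrounding whitespace is ignored, and every remaining
    character (including hyphens or apostrophes) counts toward the length.
    """
    cleaned = surname.strip()
    return {
        "surname": cleaned,
        "length": len(cleaned),
        "first_letter": cleaned[0].upper() if cleaned else "",
    }

features = surname_features("Yiannakoulias")
print(features)
```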

I then found a data file from the 2010 US census which has estimates of ethnicity for all the surnames of about 90% of the US population.  These data include self-identified ethnic background (White, Black, Asian, Hispanic and other), and are represented as proportions; a surname with 1.00 white would mean that 100% of the US population with that surname identify as white.  I linked this list of surnames to the surnames in the PubMed data.  I dropped all records that did not link to the US census.  The final data set has 19219 records in it.
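The linking step amounts to an inner join on surname, with unmatched records dropped.  A minimal Python sketch follows; the function `link_to_census`, the upper-case matching rule, the PMIDs and the proportions are all placeholders made up for illustration, not real census figures or real records.

```python
def link_to_census(articles, census):
    """Keep only articles whose first-author surname appears in the census
    table, attaching the proportion identifying as white.

    Assumption: names are matched case-insensitively after trimming spaces.
    """
    linked = []
    for article in articles:
        key = article["surname"].strip().upper()
        if key in census:
            linked.append({**article, "prop_white": census[key]})
    return linked

# Placeholder values for illustration only; these are NOT real census numbers.
census = {"SMITH": 0.5, "LEE": 0.5}
articles = [
    {"pmid": 1234567, "surname": "Smith"},   # hypothetical record, matches
    {"pmid": 2345678, "surname": "Zzyzx"},   # hypothetical record, dropped
]
linked = link_to_census(articles, census)
print(linked)
```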

To analyze these data I created a model that predicts the number of citations as a function of surname length, first letter, proportion white ethnicity, number of authors, year and year squared.  I used negative binomial regression because the dependent variable is a discrete count, and over-dispersion seemed likely (the variance of the dependent variable is larger than the mean of the dependent variable).
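Written out, a model of this kind might look like the following.  This is a sketch under assumptions: I am assuming the usual log link and the common "NB2" variance form, and that first letter enters as a single numeric term (for example, position in the alphabet); the post itself doesn’t spell out any of these details.

```latex
\begin{aligned}
y_i &\sim \mathrm{NegBin}(\mu_i, \alpha), \\
\log \mu_i &= \beta_0 + \beta_1\,\mathrm{length}_i + \beta_2\,\mathrm{letter}_i
            + \beta_3\,\mathrm{white}_i + \beta_4\,\mathrm{authors}_i
            + \beta_5\,\mathrm{year}_i + \beta_6\,\mathrm{year}_i^2, \\
\operatorname{Var}(y_i) &= \mu_i + \alpha\,\mu_i^2 .
\end{aligned}
```

Here $y_i$ is the citation count for article $i$ and $\alpha > 0$ captures the over-dispersion; as $\alpha \to 0$ the model collapses to Poisson regression, where the variance equals the mean.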


The result of the main model is here:

What does this table suggest?  Well first, the impact of first letter seems pretty small.  Long names, however, seem to be a liability when it comes to citations per article.  Further, the number of authors has a considerable impact on citations (more authors means more citations) and having a name that is typically associated with white self-identification is also associated with more citations.

To help contextualize these results, I considered two scenarios related to my name.  Scenario 1 is the ‘real world’ in which I have the ridiculously long surname Yiannakoulias I was assigned at birth.  Scenario 2 assumes that I took my mother’s surname (which is short, and starts nearer to the front of the alphabet).  What is the difference between these scenarios if we were to estimate citations of a paper published in 1999 with 5 co-authors?

Scenario #1 (real-world) predicted citations: 7.07

Scenario #2 (mother’s surname): 9.34

Aha!  My surname is a liability!  Or is it?  I ran the same model again, but this time adding an interaction term between length of name and self-identified ethnic status.  My intuition is that a long name that is ‘white’ is different from a long non-white/less-white name.  The results of this updated model are here:

I also have model predictions of citations per paper based on the same two scenarios:

Scenario #1 (real-world) predicted citations: 8.39

Scenario #2 (mother’s surname): 9.16

This result suggests the whiteness of my name offsets most of the disadvantage of having a long name, and the letter ‘Y’ doesn’t really have much of an effect.  This answers the question I set out to answer from the beginning.
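To see how much the gap between the two scenarios shrinks, here is the arithmetic as a short Python sketch.  The four predicted values are the ones reported above; the dictionary layout and the helper `citation_gap` are just my own illustration.

```python
# Predicted citations reported for each model and scenario.
scenarios = {
    "main model": {"real": 7.07, "mother": 9.34},
    "with interaction": {"real": 8.39, "mother": 9.16},
}

def citation_gap(pred):
    """Absolute and relative shortfall of the long-name prediction."""
    gap = pred["mother"] - pred["real"]
    return gap, gap / pred["mother"]

gaps = {name: citation_gap(pred) for name, pred in scenarios.items()}
for name, (gap, rel) in gaps.items():
    print(f"{name}: gap = {gap:.2f} citations ({rel:.1%} of the short-name prediction)")
```

The gap drops from roughly two and a quarter citations in the main model to well under one once the interaction term is included.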


There are a number of potential weaknesses to this analysis; the main one is the link to the list of surnames.  People with very uncommon surnames are excluded from my analysis, which could definitely introduce a bias in the findings–unusual surnames may be cited less by virtue of their infrequency.

Nevertheless, I think I can probably draw some tentative conclusions here; as much as I’d like to chalk up my scholarly mediocrity to the misfortune of having a cumbersome surname, I can’t.  My name is not much of a liability to my academic career.  More generally, having a name with a first letter starting at the end of the alphabet doesn’t really matter when it comes to PubMed citations, and even long names don’t matter much–provided they are ‘white’ names.  This somewhat alarming ethnic bias requires some more exploration, which I will look into in part II of this analysis in the upcoming weeks…


Publicly funded snow shovelling

In preparation for a class this week, I have done a simple little analysis to explore the economic case for publicly funded snow shovelling.   A number of studies have been done on the impacts of snow shovelling on heart attacks, and in the more general field of physical exertion and heart attacks.  Given the research, it seems reasonable to consider whether or not a public system of snow shovelling could save money.  Below I describe each step in the analysis.

1. Shovelling snow causes heart attacks

According to Auger et al., 2017, a man’s risk of myocardial infarction (MI) is 1.34 times higher the day following a major snowfall (20+ cm of snow) compared to no snowfall.  This is a relative risk; that is, the risk associated with snowfall compared to no snowfall.  In order to assess the impact this has on public health, we need to have some measure of absolute heart attack risk–like an incidence rate.  It’s tricky to estimate the incidence rate in this case, however.  Using the annual incidence rate (around 0.002) would be way too high, since this incidence is estimated over the whole year.  Indeed, the probability of heart attack on a single day with 20 or more centimetres of snow is probably at least two orders of magnitude lower than the annual incidence rate.  Let’s assume that the baseline incidence is 0.00002–this is roughly the daily risk of MI.  This means that every day there is a 20 cm snowfall, the risk of heart attack is 0.0000268 (0.00002 x 1.34).

2. There are 160,000 detached houses in Hamilton

How many men are exposed to the hazards of snow shovelling?  I base this estimate on the 2011 NHS, from which I pulled the number of households in Hamilton, and divided it by two.  This assumes half the time women shovel, and half the time men shovel, and that on average, every household has one man in it.  This gives us 80,000 men exposed to shovelling.  This is probably an overestimate, since some people hire snow-clearing companies, and some households just don’t bother shovelling at all.

3. The risk attributable to snowfall is…

Based on step 1, the risk of heart attack on the day following 20 cm of snowfall = 0.0000268.  The risk attributable to shovelling is the difference between the risk of heart attack among the exposed and the baseline risk: 0.0000268 – 0.00002 = 0.0000068 (or about 6.8 per million people).

To find out how many people suffer heart attacks in Hamilton due to a major snowfall event, we simply multiply the attributable risk by the exposed population:

0.0000068 x 80,000 = 0.544

These numbers suggest that once every two years, someone has a heart attack as a result of shovelling snow in Hamilton, assuming there is one 20 cm snow event per year.  One 20 cm snow event per year is probably a bit of an over-estimate, but reasonable enough based on these data.
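For anyone who wants to play with the assumptions, steps 1 to 3 can be sketched in a few lines of Python.  The numbers are the ones used above; the variable names are mine, and the single snow event per year is the same assumption made in the text.

```python
baseline_daily_risk = 0.00002     # assumed daily risk of MI (step 1)
relative_risk = 1.34              # day after a 20+ cm snowfall (Auger et al., 2017)
households = 160_000              # Hamilton households (step 2)
exposed_men = households // 2     # assumes men shovel half the time
snow_events_per_year = 1          # assumed number of 20 cm snowfalls per year

risk_after_snowfall = baseline_daily_risk * relative_risk
attributable_risk = risk_after_snowfall - baseline_daily_risk
expected_cases_per_year = attributable_risk * exposed_men * snow_events_per_year

print(f"risk after snowfall: {risk_after_snowfall:.7f}")
print(f"attributable risk:   {attributable_risk:.7f}")
print(f"expected heart attacks per year: {expected_cases_per_year:.3f}")
```

Changing `snow_events_per_year` or the share of households that actually shovel shows how sensitive the final number is to these guesses.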

4. How much does a heart attack cost the economy?

A complete assessment should include all costs, including health care, lost productivity, etc.  However, many of these costs are very hard to measure.  In Canada, health care costs are around $30,000 per MI.  However, other losses could be greater, especially in the long run, and especially if we tried to price the value of a life.  Let’s say that each heart attack costs $150,000.

5. Cost-benefit

If we multiply 0.544 x 150,000 we get the expected annual costs of heart attacks due to big snowfall events in Hamilton.  This gives us about $82,000 a year.  Given the large number of households to shovel (160,000) and the costs of shovelling them (even a modest $25 per household per season costs $4,000,000 a year) there is no economic case for snow shovelling, at least when it comes to heart attacks.
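The comparison itself is a two-line calculation; here it is as a small Python sketch using the figures from steps 3 to 5.  The variable names are my own, and the $150,000 and $25 figures are the rough assumptions made above, not measured costs.

```python
expected_cases_per_year = 0.544        # from step 3
cost_per_heart_attack = 150_000        # assumed total cost per MI, in dollars
households = 160_000                   # households to shovel
shovelling_cost_per_household = 25     # assumed dollars per household per season

expected_heart_attack_cost = expected_cases_per_year * cost_per_heart_attack
program_cost = households * shovelling_cost_per_household
cost_ratio = program_cost / expected_heart_attack_cost

print(f"expected annual heart attack cost: ${expected_heart_attack_cost:,.0f}")
print(f"annual shovelling program cost:    ${program_cost:,.0f}")
print(f"program cost is about {cost_ratio:.0f} times the expected savings")
```

Even if the cost per heart attack were several times higher, the program would still cost far more than it could save on this measure alone.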

When put into the context of public health impact for a small city like Hamilton, the results of Auger et al., 2017 are not particularly compelling.  Even if the relative risk estimate is correct, the actual impact on the population in this city is probably pretty small.  For the entire country, the impact is greater; I’d ballpark it at 40 heart attacks as a result of shovelling snow, and maybe 3 or so deaths a year.  Still, given that there are around 50,000 deaths a year (in Canada) as a result of heart disease, it seems that the impact of snowfall on MI is pretty small.

It’s also worth noting that snow shovelling may have health benefits–like exercise, and sharing time with neighbours.  Since exercise improves health, and possibly saves money, this would make the economic case for publicly funded snow shovelling even weaker than what’s presented here.

Perils of user generated epidemiology

I recently worked with some students on a short analysis of Lyme disease content on YouTube.  I thought the results of the work were worth writing up for publication.  The paper will be published in the journal Social Science & Medicine in the fall, and the online manuscript version is available for download until October 29th:


The findings are pretty intuitive.  There is considerable YouTube content on Lyme disease, but most of it is neither scientific nor focused on infection prevention.  Personal stories and videos about celebrities are popular and are among the most common content available.  Public health agencies and academics produce very little content, and the content they do produce doesn’t receive much viewer interest. This may explain why the public health content that is most available on YouTube is often inconsistent with best practices recommended by experts in the field.

Given the large number of video views on Lyme disease, and the absence of content published by experts, it seems that public health agencies should create more content for YouTube.  This is not simply a scholarly exercise, but a scientific and even moral imperative–many people get information from YouTube and other forms of online user generated media, and need to be provided with better information.  To maximize efficacy, these agencies should generate content that combines evidence-based public health information with personal stories about people who have experience living in areas where Lyme disease risk is high.

So bad it’s good

When I was a kid, watching the Littlest Hobo was little more than an unwanted reminder that my family didn’t have cable TV.

Thanks to YouTube, I can now experience the show as an adult, and pretend that its campy low-quality production value was intentional.

Here is a link to episode 1: https://www.youtube.com/watch?v=tnCyMpl4dhk

I think the lessons in this episode are obvious:

  • Don’t start forest fires
  • Don’t leave poisoned meat out for toddlers to consume
  • Dogs are great parachutists

Kindergarten Encryption

My wife and I have discovered the perfect (though time-limited) encryption technology–our five-year-old daughter!

Here is the process.  One of us relays a message to our daughter.  She writes it down.  The message is now encrypted.  Nobody other than my wife and I can actually understand what she has written.  In fact, not even my daughter can figure it out.  Here’s an example:

There are a few words in there that you can probably make out, but I’m willing to bet that you can’t translate it in its entirety.  For me?  No problem:

“Thanks for bringing me to the fair. And then Lisa’s cat sped for the loop-the-loops.  Oh, I guess he wants to go on the loop-the-loop. Oh I don’t want to call him ‘my cat’ any more.  Maybe we could call him ‘loopy’!”

Here’s a more challenging example:

Can you figure it out?

So if I had to send a secret message to my wife while she was away from home, I would simply get my daughter to write it down and then give it to a courier.  I wouldn’t have to worry about the message getting intercepted, since anyone who read it would be clueless as to its meaning.

We’re lucky the Germans never figured this out during World War II.  All they needed was to put a kindergartener on every U-boat, and we’d have never won the war.

The only problem is that in a year or so my daughter’s writing will improve to the point where pretty well anyone will be able to decipher it.  Yet another reason to wish she wouldn’t grow up so fast…