On names and citations: part I

Have you ever wondered if the name of a researcher has a systematic influence on the citations of their work?  Me neither, until recently, when a student told me he had read somewhere that people with surnames starting with letters near the end of the alphabet are cited less than people with surnames starting with letters near the beginning.  I had a quick look for literature on this subject, but didn’t find anything.  So naturally, I set out to answer the question myself.  Specifically:

Does having a long name that starts with a letter late in the alphabet influence the number of citations a researcher’s work receives?

Approach

PubMed is a search engine for medical research; on the PubMed site you can search for articles (and retrieve abstracts) on any medical subject that interests you.  It has an API (application programming interface) that lets you search and retrieve records within the PubMed system programmatically, which allows fast, large-scale access that manual searching cannot match.  I used the R library rentrez to access this API and gather the PubMed data for my research question.

It didn’t seem wise (or possible) to pull all of the PubMed data offline, so I took a random sample of articles based on the PMID, the identifier that PubMed assigns uniquely to each article.  I did this by generating 30000 random numbers between 1000000 and 9999999 and pulling down information on all PubMed articles with PMIDs in this list.  Some randomly generated PMIDs did not have a corresponding PubMed record, but most in this interval did, which left me with about 28000 articles.  Given the PMIDs I generated, the publication years of the retrieved articles fall between 1963 and 1999.  The sample excludes all literature that is not indexed in PubMed, so it’s not a perfectly representative sample of all academic research, though it is pretty representative of health/medical research over the time period.
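The sampling step can be sketched in a few lines. This is a minimal illustration in Python, not the R/rentrez code used for the analysis. It assumes the IDs were drawn without replacement (the post does not say either way), and the seed and the batch size of 200 are hypothetical choices for the example. No API call is made here; the sketch only produces candidate PMIDs and groups them into request-sized chunks.

```python
import random

PMID_MIN = 1_000_000   # lower bound quoted in the post
PMID_MAX = 9_999_999   # upper bound quoted in the post
SAMPLE_SIZE = 30_000   # number of random IDs quoted in the post


def sample_pmids(n=SAMPLE_SIZE, low=PMID_MIN, high=PMID_MAX, seed=None):
    """Return n distinct candidate PMIDs drawn uniformly from [low, high].

    Assumption: sampling without replacement, so no ID is requested twice.
    """
    rng = random.Random(seed)
    # range() is lazy, so sampling from it does not build a 9-million-item list.
    return rng.sample(range(low, high + 1), n)


def batches(ids, size=200):
    """Yield successive chunks of IDs, e.g. one chunk per API request.

    The chunk size is a hypothetical value, not a documented API limit.
    """
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


pmids = sample_pmids(seed=42)
print(len(pmids), "candidate PMIDs")
print(sum(1 for _ in batches(pmids)), "batches of up to 200 IDs")
```

Not every candidate ID corresponds to a real record, so the number of articles retrieved will be smaller than the number of IDs generated, as it was here (about 28000 of 30000).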

For each article, I counted the number of PubMed citations using the API’s link function.  This is restricted to links within PubMed sources, so it is an undercount of total citations.  I also recorded the number of authors per article, the surname of the first author, the length of that surname and its first letter.
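Deriving the surname variables is simple string handling. The sketch below is a Python illustration under a few assumptions of mine that the post does not specify: length counts letters only (hyphens, spaces and apostrophes are ignored), the first letter is upper-cased, and an alphabet position (A = 1, Z = 26) is returned as one possible numeric coding of "first letter". The function name `surname_features` is hypothetical.

```python
import string


def surname_features(surname):
    """Return length, first letter and alphabet position for a surname.

    Assumptions (not from the original analysis): only alphabetic
    characters count toward length, and initials outside A-Z get no
    alphabet position.
    """
    cleaned = surname.strip()
    letters = [ch for ch in cleaned if ch.isalpha()]
    first = letters[0].upper() if letters else None
    if first is not None and first in string.ascii_uppercase:
        position = string.ascii_uppercase.index(first) + 1  # A=1 ... Z=26
    else:
        position = None
    return {
        "surname": cleaned,
        "length": len(letters),
        "first_letter": first,
        "alphabet_position": position,
    }


for name in ["Yiannakoulias", "O'Brien", "Smith-Jones"]:
    print(surname_features(name))
```

Whether the first letter enters a model as a position or as a set of categories is a separate modelling choice; the position is included here only for convenience.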

I then found a data file from the 2010 US census which has estimates of ethnicity for all the surnames of about 90% of the US population.  These data include self-identified ethnic background (White, Black, Asian, Hispanic and other), and are represented as proportions; a surname with 1.00 white would mean that 100% of the US population with that surname identify as white.  I linked this list of surnames to the surnames in the PubMed data.  I dropped all records that did not link to the US census.  The final data set has 19219 records in it.
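The linking step is an inner join on surname: articles whose first-author surname has no match in the census list are dropped. Here is a minimal Python sketch of that logic. The field names (`surname`, `prop_white`), the case-insensitive matching rule and all of the values are made up for illustration; they are not taken from the census file or from my data.

```python
def link_to_census(articles, census):
    """Keep only articles whose first-author surname appears in the census table.

    articles: list of dicts, each with a "surname" key.
    census:   dict mapping surname -> dict of ethnicity proportions.
    Matching is case-insensitive (an assumption about how names were linked).
    """
    lookup = {name.strip().upper(): props for name, props in census.items()}
    linked = []
    for article in articles:
        key = article["surname"].strip().upper()
        if key in lookup:
            # Merge the census proportions into the article record.
            linked.append({**article, **lookup[key]})
    return linked


# Toy inputs with invented proportions, purely to show the mechanics.
toy_census = {"EXAMPLE": {"prop_white": 0.80}, "SAMPLE": {"prop_white": 0.40}}
toy_articles = [
    {"pmid": 1234567, "surname": "Example"},
    {"pmid": 2345678, "surname": "Unlisted"},  # no census match, so dropped
    {"pmid": 3456789, "surname": "sample"},
]

linked = link_to_census(toy_articles, toy_census)
print(len(linked), "of", len(toy_articles), "articles linked")
```

Because unmatched records are discarded rather than kept with missing values, the linked data set is necessarily smaller than the retrieved sample, and authors with rare surnames are the ones who fall out.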

To analyze these data I created a model that predicts the number of citations as a function of surname length, first letter, proportion white ethnicity, number of authors, year and year squared.  I used negative binomial regression because the dependent variable is a discrete count, and over-dispersion seemed likely (the variance of the dependent variable is larger than the mean of the dependent variable).
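A quick way to see whether over-dispersion is a concern is to compare the sample variance of the counts with their mean: under a Poisson model the two should be roughly equal, so a ratio well above 1 points toward something like the negative binomial. The Python sketch below shows that check on invented citation counts; it illustrates the reasoning only and is not the model-fitting code. Strictly speaking, the relevant comparison is conditional on the predictors, so this raw ratio is only a first look.

```python
from statistics import fmean, variance


def dispersion_ratio(counts):
    """Return sample variance divided by mean for a list of counts.

    A ratio near 1 is consistent with a Poisson distribution; a ratio
    well above 1 suggests over-dispersion.
    """
    mean = fmean(counts)
    if mean == 0:
        raise ValueError("mean is zero; the ratio is undefined")
    return variance(counts) / mean


# Invented counts: many low values and one heavily cited paper.
toy_citations = [0, 0, 1, 2, 3, 5, 8, 40]
ratio = dispersion_ratio(toy_citations)
print("over-dispersed" if ratio > 1 else "not over-dispersed")
```

Citation counts typically look like the toy example, with most papers cited rarely and a few cited a lot, which is why over-dispersion seemed likely from the start.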

Findings

The result of the main model is here:

What does this table suggest?  Well, first, the impact of first letter seems pretty small.  Long names, however, seem to be a liability when it comes to citations per article.  Further, the number of authors has a considerable impact on citations (more authors means more citations), and having a name that is typically associated with white self-identification is also associated with more citations.

To help contextualize these results, I considered two scenarios related to my name.  Scenario 1 is the ‘real world’ in which I have the ridiculously long surname Yiannakoulias I was assigned at birth.  Scenario 2 assumes that I took my mother’s surname (which is short, and starts nearer to the front of the alphabet).  What is the difference between these scenarios if we were to estimate citations of a paper published in 1999 with 5 co-authors?

Scenario #1 (real-world) predicted citations: 7.07

Scenario #2 (mother’s surname): 9.34

Aha!  My surname is a liability!  Or is it?  I ran the same model again, this time adding an interaction term between length of name and self-identified ethnic status.  My intuition is that a long name that is ‘white’ is different from a long non-white/less-white name.  The results of this updated model are here:

I also have model predictions of citations per paper based on the same two scenarios:

Scenario #1 (real-world) predicted citations: 8.39

Scenario #2 (mother’s surname): 9.16

This result suggests the whiteness of my name offsets most of the disadvantage of having a long name, and the letter ‘Y’ doesn’t really have much of an effect.  This answers the question I set out to answer from the beginning.
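To put the two sets of predictions on the same footing, it helps to express each gap as a percentage of the real-world prediction. The short Python sketch below does that arithmetic using only the four predicted values reported above; the helper name `percent_gap` is mine.

```python
def percent_gap(baseline, alternative):
    """Percentage by which `alternative` exceeds `baseline`."""
    return 100.0 * (alternative - baseline) / baseline


# Predicted citations reported above (Scenario 1 = real surname,
# Scenario 2 = mother's surname).
main_model = percent_gap(7.07, 9.34)
interaction_model = percent_gap(8.39, 9.16)

print(f"main model gap: {main_model:.1f}%")
print(f"interaction model gap: {interaction_model:.1f}%")
```

The gap shrinks from roughly 32% in the main model to roughly 9% once the interaction between name length and ethnicity is included.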

Conclusion

There are a number of potential weaknesses to this analysis; the main one is the link to the list of surnames.  People with very uncommon surnames are excluded from my analysis, which could definitely introduce a bias in the findings: unusual surnames may be cited less by virtue of their infrequency.

Nevertheless, I think I can draw some tentative conclusions here; as much as I’d like to chalk up my scholarly mediocrity to the misfortune of having a cumbersome surname, I can’t.  My name is not much of a liability to my academic career.  More generally, having a surname that starts with a letter near the end of the alphabet doesn’t really matter when it comes to PubMed citations, and even long names don’t matter much, provided they are ‘white’ names.  This somewhat alarming ethnic bias requires some more exploration, which I will look into in part II of this analysis in the upcoming weeks…