The cost of research publications for Canadian universities

How much does it cost to publish research? Obviously this varies considerably across disciplines, but I did a little comparison across a non-random selection of Canadian universities. The sources of data include Web of Science publication counts, Web of Science counts of highly cited papers, and funding data from this source.

Once I combined these data, I came up with this file.

The data include research fund and paper publication totals by year and institution.  Arguably research funds are used to do things other than fund the publication of research–like training graduate students–but this is a rough starting point. Furthermore, published research is a key metric of measuring research output for a university, and at the very least, is probably a good proxy for research output more generally, especially in comparisons over time and between universities.


These data tell us a few interesting things. First, research productivity is trending upwards.

Papers published by year for different universities

This pattern is true for all Web of Science publications (on the left) and highly cited Web of Science publications (on the right).  The rate of increase is fairly similar across institutions.

Second, universities receive between $50,000 and $100,000 in funding per Web of Science paper. This is probably an over-estimate of the per-paper research costs, since many published papers do not end up in the Web of Science system. Perhaps the real number is half this value (say, between $25,000 and $50,000 per paper), but this gives us a basis for comparison.

Third, across Canadian universities in this sample, there is some noteworthy variation. Averaged over the period of time I used in this sample, funding per paper varies between $61K and $89K.
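For readers who want to reproduce this kind of figure, here is a minimal sketch of the calculation. The institutions and numbers are hypothetical placeholders, not the values in my data file, and it assumes cost per paper is computed as total funding divided by total papers over the whole period (rather than averaging the yearly ratios).

```python
# Hypothetical figures for illustration only; not the values from the data file.
records = [
    # (institution, year, research funding in dollars, Web of Science papers)
    ("University A", 2015, 400_000_000, 5_000),
    ("University A", 2016, 420_000_000, 5_500),
    ("University B", 2015, 90_000_000, 1_000),
    ("University B", 2016, 88_000_000, 1_100),
]


def cost_per_paper(rows):
    """Return {institution: total funding / total papers} pooled over all years."""
    funding, papers = {}, {}
    for institution, _year, dollars, count in rows:
        funding[institution] = funding.get(institution, 0) + dollars
        papers[institution] = papers.get(institution, 0) + count
    return {name: funding[name] / papers[name] for name in funding}


costs = cost_per_paper(records)
for name, cost in sorted(costs.items()):
    print(f"{name}: ${cost:,.0f} per paper")
```

Pooling the totals before dividing weights each year by its volume of papers, so a year with few papers does not dominate the average.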

The University of Saskatchewan spends the most per paper, and the University of Toronto spends the least. This could reflect some economy of scale effect; the U of T is big, and is able to leverage its size to be more productive per dollar spent. If we look at costs per highly cited paper, we see a similar pattern, but it’s more exaggerated. You can see that the cost per highly cited paper is 5 times higher for the University of Saskatchewan than for the U of T. Also note that while Queen’s University publishes papers at a lower average cost than McMaster, McMaster spends much less on highly cited papers: roughly half as much per highly cited paper as Queen’s.

The good news (depending on your perspective, I suppose) is that the expenditure per paper published is trending downwards.

All Web of Science papers

Between 2008 and 2016, most institutions in this sample see an improvement of publication efficiency in terms of overall numbers, mainly because of an increase in the number of papers published with a stable (or slightly declining) funding level.  Similarly, we seem to be getting more research ‘bang’ today when measured by highly cited publications as well.  Note the variation for the University of Saskatchewan; this is largely due to the small numbers involved.

Only highly cited papers in Web of Science


What does this all mean? First, we know, at least within an order of magnitude, how much we should expect to pay for publications, on average. This certainly varies by discipline, but it gives us a ballpark for comparisons across research intensive universities. If you can get one paper out for $10,000 or less, then you are probably doing well in most fields. If it costs more than $100,000, then you probably have a lot of research overhead in lab equipment, staff, etc. Some universities seem to spend more, like the University of Saskatchewan, and some spend less, like the University of Toronto. This makes sense given the location of these institutions; the University of Saskatchewan is more isolated geographically, and is the lone research intensive university in Saskatchewan (sorry, Regina). Toronto is a large institution surrounded by other universities, and at the centre of economic activity in Canada. This probably allows it to leverage a mix of resources that increases its efficiency at publication.

Second, between 2008 and 2016, research funding in Canada did not radically change, but cost per paper went down.  This is mostly because the number of papers and highly cited papers in Web of Science went up.  This could be a good sign; the universities in this sample managed to adapt to a stable (and slightly declining) research funding pot.

The statistic of a statistic problem

One easy (and not uncommon) mistake in data analysis is to calculate a statistic from a statistic without considering statistical weighting.  For illustration purposes, consider the following example.

Say I have data on neighbourhood income and population for a small city.  The table of data look like this:


Perhaps the first thing I want to know is the average income for the city as a whole.  It seems pretty natural to simply take the average of the average incomes across these neighbourhoods.  This would give me an average income for the city of $68,712.  However, this number is incorrect.  Taking the average of the average assumes that each neighbourhood contributes the same amount of information to the city average.  This is clearly not the case.  Gastown has 305 residents, and so the average income of these residents clearly contributes less information to the city average than Zinger Park, which has 3900 residents.

The solution

The solution is to simply take the weighted average. In this example, the weight is the population in the neighbourhood (perhaps better would be the population of employed people in the neighbourhood). If we sum the products of these weights and average income, and then divide this by the sum of the weights, we get a weighted average. Here is a table that illustrates this visually:

The red cell is the sum of the product of weights and average income.  The yellow cell is the sum of the weights (in this case, population).  If I divide the red cell by the yellow cell, I get the weighted average (in orange).
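The same calculation can be sketched in a few lines of Python. Only the populations of Gastown and Zinger Park come from the example above; the incomes and the third neighbourhood are made up for illustration, so the results will not match the city figures in this post.

```python
# Populations for Gastown and Zinger Park come from the text above.
# Every income, and the third neighbourhood, is hypothetical.
neighbourhoods = [
    # (name, population, average income in dollars)
    ("Gastown", 305, 90_000),
    ("Zinger Park", 3900, 50_000),
    ("Hypothetical Heights", 1200, 70_000),
]


def unweighted_mean(rows):
    """The naive 'average of averages': every neighbourhood counts equally."""
    return sum(income for _name, _pop, income in rows) / len(rows)


def weighted_mean(rows):
    """Population-weighted average income."""
    weighted_sum = sum(pop * income for _name, pop, income in rows)  # the red cell
    total_weight = sum(pop for _name, pop, _income in rows)          # the yellow cell
    return weighted_sum / total_weight                               # the orange cell


naive = unweighted_mean(neighbourhoods)
weighted = weighted_mean(neighbourhoods)
print(f"Average of averages: ${naive:,.0f}")
print(f"Weighted average:    ${weighted:,.0f}")
```

With these made-up numbers the weighted average is lower than the naive one, because the largest neighbourhood has the lowest average income and it pulls the city figure towards itself.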


Weights are common in statistical data analysis, and their role is usually to adjust a statistic based on the information it contains. In this example it’s pretty straightforward. None of this is rocket science, but taking an unweighted average of averages (or average of proportions, or average of any statistic) is done all the time. I see it in academic work, public reports and newspaper articles. It’s an easy mistake to make with a (usually) easy fix.

Is there a surgical mortality cluster in a Florida hospital?

I recently read a story in the National Post about a physician and hospital that were implicated in a CNN story about surgery related deaths in Florida.  It is a useful example of the challenge of communicating health risk in a way that is truthful and useful to the public. The specific issue concerns deaths related to a particular surgical intervention involving newborns with congenital heart anomalies, and whether or not the death rate among these patients at one hospital is higher than the death rates at hospitals across the country. Here are links to some CNN stories on the subject:

The CNN reports suggest that there were 9 surgical deaths between the end of 2011 and June 2015. Based on numbers provided in the stories, it seems there were 27 surgeries in 2012, 23 in 2013, 18 in 2014. Using these numbers, I’ll assume that there were 9 surgeries between January 1 and June 2015. This gives a total of 77 surgeries (approximately) over this period, and a surgical death rate of around 12%. The national average death rate is closer to 3%, meaning that this particular hospital’s death rate is 4 times higher than the national average if we assume all the above numbers are correct.

Now if we further assume that the expected number of deaths is equal to the national average rate times the number of surgeries performed, we should expect around 2.5 deaths at this hospital over this time period, with around a 1 in 1000 chance of getting as many as 9 deaths if the true risk of death at this hospital were actually the same as the national average.
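Here is a rough sketch of that calculation, assuming each surgery is an independent trial with the same risk of death (a binomial model; this is one reasonable choice, and a Poisson approximation would give a similar answer). With a rate of exactly 3% the expected count comes out at about 2.3 rather than 2.5, and the tail probability is on the order of one in a thousand or a little less, so the conclusion is the same.

```python
from math import comb


def binomial_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Surgery counts from the CNN stories; the 9 for early 2015 is my assumption.
surgeries = 27 + 23 + 18 + 9
national_rate = 0.03      # approximate national death rate
observed_deaths = 9

expected_deaths = surgeries * national_rate
p_nine_or_more = binomial_tail(surgeries, national_rate, observed_deaths)

print(f"Surgeries: {surgeries}")
print(f"Expected deaths at the national rate: {expected_deaths:.2f}")
print(f"P(at least {observed_deaths} deaths): {p_nine_or_more:.5f}")
```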

Based on these rough calculations, there would seem to be good reason for some follow-up investigation, and CNN has uncovered an important problem.  However, the hospital released information questioning the data CNN used in their report in early June of 2015, claiming that the true surgical mortality rate is 4.9% over the same period.  Furthermore, this hospital reports their data to the Congenital Heart Surgery Database maintained by the Society of Thoracic Surgeons (STS), which supports the 4.9% estimate.

A need for more prospective surveillance

I have not dug into what explains these different results, but I suspect that the hospital’s numbers adjust for differences in patient complexity, and perhaps other patient attributes.  Much of the focus so far has been on the hospital’s surgical record and the rigour of CNN’s reporting, but the deeper issue concerns risk communication and whether or not either of these parties can be expected to fully serve the public’s interest.  CNN is incentivized to tell an engaging story; hospitals are incentivized to perform procedures that are profitable.  Most of the time most people in both these organisations mean well, but these good intentions might sometimes be secondary to institutional, professional and other motivations.

One solution is to improve routine surveillance and public reporting. Whether performed by government or merely regulated by government, routine and prospective surveillance of surgical outcomes by some impartial third party can help avoid some potential conflicts of interest. Furthermore, ensuring a regular and routine flow of data into the public sphere could improve public trust. To some extent, this is done by the STS in the routine reporting of surgical outcomes, though it’s not clear whether the reporting system has any regulatory oversight, and, as of late 2017, it only includes about two thirds of enrolled program participants across the U.S.

Public trust is not helped by the reporting of false information, or delayed and/or unprofessional reactions of health professionals or hospitals.  Given the stakes of the problem, I feel that this episode would have been avoided altogether had there been a routine and regulated prospective surveillance system with clear thresholds for investigation already in place.  Without such a system, these apparent clusters will continue to emerge, more stories will be told, and more members of the public will feel exasperated by the conflicting information about surgical risk.

Geography of crime: population spillover problem in crime mapping

The problem

Consider the following scenario:

You live in a small, low-population suburb. A complex of apartments is built in the neighbouring area. The population density in the complex is very high, unlike the area in which you live. Over time, you observe the frequency of petty crime increase in your neighbourhood. You and your neighbours reason that this is mostly due to the residents of the new apartment complex. Your complaints about crime germinate into hostile feelings towards your apartment-dwelling neighbours, and you start to form beliefs about them as people. More specifically, since a disproportionate number of them also have a different cultural and ethnic background than people living in your neighbourhood, you start to believe that people with this background are more likely to commit crime.

In this scenario, let us assume that the following statements are objectively true:

  1. The number of crimes in the general area has gone up because of the people in the neighbouring apartment complex
  2. Risk of being a victim of crime has gone up in your neighbourhood because of the people in the neighbouring apartment complex

In spite of the above statements, it does not also follow that people in the apartment complex are more likely to commit crime than the people living in the low housing density neighbourhood nearby, or indeed, that there is any difference in anyone’s predilection to commit crime in the area generally.  If the reason for this isn’t obvious, I’ll elaborate on why this is the case below.  However, for the moment, it’s easy to see how a person could connect an increase in frequency of crimes with the predilection for committing crime based on a fairly reasonable and natural assessment of the facts.  The problem is that the frequency and rate of crime do not tell us about the people who commit crime; a geographical cluster of high crime rate can occur even if there is no geographic pattern in the likelihood that people will commit crime.

What is going on?

If all people committed crime in the neighbourhoods in which they lived, and crime rates were calculated only in neighbourhoods, then a neighbourhood’s crime rate could be a pretty good indicator of crime risk as well as the disposition to commit crime. However, people do not restrict themselves to neighbourhood boundaries; offenders go where the opportunities present themselves, and for some types of crime, that could mean travelling some distance away from home. There is empirical and theoretical research on how offenders travel to crime, and this travel varies by crime type, age and other factors. For types of crime that are committed outside the home, where crime happens may tell us very little about where the offender is from.

The effect here is a ‘spillover’ in crime from higher population areas to lower population areas that can cause the apparent risk of crime in low population areas to be high even if the disposition to commit crime and the suitable targets for committing crime are geographically constant and non varying (‘spatially homogeneous’).  There has been considerable research [1,2,3,4]  into the spillover effect of public housing on crime in the US (most of which has found little relationship between public housing and crime in neighbouring areas) but I am not sure if anyone has analysed this population spillover effect specifically.

This diagram illustrates how this spillover can occur, and the effect on crime rate.

The two black squares are neighbourhoods: one with a small population (100 people) and one with a large population (1000 people). The curved black lines are the ‘trips’ that an offender takes from place of residence to where they commit the criminal offence. The green circles are the residences, and the red squares are the offence locations. In this example, the left hand square (a low population neighbourhood) has the same proportion of criminals as the right hand square (high population neighbourhood). But because of the spillover effect, the crime rate is much higher in the low population area.
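The arithmetic behind the diagram can be sketched as follows. The populations (100 and 1000) are the ones in the diagram; the share of residents who offend, the number of crimes per offender, and the share of crimes committed in the other neighbourhood are assumptions I picked purely for illustration.

```python
def crime_rates(pop_small, pop_large, offender_share, crimes_per_offender, travel_share):
    """Return crimes per resident in the (small, large) neighbourhoods.

    offender_share is identical in both neighbourhoods; travel_share is the
    fraction of each offender's crimes committed in the other neighbourhood.
    """
    crimes_from_small = pop_small * offender_share * crimes_per_offender
    crimes_from_large = pop_large * offender_share * crimes_per_offender

    # Crimes that occur in each area: those that stay home plus those that spill in.
    in_small = crimes_from_small * (1 - travel_share) + crimes_from_large * travel_share
    in_large = crimes_from_large * (1 - travel_share) + crimes_from_small * travel_share

    return in_small / pop_small, in_large / pop_large


# Hypothetical: 5% of residents offend once each, 30% of offences happen next door.
small_rate, large_rate = crime_rates(100, 1000, 0.05, 1, 0.3)
print(f"Crimes per resident, small neighbourhood: {small_rate:.4f}")
print(f"Crimes per resident, large neighbourhood: {large_rate:.4f}")

# With no travel, both neighbourhoods show the same rate.
no_travel_small, no_travel_large = crime_rates(100, 1000, 0.05, 1, 0.0)
```

Under these assumptions the small neighbourhood ends up with a rate about five times that of the large one, even though residents of both are equally likely to offend.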

Why does this matter?

One of the consequences of this spillover is that it can lead to an inferential fallacy; the fact that crime rates are higher in the low population neighbourhood as a result of the high population neighbourhood could lead to incorrect generalisation about the individuals in the high population neighbourhood. This spillover in crime could happen even if the people in the high population neighbourhood were less likely than average to commit crimes. So at the very least, this should serve as a reminder that our intuitions, even when based on data, need to be carefully scrutinised, since we can be fairly easily misled.

This spillover effect can also influence how we understand and attempt to explain patterns of crime. I used data from the City of Edmonton to estimate the impact of the spillover on the risk of assault. The interesting result is that neighbouring populations do seem to impact the risk of assault; the model I used (with 2016 crime and population data) suggests that for every 10,000 more people living in the regions surrounding your neighbourhood, the risk of assault is multiplied by a factor of between 1.20 and 2.24. To put this into perspective, the average person’s baseline annual risk of assault is 7 per 1000 (or a 0.7% chance of being assaulted per year). If you lived in a neighbourhood surrounded by a population of 20,000 people, then your risk of being assaulted would be between 1% and 3.5% per year.
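The 20,000-person figures follow if the 1.20 and 2.24 values are read as multiplicative risk ratios that compound for each additional 10,000 neighbours. That reading is an assumption in this sketch (it is the one that reproduces the numbers above); the function name is just for illustration.

```python
baseline_risk = 0.007                 # 7 per 1000 per year, from the text
ratio_low, ratio_high = 1.20, 2.24    # per additional 10,000 neighbouring residents


def risk_with_neighbours(baseline, ratio_per_10k, neighbouring_population):
    """Scale the baseline risk by one ratio per 10,000 neighbouring residents."""
    return baseline * ratio_per_10k ** (neighbouring_population / 10_000)


low = risk_with_neighbours(baseline_risk, ratio_low, 20_000)
high = risk_with_neighbours(baseline_risk, ratio_high, 20_000)
print(f"Annual risk with 20,000 neighbours: {low:.1%} to {high:.1%}")
```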

If you’d like to see how I did the analysis, follow this link to GitHub.


This population spillover effect is consequential for a few reasons. First, it may influence how people perceive their neighbours, but in ways that are almost certainly not helpful to social cohesion and sense of community. Living next to an apartment building may slightly increase your risk of being a victim of some crimes, but this observation says nothing specifically about apartment dwellers’ dispositions to commit crime. It could be that apartment dwellers as individuals are even slightly less likely to commit crime, and yet could still be (as a geographical group) responsible for an increase in the rate of crime in nearby communities. Come to think of it, this is an interesting example of the ecological fallacy!

My preliminary analysis also suggests that incorporating the effect of this spillover into our analyses could improve predictions of crime, although this is likely to vary by crime type.

Finally, the concept may be useful for indirectly estimating the range of criminal activity; it is possible that the presence or absence of crime spillover may indicate how far people travel to commit crime. If there is no spillover effect, then crime may occur closer to home. If the spillover effect is strong, then it could suggest that offenders travel to commit crime. I modelled the spillover effect for assaults as well as thefts from inside vehicles, and the latter showed no spillover effect. This could tell us something about the travel behaviour of offenders for these two types of crimes.