Geography of crime: population spillover problem in crime mapping

The problem

Consider the following scenario:

You live in a small, low-population suburb.  A complex of apartments is built in the neighbouring area.  The population density in the complex is very high, unlike the area in which you live.  Over time, you observe the frequency of petty crime increase in your neighbourhood.  You and your neighbours reason that this is mostly due to the residents of the new apartment complex.  Your complaints about crime germinate into hostile feelings toward your apartment-dwelling neighbours, and you start to form beliefs about them as people.  More specifically, since a disproportionate number of them also have a different cultural and ethnic background than the people living in your neighbourhood, you start to believe that people with this background are more likely to commit crime.

In this scenario, let us assume that the following statements are objectively true:

  1. The number of crimes in the general area has gone up because of the people in the neighbouring apartment complex
  2. Risk of being a victim of crime has gone up in your neighbourhood because of the people in the neighbouring apartment complex

In spite of the above statements, it does not follow that people in the apartment complex are more likely to commit crime than the people living in the low housing density neighbourhood nearby, or indeed, that there is any difference in anyone’s predilection to commit crime in the area generally.  If the reason isn’t obvious, I’ll elaborate below.  For the moment, note how easily a person could connect an increase in the frequency of crime with a predilection for committing crime based on a fairly reasonable and natural assessment of the facts.  The problem is that the frequency and rate of crime do not tell us about the people who commit crime; a geographical cluster of high crime rates can occur even if there is no geographic pattern in the likelihood that people will commit crime.

What is going on?

If all people committed crime in the neighbourhoods in which they lived, and crime rates were calculated only in neighbourhoods, then a neighbourhood’s crime rate could be a pretty good indicator of crime risk as well as the disposition to commit crime.  However, people do not restrict themselves to neighbourhood boundaries; offenders go where the opportunities present themselves, and for some types of crime, that could mean travelling some distance away from home.  There is empirical and theoretical research on how offenders travel to crime which varies by crime type, age and other factors.  For types of crime that are committed outside the home, where crime happens may tell us very little about where the offender is from.

The effect here is a ‘spillover’ in crime from higher population areas to lower population areas that can make the apparent risk of crime in low population areas high even if the disposition to commit crime and the suitable targets for crime are geographically constant (‘spatially homogeneous’).  There has been considerable research [1,2,3,4] into the spillover effect of public housing on crime in the US (most of which has found little relationship between public housing and crime in neighbouring areas), but I am not sure if anyone has analysed this population spillover effect specifically.

This diagram illustrates how this spillover can occur, and the effect on crime rate.

The two black squares are neighbourhoods: one with a small population (100 people), the other with a large population (1000 people).  The curved black lines are the ‘trips’ that an offender takes from their place of residence to where they commit the criminal offence.  The green circles are the residences, and the red squares are the offence locations.  In this example, the left-hand square (the low population neighbourhood) has the same proportion of criminals as the right-hand square (the high population neighbourhood).  But because of the spillover effect, the crime rate is much higher in the low population area.
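The logic of the diagram can be sketched as a back-of-envelope calculation.  The two populations match the diagram; the 1% offender rate and the 30% spill fraction are illustrative assumptions, not figures from any data:

```python
# Two neighbourhoods with the SAME proportion of offenders; some offences
# 'spill over' into the other neighbourhood.  All rates here are invented
# for illustration.
pop_low, pop_high = 100, 1000       # neighbourhood populations (as in the diagram)
offender_rate = 0.01                # assumed: 1% of residents offend in BOTH areas
spill = 0.3                         # assumed: 30% of offences occur in the other area

offenders_low = pop_low * offender_rate     # 1 offender
offenders_high = pop_high * offender_rate   # 10 offenders

# Offences occurring in each neighbourhood = offences kept 'at home'
# plus offences spilling over from the other neighbourhood.
crimes_in_low = offenders_low * (1 - spill) + offenders_high * spill
crimes_in_high = offenders_high * (1 - spill) + offenders_low * spill

rate_low = crimes_in_low / pop_low     # ~0.037 offences per resident
rate_high = crimes_in_high / pop_high  # ~0.0073 offences per resident
```

Even though both neighbourhoods have identical offender proportions, the low population area ends up with roughly five times the crime rate, purely because of the asymmetry in the size of the spilling populations.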

Why does this matter?

One of the consequences of this spillover is that it can lead to an inferential fallacy; the fact that crime rates are higher in the low population neighbourhood as a result of the high population neighbourhood could lead to incorrect generalisations about the individuals in the high population neighbourhood.  This spillover in crime could happen even if the people in the high population neighbourhood were less likely than average to commit crimes.  So at the very least, this should serve as a reminder that our intuitions–even when based on data–need to be carefully scrutinised, since we can be fairly easily misled.

This spillover effect can also influence how we understand and attempt to explain patterns of crime.  I used data from the City of Edmonton to estimate the impact of the spillover on the risk of assault.  The interesting result is that neighbouring populations do seem to impact the risk of assault; the model I used (using 2016 crime and population data) suggests that for every 10,000 additional people living in the regions surrounding your neighbourhood, the risk of assault increases by a factor of between 1.20 and 2.24.  To put this into perspective, the average person’s baseline annual risk of assault is 7 per 1000 (or a 0.7% chance of being assaulted per year).  If you lived in a neighbourhood surrounded by a population of 20,000 people, your risk of being assaulted would be between 1% and 3.5% per year.
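The 1%–3.5% range follows from the quoted figures, assuming (as is my reading of the model output) that the effect multiplies once per 10,000 surrounding residents:

```python
baseline = 0.007              # baseline annual risk of assault (7 per 1000)
rr_low, rr_high = 1.20, 2.24  # estimated multipliers per 10,000 surrounding residents
surrounding = 20_000          # surrounding population in this scenario

k = surrounding / 10_000      # number of 10,000-person increments (here, 2)
risk_low = baseline * rr_low ** k    # ~0.010, i.e. about 1% per year
risk_high = baseline * rr_high ** k  # ~0.035, i.e. about 3.5% per year
```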

If you’d like to see how I did the analysis, follow this link to GitHub.


This population spillover effect is consequential for two reasons.  First, it may influence how people perceive their neighbours, but in ways that are almost certainly not helpful to social cohesion and sense of community.  Living next to an apartment building may slightly increase your risk of being a victim of some crimes, but this observation says nothing specifically about apartment dwellers’ dispositions to commit crime.  It could be that apartment dwellers as individuals are even slightly less likely to commit crime, and yet, could still be (as a geographical group) responsible for an increase in the rate of crime in nearby communities.  Come to think of it, this is an interesting example of the ecological fallacy!

My preliminary analysis also suggests that incorporating the effect of this spillover into our analyses could improve predictions of crime, although this is likely to vary by crime type.

Finally, the concept may be useful for indirectly estimating the range of criminal activity; the presence or absence of crime spillover could indicate how far people travel to commit crime.  If there is no spillover effect, then crime may occur closer to home.  If the spillover effect is strong, then it could suggest that offenders travel to commit crime.  I modelled the spillover effect for assaults as well as thefts from inside vehicles, and the latter showed no spillover effect.  This could tell us something about the travel behaviour of offenders for these two types of crimes.

On names and citations: part I

Have you ever wondered if the name of a researcher has a systematic influence on the citations of their work?  Me neither, until recently, when a student told me that he read somewhere that people with surnames starting with letters near the end of the alphabet are cited less than people with names starting with letters near the beginning of the alphabet.  I did a quick look for literature on this subject, but didn’t find anything.  So naturally, I set out to answer the question myself.  Specifically:

Does having a long name that starts with a letter late in the alphabet influence the number of citations a researcher’s work receives?


PubMed is a search engine for medical research; if you go to the PubMed site, you can search for articles (and retrieve abstracts) on any medical subject that you are interested in.  It has an API (application programming interface) that allows one to search and retrieve cited material within the PubMed system programmatically.  This allows for fast access and searching on a large scale that can’t be done manually.  I used the R library rentrez to access this API and gather PubMed data to answer my research question.  I didn’t think it wise (or feasible) to try to pull all the PubMed data offline, so I took a random sample of articles based on the PMID, the identifier that PubMed uses to uniquely identify articles.  I did this by generating 30,000 random numbers between 1,000,000 and 9,999,999, and pulling down information on all PubMed articles with PMIDs in this list of numbers.  Some randomly generated PMIDs did not have a corresponding PubMed record, but most in this interval did, which left me with about 28,000 articles.  Based on the PMID numbers I generated, publication years of the retrieved articles were between 1963 and 1999.  The list of articles excludes all literature that is not indexed in PubMed, so it’s not a perfectly representative sample of all academic research, though it is pretty representative of health/medical research over the time period.
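The actual retrieval was done in R with rentrez; the sampling step itself, though, is simple enough to sketch (in Python here, with a seed added for reproducibility–the original sample was not seeded):

```python
import random

random.seed(1)  # for reproducibility; an assumption, not part of the original analysis

# Draw 30,000 distinct candidate PMIDs in the 7-digit range described above.
candidate_pmids = random.sample(range(1_000_000, 10_000_000), k=30_000)

# Each candidate ID would then be queried against PubMed; IDs with no
# matching record are discarded, leaving roughly 28,000 real articles.
```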

For each article, I summed the number of PubMed citations of that article using the API’s link function.  This is restricted to links within PubMed sources, so it is an undercount of total citations.  I also identified the number of authors per article, the surname of the first author, the length of the first author’s surname, and the first letter of the first author’s surname.

I then found a data file from the 2010 US census which has estimates of ethnicity for the surnames of about 90% of the US population.  These data include self-identified ethnic background (White, Black, Asian, Hispanic and other), represented as proportions; a surname with a White proportion of 1.00 would mean that 100% of the US population with that surname identify as White.  I linked this list of surnames to the surnames in the PubMed data, and dropped all records that did not link to the US census.  The final data set has 19,219 records in it.
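The linkage amounts to a join on surname, dropping non-matches.  A toy sketch (all names, PMIDs and proportions below are invented for illustration):

```python
# Census lookup: surname -> proportion of that surname identifying as White.
census_pct_white = {"SMITH": 0.71, "GARCIA": 0.05, "LEE": 0.27}

articles = [
    {"pmid": 1234567, "surname": "SMITH", "citations": 12},
    {"pmid": 2345678, "surname": "GARCIA", "citations": 3},
    {"pmid": 3456789, "surname": "ZZYZX", "citations": 1},  # no census match: dropped
]

# Keep only articles whose first-author surname appears in the census file,
# attaching the census proportion to each retained record.
linked = [
    {**a, "pct_white": census_pct_white[a["surname"]]}
    for a in articles
    if a["surname"] in census_pct_white
]
```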

To analyze these data I created a model that predicts the number of citations as a function of surname length, first letter, proportion white ethnicity, number of authors, year and year squared.  I used negative binomial regression because the dependent variable is a discrete count, and over-dispersion seemed likely (the variance of the dependent variable is larger than the mean of the dependent variable).
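The overdispersion check that motivates negative binomial over Poisson can be sketched like this (the citation counts below are made up for illustration; the real check was on the PubMed data):

```python
from statistics import mean, pvariance

# Citation counts are typically right-skewed: many rarely-cited papers and a
# few heavily-cited ones.  These counts are invented for illustration.
citations = [0, 0, 0, 1, 1, 2, 2, 3, 5, 8, 13, 40]

m = mean(citations)       # 6.25
v = pvariance(citations)  # ~117, far above the mean

# A Poisson model assumes variance == mean; when the variance is much larger
# (overdispersion), negative binomial regression adds a dispersion parameter
# to accommodate the extra spread.
overdispersed = v > m
```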


The result of the main model is here:

What does this table suggest?  Well, first, the impact of the first letter seems pretty small; however, long names seem to be a liability when it comes to citations per article.  Further, the number of authors has a considerable impact on citations (more authors means more citations), and having a name that is typically associated with white self-identification is also associated with more citations.

To help contextualize these results, I considered two scenarios related to my name.  Scenario 1 is the ‘real world’ in which I have the ridiculously long surname Yiannakoulias I was assigned at birth.  Scenario 2 assumes that I took my mother’s surname (which is short, and starts nearer to the front of the alphabet).  What is the difference between these scenarios if we were to estimate citations of a paper published in 1999 with 5 co-authors?

Scenario #1 (real-world) predicted citations: 7.07

Scenario #2 (mother’s surname): 9.34

Aha!  My surname is a liability!  Or is it?  I ran the same model again, but this time adding an interaction term between length of name and self-identified ethnic status.  My intuition is that a long name that is ‘white’ is different from a long non-white/less-white name.  The results of this updated model are here:

I also have model predictions of citations per paper based on the same two scenarios:

Scenario #1 (real-world) predicted citations: 8.39

Scenario #2 (mother’s surname): 9.16

This result suggests that the whiteness of my name offsets most of the disadvantage of having a long name, and that the letter ‘Y’ doesn’t really have much of an effect.  This answers the question I set out to answer at the beginning.


There are a number of potential weaknesses to this analysis; the main one is the link to the list of surnames.  People with very uncommon surnames are excluded from my analysis, which could definitely introduce a bias in the findings–unusual surnames may be cited less by virtue of their infrequency.

Nevertheless, I think I can probably draw some tentative conclusions here; as much as I’d like to chalk up my scholarly mediocrity to the misfortune of having a cumbersome surname, I can’t.  My name is not much of a liability to my academic career.  More generally, having a surname that starts with a letter near the end of the alphabet doesn’t really matter when it comes to PubMed citations, and even long names don’t matter much–provided they are ‘white’ names.  This somewhat alarming ethnic bias requires some more exploration, which I will look into in part II of this analysis in the upcoming weeks…


Publicly funded snow shovelling

In preparation for a class this week, I have done a simple little analysis to explore the economic case for publicly funded snow shovelling.   A number of studies have been done on the impacts of snow shovelling on heart attacks, and in the more general field of physical exertion and heart attacks.  Given the research, it seems reasonable to consider whether or not a public system of snow shovelling could save money.  Below I describe each step in the analysis.

1. Shovelling snow causes heart attacks

According to Auger et al., 2017, a man’s risk of myocardial infarction (MI) is 1.34 times higher the day following a major snowfall (20+ cm of snow) compared to no snowfall.  This is a relative risk; that is, the risk associated with snowfall compared to no snowfall.  In order to assess the impact this has on public health, we need some measure of absolute heart attack risk–like an incidence rate.  It’s tricky to estimate the incidence rate in this case, however.  Choosing the annual incidence rate (around 0.002) is way too high, since this incidence is estimated over the whole year.  Indeed, the probability of heart attack on a day with 20 or more centimetres of snow is probably at least two orders of magnitude lower than the annual incidence rate.  Let’s assume that the baseline incidence is 0.00002–this is roughly the daily risk of MI.  This means that on any day following a 20 cm snowfall, the risk of heart attack is 0.0000268 (0.00002 x 1.34).

2. There are 160,000 detached houses in Hamilton

How many men are exposed to the hazards of snow shovelling?  I base this estimate on the 2011 National Household Survey (NHS), from which I pulled the number of households in Hamilton, and divided it by two.  This assumes that half the time women shovel and half the time men shovel, and that on average, every household has one man in it.  This gives us 80,000 men exposed to shovelling.  This is probably an overestimate, since some people hire snow-clearing companies, and some households just don’t bother shovelling at all.

3. The risk attributable to snowfall is…

Based on step 1, the risk of heart attack on the day following 20 cm of snowfall = 0.0000268.  The risk attributable to shovelling is the difference between the risk of heart attack among the exposed and the baseline risk: 0.0000268 – 0.00002 = 0.0000068 (or about 6.8 per million people).

To find out how many people suffer heart attacks in Hamilton due to a major snowfall event, we simply multiply the attributable risk by the exposed population:

0.0000068 x 80,000 = 0.544

These numbers suggest that once every two years, someone has a heart attack as a result of shovelling snow in Hamilton, assuming there is one 20 cm snow event per year.  One 20 cm snow event per year is probably a bit of an over-estimate, but reasonable enough based on these data.
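The arithmetic in steps 1–3 can be sketched as follows (the baseline daily risk and the exposed population are the assumptions stated above):

```python
baseline_daily = 0.00002   # assumed baseline daily risk of MI (step 1)
rr_snowfall = 1.34         # relative risk the day after a 20+ cm snowfall (Auger et al., 2017)
exposed_men = 80_000       # men exposed to shovelling in Hamilton (step 2)

risk_exposed = baseline_daily * rr_snowfall        # 0.0000268
attributable_risk = risk_exposed - baseline_daily  # ~0.0000068, or 6.8 per million

# Expected MIs per major snowfall event: attributable risk x exposed population.
expected_mis = attributable_risk * exposed_men     # ~0.54, i.e. roughly one every two years
```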

4. How much does a heart attack cost the economy?

A complete assessment should include all costs, including health care, lost productivity, etc.  However, many of these costs are very hard to measure.  In Canada, health care costs are around $30,000 per MI.  However, other losses could be greater, especially in the long run, and especially if we tried to price the value of a life.  Let’s say that each heart attack costs $150,000.

5. Cost-benefit

If we multiply 0.544 x 150,000 we get the expected annual costs of heart attacks due to big snowfall events in Hamilton.  This gives us about $82,000 a year.  Given the large number of households to shovel (160,000) and the cost of shovelling them (even a modest $25 per household per season comes to $4,000,000 a year), there is no economic case for publicly funded snow shovelling, at least when it comes to heart attacks.
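The cost comparison, as a sketch (the $150,000 per MI and $25 per household are the assumed figures from above, and one 20 cm snowfall per year is assumed):

```python
expected_mis_per_year = 0.544  # expected MIs per year, assuming one 20 cm event annually
cost_per_mi = 150_000          # assumed total cost per heart attack ($)
households = 160_000           # detached houses in Hamilton
cost_per_household = 25        # assumed per-season shovelling cost ($)

mi_cost = expected_mis_per_year * cost_per_mi      # ~$82,000 per year in avoidable MI costs
shovelling_cost = households * cost_per_household  # $4,000,000 per year to shovel them all

# Public shovelling would cost roughly 50x the heart attack costs it could avoid.
ratio = shovelling_cost / mi_cost
```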

When put into the context of public health impact for a small city like Hamilton, the results of Auger et al., 2017 are not particularly compelling.  Even if the relative risk estimate is correct, the actual impact on the population in this city is probably pretty small.  For the entire country, the impact is greater; I’d ballpark it at 40 heart attacks as a result of shovelling snow, and maybe 3 or so deaths a year.  Still, given that there are around 50,000 deaths a year (in Canada) as a result of heart disease, it seems that the impact of snowfall on MI is pretty small.

It’s also worth noting that snow shovelling may also have health benefits–like exercise, and sharing time with neighbours.  Since exercise improves health, and possibly saves money, this would make the economic case for publicly funded snow shovelling even weaker than what’s presented here.

Perils of user generated epidemiology

I recently worked with some students on a short analysis of Lyme disease content on YouTube.  I thought the results of the work were worth writing up for publication.  The paper will be published in the journal Social Science & Medicine in the fall, and the online manuscript version is available for download until October 29th.

The findings are pretty intuitive.  There is considerable YouTube content on Lyme disease, but most of it is neither scientific nor focused on infection prevention.  Personal stories and videos about celebrities are popular and are among the most common content available.  Public health agencies and academics produce very little content, and the content they do produce doesn’t receive much viewer interest. This may explain why the public health content that is most available on YouTube is often inconsistent with best practices recommended by experts in the field.

Given the large number of video views on Lyme disease, and the absence of content published by experts, it seems that public health agencies should create more content for YouTube.  This is not simply a scholarly exercise, but a scientific and even moral imperative–many people get information from YouTube and other forms of online user generated media, and need to be provided better information.  To maximize efficacy, these agencies should generate content that combines evidence-based public health information with personal stories about people who have experience living in areas where Lyme disease risk is high.