Monthly Archives: November 2017

Geography of crime: population spillover problem in crime mapping

The problem

Consider the following scenario:

You live in a small, low-population suburb.  A complex of apartments is built in the neighbouring area.  The population density in the complex is very high, unlike the area in which you live.  Over time, you observe the frequency of petty crime increase in your neighbourhood.  You and your neighbours reason that this is mostly due to the residents of the new apartment complex.  Your complaints about crime germinate into hostile feelings toward your apartment-dwelling neighbours, and you start to form beliefs about them as people.  More specifically, since a disproportionate number of them also have a different cultural and ethnic background than people living in your neighbourhood, you start to believe that people with this background are more likely to commit crime.

In this scenario, let us assume that the following statements are objectively true:

  1. The number of crimes in the general area has gone up because of the people in the neighbouring apartment complex
  2. Risk of being a victim of crime has gone up in your neighbourhood because of the people in the neighbouring apartment complex

In spite of the above statements, it does not follow that people in the apartment complex are more likely to commit crime than the people living in the low housing density neighbourhood nearby, or indeed, that there is any difference in anyone’s predilection to commit crime in the area generally.  If the reason for this isn’t obvious, I’ll elaborate on why this is the case below.  However, for the moment, it’s easy to see how a person could connect an increase in the frequency of crimes with a predilection for committing crime based on a fairly reasonable and natural assessment of the facts.  The problem is that the frequency and rate of crime do not tell us about the people who commit crime; a geographical cluster of high crime rate can occur even if there is no geographic pattern in the likelihood that people will commit crime.

What is going on?

If all people committed crime in the neighbourhoods in which they lived, and crime rates were calculated only in neighbourhoods, then a neighbourhood’s crime rate could be a pretty good indicator of crime risk as well as the disposition to commit crime.  However, people do not restrict themselves to neighbourhood boundaries; offenders go where the opportunities present themselves, and for some types of crime, that could mean travelling some distance away from home.  There is empirical and theoretical research on how offenders travel to crime which varies by crime type, age and other factors.  For types of crime that are committed outside the home, where crime happens may tell us very little about where the offender is from.

The effect here is a ‘spillover’ in crime from higher population areas to lower population areas that can cause the apparent risk of crime in low population areas to be high even if the disposition to commit crime and the suitable targets for committing crime are geographically constant and non varying (‘spatially homogeneous’).  There has been considerable research [1,2,3,4]  into the spillover effect of public housing on crime in the US (most of which has found little relationship between public housing and crime in neighbouring areas) but I am not sure if anyone has analysed this population spillover effect specifically.

This diagram illustrates how this spillover can occur, and the effect on crime rate.

The two black squares are neighbourhoods: one with a small population (100 people) and one with a large population (1000 people).  The curved black lines are the ‘trips’ that an offender takes from place of residence to where they commit the criminal offence.  The green circles are the residences, and the red squares are the offence locations.  In this example, the left hand square (a low population neighbourhood) has the same proportion of criminals as the right hand square (high population neighbourhood).  But because of the spillover effect, the crime rate is much higher in the low population area.
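The arithmetic behind the diagram can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: the offender share (5%), the number of crimes per offender (1) and the share of crimes committed in the other neighbourhood (30%) are made-up values, not figures from the diagram or from any data.

```python
def crime_rates(pop_a, pop_b, offender_share, crimes_per_offender, travel_share):
    """Return crimes per 1,000 residents in neighbourhoods A and B.

    Both neighbourhoods have the same offender_share (same disposition to
    commit crime).  travel_share is the fraction of each offender's crimes
    committed in the *other* neighbourhood; the rest happen at home.
    All parameter values used below are illustrative assumptions.
    """
    crimes_from_a = pop_a * offender_share * crimes_per_offender
    crimes_from_b = pop_b * offender_share * crimes_per_offender

    # Crimes that occur in each neighbourhood: locals who stay home
    # plus visitors who travel in from next door.
    crimes_in_a = crimes_from_a * (1 - travel_share) + crimes_from_b * travel_share
    crimes_in_b = crimes_from_b * (1 - travel_share) + crimes_from_a * travel_share

    return 1000 * crimes_in_a / pop_a, 1000 * crimes_in_b / pop_b


# Small neighbourhood (100 people) next to a large one (1000 people).
rate_small, rate_large = crime_rates(100, 1000, 0.05, 1, 0.3)
print(f"small: {rate_small:.1f} per 1,000, large: {rate_large:.1f} per 1,000")

# With no travel at all, the two rates are identical.
no_travel_small, no_travel_large = crime_rates(100, 1000, 0.05, 1, 0.0)
```

Under these assumptions the small neighbourhood ends up with roughly five times the crime rate of the large one, even though residents of both are equally likely to offend.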

Why does this matter?

One of the consequences of this spillover is that it can lead to an inferential fallacy; the fact that crime rates are higher in the low population neighbourhood as a result of the high population neighbourhood could lead to incorrect generalisation about the individuals in the high population neighbourhood.  This spillover in crime could happen even if the people in the high population neighbourhood were less likely than average to commit crimes.  So at the very least, this should serve as a reminder that our intuitions, even when based on data, need to be carefully scrutinised, since we can be fairly easily misled.

This spillover effect can also influence how we understand and attempt to explain patterns of crime.  I used data from the City of Edmonton to estimate the impact of the spillover on the risk of assault.  The interesting result is that neighbouring populations do seem to impact the risk of assault; the model I used (using 2016 crime and population data) suggests that for every 10,000 more people living in the regions surrounding your neighbourhood, the risk of assault is multiplied by a factor of between 1.20 and 2.24.  To put this into perspective, the average person’s baseline annual risk of assault is 7 per 1000 (or 0.7% chance of being assaulted per year).  If you lived in a neighbourhood surrounded by a population of 20,000 people, then your risk of being assaulted is between 1% and 3.5% per year.
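The 1% to 3.5% range can be reproduced with a short calculation.  The sketch below assumes that the 1.20 and 2.24 figures are multiplicative rate ratios per 10,000 surrounding residents, and that they compound for each additional 10,000; this reading is an assumption, but it matches the numbers quoted above.

```python
BASELINE_RISK = 0.007  # 7 per 1000 per year, from the text


def risk_with_neighbours(surrounding_pop, rate_ratio, baseline=BASELINE_RISK):
    """Annual assault risk given the surrounding population.

    Assumes rate_ratio applies multiplicatively for every 10,000 people
    living in the surrounding regions (an interpretation, not a quoted model).
    """
    return baseline * rate_ratio ** (surrounding_pop / 10_000)


low = risk_with_neighbours(20_000, 1.20)   # lower bound of the estimate
high = risk_with_neighbours(20_000, 2.24)  # upper bound of the estimate
print(f"{low:.1%} to {high:.1%} per year")
```

With 20,000 surrounding residents, the ratio is applied twice: 0.7% × 1.20² is about 1.0%, and 0.7% × 2.24² is about 3.5%.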

If you’d like to see how I did the analysis, follow this link to GitHub.


This population spillover effect is consequential for several reasons.  First, it may influence how people perceive their neighbours, in ways that are almost certainly not helpful to social cohesion and sense of community.  Living next to an apartment building may slightly increase your risk of being a victim of some crimes, but this observation says nothing specifically about apartment dwellers’ dispositions to commit crime.  It could be that apartment dwellers as individuals are even slightly less likely to commit crime, and yet, could still be (as a geographical group) responsible for an increase in the rate of crime in nearby communities.  Come to think of it, this is an interesting example of the ecological fallacy!

My preliminary analysis also suggests that incorporating the effect of this spillover into our analyses could improve predictions of crime, although this is likely to vary by crime type.

Finally, the concept may be useful for indirectly estimating the range of criminal activity; it is possible that the presence or absence of crime spillover may indicate how far people travel to commit crime.  If there is no spillover effect, then crime may occur closer to home.  If the spillover effect is strong, then it could suggest that offenders travel to commit crime.  I modelled the spillover effect for assaults as well as thefts from inside vehicles, and the latter showed no spillover effect.  This could tell us something about the travel behaviour of offenders for these two types of crimes.

On names and citations: part I

Have you ever wondered if the name of a researcher has a systematic influence on the citations of their work?  Me neither, until recently, when a student told me that he read somewhere that people with surnames starting with letters near the end of the alphabet are cited less than people with names starting with letters near the beginning of the alphabet.  I did a quick look for literature on this subject, but didn’t find anything.  So naturally, I set out to answer the question myself.  Specifically:

Does having a long name that starts with a letter late in the alphabet influence the number of citations a researcher’s work receives?


PubMed is a search engine for medical research; if you go to the PubMed site, you can search for articles (and retrieve abstracts) on any medical subject that you are interested in.  It has an API (application programming interface) that allows one to search and retrieve cited material within the PubMed system programmatically.  This allows for fast access and searching on a large scale that can’t be done by searching manually.  I used the R library rentrez to access this API and gather PubMed data to answer my research question.  I didn’t see it wise (or possible) to try to pull all the PubMed data offline, so I took a random sample of articles based on the PMID that PubMed uses to uniquely identify articles.  I did this by generating 30000 random numbers between 1000000 and 9999999, and pulling down information on all PubMed articles with PMIDs in this list of numbers.  Some randomly generated PMIDs did not have a corresponding PubMed record, but most in this interval did, and this left me with about 28000 articles.  Based on the PMID numbers I generated, publication years of the retrieved articles were between 1963 and 1999.  The list of articles excludes all literature that is not indexed in PubMed, so it’s not a perfectly representative sample of all academic research, though it is pretty representative of health/medical research over the time period.
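The sampling step can be sketched as follows.  This is a Python illustration rather than the R code used for the analysis; the function name and the seed are hypothetical, and drawing without replacement (so that no PMID appears twice) is an assumption about how the random numbers were generated.  Retrieving the records themselves requires calls to the PubMed API and is not shown.

```python
import random


def sample_pmids(n=30_000, low=1_000_000, high=9_999_999, seed=2017):
    """Draw n distinct candidate PMIDs uniformly from [low, high].

    The seed is arbitrary and only makes the draw repeatable.
    Not every candidate will correspond to a real PubMed record.
    """
    rng = random.Random(seed)
    return rng.sample(range(low, high + 1), n)


candidate_pmids = sample_pmids()
print(len(candidate_pmids), min(candidate_pmids), max(candidate_pmids))
```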

For each article, I summed the number of PubMed citations of that article by using the API’s link function.  This is restricted to links within PubMed sources, so it is an undercount of total citations.  I also identified the number of authors per article, the surname of the first author, the length of the first author’s surname and the first letter of the first author’s surname.

I then found a data file from the 2010 US census which has estimates of ethnicity for all the surnames of about 90% of the US population.  These data include self-identified ethnic background (White, Black, Asian, Hispanic and other), and are represented as proportions; a surname with 1.00 white would mean that 100% of the US population with that surname identify as white.  I linked this list of surnames to the surnames in the PubMed data.  I dropped all records that did not link to the US census.  The final data set has 19219 records in it.
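The feature extraction and census linkage described above might look something like the sketch below.  Everything here is hypothetical: the 'Surname Initials' author format, the function names, the dictionary layout, and above all the proportions in `census_lookup`, which are placeholder values and not census figures.

```python
def first_author_features(authors):
    """Derive name features from a list of 'Surname Initials' strings."""
    # Everything before the last space is treated as the surname.
    surname = authors[0].rsplit(" ", 1)[0]
    return {
        "surname": surname,
        "surname_length": len(surname),
        "first_letter": surname[0].upper(),
        "n_authors": len(authors),
    }


def link_to_census(records, census):
    """Keep only records whose surname appears in the census lookup."""
    linked = []
    for rec in records:
        key = rec["surname"].upper()
        if key in census:
            linked.append({**rec, "prop_white": census[key]})
    return linked


# Placeholder proportions for illustration only (NOT real census values).
census_lookup = {"SMITH": 0.70, "LEE": 0.35}

articles = [
    ["Smith J", "Lee K"],
    ["Yiannakoulias N"],  # not in the placeholder lookup, so dropped
]
features = [first_author_features(a) for a in articles]
linked = link_to_census(features, census_lookup)
print(linked)
```

Dropping unlinked records is what shrinks the sample, and it is also why researchers with rare surnames fall out of the analysis.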

To analyze these data I created a model that predicts the number of citations as a function of surname length, first letter, proportion white ethnicity, number of authors, year and year squared.  I used negative binomial regression because the dependent variable is a discrete count, and over-dispersion seemed likely (the variance of the dependent variable is larger than the mean of the dependent variable).
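A quick way to see whether over-dispersion is a concern is to compare the variance of the counts with their mean; under a Poisson model the two should be roughly equal.  The sketch below uses made-up citation counts purely for illustration, and the helper name is hypothetical.

```python
from statistics import mean, variance


def dispersion_ratio(counts):
    """Sample variance divided by the mean; values well above 1 suggest
    over-dispersion relative to a Poisson model."""
    return variance(counts) / mean(counts)


# Made-up citation counts: many low values and a few heavily cited papers.
citations = [0, 0, 1, 1, 2, 3, 5, 8, 21, 60]
ratio = dispersion_ratio(citations)
print(f"variance/mean = {ratio:.1f}")
```

A ratio far above 1, as with skewed counts like these, is the usual motivation for negative binomial regression over Poisson regression.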


The result of the main model is here:

What does this table suggest?  Well, first, the impact of first letter seems pretty small; however, long names seem to be a liability when it comes to citations per article.  Further, the number of authors has a considerable impact on citations (more authors means more citations) and having a name that is typically associated with white self-identification is also associated with more citations.

To help contextualize these results, I considered two scenarios related to my name.  Scenario 1 is the ‘real world’ in which I have the ridiculously long surname Yiannakoulias I was assigned at birth.  Scenario 2 assumes that I took my mother’s surname (which is short, and starts nearer to the front of the alphabet).  What is the difference between these scenarios if we were to estimate citations of a paper published in 1999 with 5 co-authors?

Scenario #1 (real-world) predicted citations: 7.07

Scenario #2 (mother’s surname): 9.34

Aha!  My surname is a liability!  Or is it?  I ran the same model again, but this time adding an interaction term between length of name and self-identified ethnic status.  My intuition is that a long name that is ‘white’ is different from a long non-white/less-white name.  The results of this updated model are here:

I also have model predictions of citations per paper based on the same two scenarios:

Scenario #1 (real-world) predicted citations: 8.39

Scenario #2 (mother’s surname): 9.16

This result suggests the whiteness of my name offsets most of the disadvantage of having a long name, and the letter ‘Y’ doesn’t really have much of an effect.  This answers the question I set out to answer from the beginning.


There are a number of potential weaknesses to this analysis; the main one is the link to the list of surnames.  People with very uncommon surnames are excluded from my analysis, which could definitely introduce a bias in the findings–unusual surnames may be cited less by virtue of their infrequency.

Nevertheless, I think I can probably draw some tentative conclusions here; as much as I’d like to chalk up my scholarly mediocrity to the misfortune of having a cumbersome surname, I can’t.  My name is not much of a liability to my academic career.  More generally, having a surname that starts with a letter near the end of the alphabet doesn’t really matter when it comes to PubMed citations, and even long names don’t matter much, provided they are ‘white’ names.  This somewhat alarming ethnic bias requires some more exploration, which I will look into in part II of this analysis in the upcoming weeks…