
Correlation between homicide and suicide rates

I used some WHO and World Bank data to tabulate suicide and homicide rates for countries around the world.  I did this because I was wondering if the legal, social and other structural factors that influence homicide also influence suicide.  Some factors–like poverty and social inequality–may jointly contribute to more homicide and more suicide.  Other factors–like education–may contribute to less homicide and less suicide.  Yet other factors (like the age structure of the population) may result in more homicide and less suicide, or the opposite.  Looking at the correlation between the rates might indicate if and where such processes exist.


This is a simple ecological correlation study at the international level.  I do not mean to imply a causal association between homicide and suicide; rather, a correlation between these measures could suggest the existence of some common factors that influence both.  A large number of countries are missing from this analysis because of incomplete data, or because the data linkage would have required some manual tinkering with country names, but I came up with data for 94 countries for analysis.  Here is the R code, here are the homicide and suicide data, and here are the population data.
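The linked R code does the real work; for readers who want the gist of the linkage-and-correlation step, here is a minimal, self-contained Python sketch.  The country names and counts below are invented for illustration–the actual analysis merged WHO death counts with World Bank population figures for 94 countries.

```python
import math

# Invented counts and populations for three hypothetical countries
# (illustration only; the real analysis used WHO and World Bank data).
data = {
    "Country A": {"homicides": 500, "suicides": 300, "population": 5_000_000},
    "Country B": {"homicides": 50,  "suicides": 900, "population": 8_000_000},
    "Country C": {"homicides": 200, "suicides": 400, "population": 10_000_000},
}

def rate_per_100k(count, population):
    """Crude death rate per 100,000 population."""
    return count / population * 100_000

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Correlate the logs of the two rates, as in the analysis above
log_homicide = [math.log(rate_per_100k(d["homicides"], d["population"]))
                for d in data.values()]
log_suicide = [math.log(rate_per_100k(d["suicides"], d["population"]))
               for d in data.values()]
r = pearson(log_homicide, log_suicide)
```

With real data, the only extra step is merging the mortality and population tables by country name–which, as noted above, is where many countries fell out of the analysis.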


There is a moderate correlation (r = 0.35) between the log of suicide and the log of homicide rates.  More interesting is the graphical representation of this relationship:


There are three visible clusters here.  First is the high homicide–high suicide group, which is mostly countries in Central and South America (orange circle).  Then there is the lower homicide, higher suicide group (blue circle), which is made up of wealthier European countries.  The remaining cluster (green circle) is more dispersed, but seems to show a more clearly proportional relationship between suicide and homicide.
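The circled clusters in the figure are identified by eye.  A programmatic alternative (my own suggestion, not part of the original analysis) would be to run a simple clustering algorithm such as k-means on the pairs of log rates.  A rough, dependency-free Python sketch:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means on 2-D points, e.g. (log homicide, log suicide) pairs.

    Returns (centroids, labels), where labels[i] is the cluster of points[i].
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from the data itself
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid
        labels = [
            min(range(k),
                key=lambda j: (p[0] - centroids[j][0]) ** 2
                            + (p[1] - centroids[j][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its members
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append((sum(p[0] for p in members) / len(members),
                                      sum(p[1] for p in members) / len(members)))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster put
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, labels
```

Whether k-means would reproduce the three circles is an open question–the green cluster in particular is dispersed, and k-means assumes roughly round, similar-sized clusters.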


What’s the takeaway message here?  Homicide and suicide are both the result of human decisions, but there is a clear structural process at work.  This process is very likely complex and multidimensional, but suggests that legal, cultural, social and economic forces influence these decisions at a fundamental level.  Wealthier western countries seem to have more structural controls over homicide, and fewer controls over suicide.  Less homicide could be due to the prohibition of guns or the efficacy of law enforcement, neither of which would be as effective at preventing suicide.  Central American countries have fewer structural controls over homicide, but more structural controls over suicide–the former is probably the result of the social dysfunction associated with the drug trade, and the latter could be influenced by religion.  For the remaining countries of the world the relationship seems more complex, but does suggest that the structural controls over homicide and suicide might be similar.


Some of these data are not entirely trustworthy.  The homicide rate in Honduras is too low in these data, possibly due to under-reporting in health data; I defined homicide from the WHO database using ICD-10 codes, and Honduras may not have a vital statistics system that codes homicide properly.  Or there could be a mistake somewhere.  In spite of this, overall, the data look reasonable.  In the future, it would be interesting to see how this pattern emerges over time.



The dubious art of hockey analytics

The last five to ten years have seen a new wave of analytical[1] enthusiasm.  The enthusiasm is not specific to any one field, but reflects a renaissance in methodology–specifically, using data (often open source) to answer questions and solve problems outside the traditional research framework.  These ‘data scientists’ have made breakthroughs into the professional ranks, and even into popular culture, in the last few years.  They have excelled at making data seem interesting and useful in new and engaging ways, and usually share their work on blogs, online news sites and social media.  Moreover, they seem to be affecting how analysis is done; most new data scientists use open source tools like Python, R and QGIS, and by doing so are helping to establish the viability of open source alternatives that threaten the long-term success of commercial analytical platforms.

The trouble with analysis

One of the problems with the recent analytical turn is the bubble of enthusiasm surrounding it.  New technologies are often accompanied by a ‘wow’ factor that multiplies their apparent value beyond their real value.  This happened with the tech bubble of the late 1990s–enthusiasm for all things ‘tech’ encouraged many investors (in both the stock market and venture capital) to put money into companies with few credentials and little potential for success.  Some investors did not look carefully into the profitability or viability of these ventures, which would have taken time and effort, and instead gambled on the excitement of the time.  The race to capitalize on ‘tech’ led to many bad decisions, and before too long it all came crashing down to earth, and many fortunes were lost.

It’s hard to know what precisely motivates bubble behaviour around new technology, but my intuition is that it is sometimes caused by a disconnect between the people who make decisions and the people who develop and use technology.  Not all decision makers are astute at judging new technology, and many fear missing out on the useful revolutions in their field.  These decision makers are averse to the risks of missing out.  So when a young enthusiastic geek presents them with the bells and whistles of a new technology, some decision makers are vulnerable to over-estimating the value and positive impact of that technology, and underestimating the costs.

My view is that we are in the midst of a ‘data analysis’ bubble, where some institutions (media, business, government and journalism) are excitedly hacking data without also asking the critical question of whether these analyses actually create value.  The example that is most compelling to me is the field of sport analytics, and in particular, the contrast between analytics in baseball and hockey.

Sport analytics

Sports journalism and news have seen a strong analytical turn in recent years, though the field of sports analytics isn’t new.  In baseball, Sabermetrics–the objective quantitative analysis of players–has been around for at least 30 years, perhaps longer.  The book and movie Moneyball brought the concept into the popular mind; the idea is to use baseball statistics to build the strongest team possible for a given outlay of salary.  Teams that play ‘moneyball’ target players who are undervalued–with small contracts, but who can help a team win games.

The alternative approach is to use a less formal mix of simple player statistics (like batting average, runs batted in and earned run average) combined with gut feeling.  This strategy assumes that general managers have some relatively unique intuition about what makes a player good or bad, and that the best of them can spot diamonds in the rough without relying on fancy numerical analysis.

It’s hard to know which strategy (or mix of strategies) is superior, but the nature of baseball lends itself to numerical analysis.  This is because individual players contribute to a team’s success largely independent of other players on their team.  An outstanding pitcher can dominate almost entirely irrespective of the players in the field behind him.  A power hitter hits home runs by himself–he is not particularly dependent on the players batting around him.  So in baseball, assembling a great team is largely an exercise in recruiting the best (and/or most cost effective) individuals in their positions–ensuring that there is enough individual talent to win at a price that is deemed acceptable by the team owner.

Hockey analytics

Hockey analytics is a general term frequently used to describe the systematic study of hockey player data to inform decisions made by coaches and managers in the sport.  There are a few different analytical approaches, but the basic idea has been to go beyond blunt indicators (like goals scored by a skater and saves made by a goaltender) to more sophisticated indicators (like Fenwick and Corsi) that break down an individual player’s performance.
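For readers unfamiliar with these indicators: Corsi counts all shot attempts (shots on goal, missed shots and blocked shots) by a player’s team minus those against, while that player is on the ice; Fenwick is the same tally excluding blocked shots.  A minimal Python sketch–the field names are my own invention, not those of any particular data feed:

```python
from dataclasses import dataclass

@dataclass
class OnIceTotals:
    """On-ice shot-attempt counts for one skater (invented field names)."""
    shots_for: int        # shots on goal by the player's team
    missed_for: int       # attempts by the player's team that missed the net
    blocked_for: int      # attempts by the player's team that were blocked
    shots_against: int
    missed_against: int
    blocked_against: int

def corsi(t: OnIceTotals) -> int:
    """Corsi: all shot attempts for minus all shot attempts against."""
    attempts_for = t.shots_for + t.missed_for + t.blocked_for
    attempts_against = t.shots_against + t.missed_against + t.blocked_against
    return attempts_for - attempts_against

def fenwick(t: OnIceTotals) -> int:
    """Fenwick: like Corsi, but blocked shots are excluded."""
    return (t.shots_for + t.missed_for) - (t.shots_against + t.missed_against)
```

Both are proxies for puck possession while the player is on the ice–which is exactly where the dependence problem discussed below comes in, since every teammate on the ice shares the same attempts.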

The challenge with the analytical approach in hockey is that hockey teams are a more collaborative enterprise than baseball teams.  Players are more dependent on each other in their contributions to the game; players pass to each other, fill in for each other when out of position, tip shots, screen shots, block shots and do all sorts of things that influence the successes and failures of their teammates and the team as a whole.  Dependence is a conceptual (and statistical) problem because it makes it hard to know whether a player has favourable statistics because he’s actually good, or because he’s playing on a team that raises his level of play.

For the purpose of illustration, assume that in baseball 90% of a player’s statistical performance is self-determined, and 10% is based on teammates.  A general manager can make a pretty clear determination of this player’s value independent of his teammates because he knows that 90% of the player’s statistical qualities are due entirely to himself.  In hockey, the self-determined part of a player’s performance is lower than that of a baseball player, probably a lot lower.  For the sake of illustration, let’s say it’s 60%–this means 40% of a player’s performance is team determined (dependent).  In hockey, a general manager’s assessment of a player’s fundamental value is almost certainly not going to be as precise as it is in baseball.  This is because the player’s performance is more influenced by the mix of other players on the team.
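The 90/10 and 60/40 splits above are assumptions for the sake of argument, but the effect they describe is easy to simulate: if a measured statistic is a weighted mix of a player’s own skill and a team effect, the correlation between the statistic and true skill falls as the team share grows.  A toy Python sketch of that model (the weights and sample size are arbitrary):

```python
import math
import random

def skill_stat_correlation(self_weight, n_players=5000, seed=1):
    """Correlation between true skill and a measured statistic when
    stat = w * skill + (1 - w) * team_effect, with independent normals."""
    rng = random.Random(seed)
    skill = [rng.gauss(0, 1) for _ in range(n_players)]
    team = [rng.gauss(0, 1) for _ in range(n_players)]
    stat = [self_weight * s + (1 - self_weight) * t
            for s, t in zip(skill, team)]
    # Pearson correlation between skill and stat
    n = n_players
    ms, mt = sum(skill) / n, sum(stat) / n
    cov = sum((a - ms) * (b - mt) for a, b in zip(skill, stat))
    var_skill = sum((a - ms) ** 2 for a in skill)
    var_stat = sum((b - mt) ** 2 for b in stat)
    return cov / math.sqrt(var_skill * var_stat)

# A 90% self-determined stat tracks skill more closely than a 60% one
r_baseball_like = skill_stat_correlation(0.9)
r_hockey_like = skill_stat_correlation(0.6)
```

Analytically, the correlation in this toy model is w / √(w² + (1 − w)²): roughly 0.99 at a 90% self-weight but only about 0.83 at 60%–a general manager reading the hockey-like statistic simply learns less about the player himself.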

The fact that hockey analytics is likely to be less useful than baseball analytics is not evidence that hockey analytics doesn’t have value.  However, it suggests that claims based on hockey analytics require a higher standard of proof; when general managers decide on the approach to analysis, they have to be aware of the limitations of hockey analytics, and not get too caught up in the ‘wow’ that analytics offers.  Nor should they believe that hockey analytics works because baseball analytics works.  Given the difficulty (if not impossibility) of isolating the independent components of hockey player performance, hockey analytics will never supply the level of useful information that is available in baseball analytics.

Pseudo-analysis and data science

The increased availability of data and the tools to use them offer great potential for political decision makers, businesses and curious citizens.  However, data analysis is tricky, often not very intuitive, and requires careful thought to use and interpret.  Statistics do not lie, but people can easily lie with them–or, more often, simply get them wrong.  The emergence of data science needs to be rigorously scrutinized if it is going to inform decisions; our world is already filled with false experts, and we don’t need any more of them.

Are the emperors of hockey analytics naked?  At this moment, I have no reason to think hockey analysts of any type are actually experts–I have not seen any evidence that a team that employs any particular analytical strategy is likely to do better in the standings, win more playoff games or win more Stanley Cups than a team that does analysis the traditional way.  If data exist that say otherwise, I’d love to see them.  Until then, the ‘analytics turn’ of the NHL is a curiosity to me, and I’ll be interested in seeing how it plays out.  I also suspect that many owners are, or soon will be, asking the same question, if they haven’t started already.

[1] My focus here is on analysis with numbers.  Non-numeric analysis is an entirely different animal.