The last five to 10 years have seen a new wave of analytical[1] enthusiasm. The enthusiasm is not specific to any one field, but it reflects a renaissance in methodology–specifically, using data (often open source) to answer questions and solve problems outside the traditional research framework. These ‘data scientists’ have broken into the professional ranks, and even into popular culture, in the last few years. They have excelled at making data seem interesting and useful in new and engaging ways, and usually share their work on blogs, online news sites and social media. Moreover, they seem to be affecting how analysis is done; most new data scientists are using open source tools like Python, R and QGIS, and by doing so, are helping to establish the viability of open source alternatives that threaten the long-term success of commercial analytical platforms.
The trouble with analysis
One of the problems with the recent analytical turn is the bubble of enthusiasm surrounding it. New technologies are often accompanied by a ‘wow’ factor that multiplies their apparent value beyond their real value. This happened with the tech bubble of the late 1990s–enthusiasm for all things ‘tech’ encouraged many investors (in both the stock market and venture capital) to put money into companies with few credentials and little potential for success. Some investors did not look carefully into the profitability or viability of these ventures, which would have taken time and effort, and instead gambled on the excitement of the time. The race to capitalize on ‘tech’ led to many bad decisions, and before too long it all came crashing down to earth, and many fortunes were lost.
It’s hard to know what precisely motivates bubble behaviour around new technology, but my intuition is that it is sometimes caused by a disconnect between the people who make decisions and the people who develop and use technology. Not all decision makers are astute at judging new technology, and many fear missing out on the useful revolutions in their field. These decision makers are averse to the risk of missing out. So when a young, enthusiastic geek presents them with the bells and whistles of a new technology, some decision makers are vulnerable to overestimating the value and positive impact of that technology, and underestimating the costs.
My view is that we are in the midst of a ‘data analysis’ bubble, where some institutions (media, business, government and journalism) are excitedly hacking data without also asking the critical question of whether these analyses actually create value. The example that is most compelling to me is the field of sports analytics, and in particular, the contrast between analytics in baseball and hockey.
Sports journalism and news have seen a strong analytical turn in recent years, though the field of sports analytics isn’t new. In baseball, Sabermetrics–the objective quantitative analysis of players–has been around for at least 30 years, perhaps longer. The book and movie Moneyball brought the concept into the popular mind; the idea is to use baseball statistics to build the strongest team possible for a given outlay of salary. Teams that play ‘moneyball’ target players who are undervalued–those with small contracts who can nonetheless help a team win games.
The alternative approach is to use a less formal mix of simple player statistics (like batting average, runs batted in and earned run average) combined with gut feeling. This strategy assumes that general managers have a rare intuition about what makes a player good or bad, and that the best of them can spot diamonds in the rough without relying on fancy numerical analysis.
It’s hard to know which strategy (or mix of strategies) is superior, but the nature of baseball lends itself to numerical analysis. This is because individual players contribute to a team’s success largely independently of other players on their team. An outstanding pitcher can dominate almost entirely irrespective of the players in the field behind him. A power hitter hits home runs by himself–he is not particularly dependent on the players batting around him. So in baseball, assembling a great team is largely an exercise in recruiting the best (and/or most cost-effective) individuals in their positions–ensuring that there is enough individual talent to win at a price that is deemed acceptable by the team owner.
Hockey analytics is a general term frequently used to describe the systematic study of hockey player data to inform decisions made by coaches and managers in sport. There are a few different analytical approaches, but the basic idea has been to go beyond the blunt indicators (like goals scored by a skater and saves made by a goaltender) to more sophisticated indicators (like Fenwick and Corsi) to break down an individual player’s performance.
The challenge with the analytical approach in hockey is that hockey teams are a more collaborative enterprise than baseball teams. Players are more dependent on each other in their contributions to the game; players pass to each other, fill in for each other when out of position, tip shots, screen shots, block shots and do all sorts of things that influence the successes and failures of their teammates and the team as a whole. Dependence is a conceptual (and statistical) problem because it makes it hard to know whether a player with favourable statistics has them because he’s actually good, or because he’s playing on a team that raises his level of play.
For the purpose of illustration, assume that in baseball 90% of a player’s statistical performance is self-determined, and 10% is based on teammates. A general manager can make a pretty clear determination of this player’s value independent of his teammates because he knows that 90% of the player’s statistical qualities are due entirely to himself. In hockey, the self-determined part of a player’s performance is lower than that of a baseball player, probably a lot lower. For the sake of illustration, let’s say it’s 60%–this means 40% of a player’s performance is team determined (dependent). In hockey, a general manager’s assessment of a player’s fundamental value is almost certainly not going to be as precise as it is in baseball. This is because the player’s performance is more influenced by the mix of other players on the team.
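To make the illustration concrete, here is a minimal simulation sketch in Python. It rests on assumptions that are mine and not established facts about either sport: I read ‘90% self-determined’ as the share of the variance in a player’s observed statistic that comes from his own ability, I treat ability and the teammate effect as independent standard normal draws, and the names `pearson` and `simulate` are just illustrative helpers. Under those assumptions, the correlation between true ability and the observed statistic should land close to the square root of the self-determined share: about 0.95 at 90% and about 0.77 at 60%.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def simulate(self_share, n_players=20000, seed=0):
    """Correlation between true ability and an observed statistic.

    ASSUMPTION: self_share is the fraction of the observed statistic's
    variance that comes from the player's own ability; the remainder
    comes from an independent teammate effect.
    """
    rng = random.Random(seed)
    ability = [rng.gauss(0, 1) for _ in range(n_players)]
    teammates = [rng.gauss(0, 1) for _ in range(n_players)]
    # Square-root weights make the variance shares equal to
    # self_share and (1 - self_share) respectively.
    observed = [
        math.sqrt(self_share) * a + math.sqrt(1 - self_share) * t
        for a, t in zip(ability, teammates)
    ]
    return pearson(ability, observed)


# Illustrative shares taken from the text, not measured values.
baseball_r = simulate(0.90)
hockey_r = simulate(0.60)

print(f"baseball-like (90% self-determined): r = {baseball_r:.2f}")
print(f"hockey-like   (60% self-determined): r = {hockey_r:.2f}")
```

The point is only directional: the smaller the self-determined share, the weaker the link between what a general manager can measure and what the player would do on a different team. The 90% and 60% figures remain purely illustrative.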
The fact that hockey analytics is likely to be less useful than baseball analytics is not evidence that hockey analytics doesn’t have value. However, it suggests that claims based on hockey analytics require a higher standard of proof; when general managers decide on the approach to analysis, they have to be aware of the limitations of hockey analytics, and not get too caught up in the ‘wow’ that analytics offers. Nor should they believe that hockey analytics works because baseball analytics works. Given the difficulty (if not impossibility) of isolating the independent components of hockey player performance, hockey analytics will never supply the level of useful information that is available in baseball analytics.
Pseudo-analysis and data science
The increased availability of data and the tools to use them offer great potential for political decision makers, businesses and curious citizens. However, data analysis is tricky, often not very intuitive, and requires careful thought to use and interpret. Statistics do not lie, but people can easily lie with them, or more often, simply get them wrong. The emergence of data science needs to be rigorously scrutinized if it is going to inform; our world is already filled with false experts, and we don’t need any more of them.
Are the emperors of hockey analytics naked? At this moment, I have no reason to think hockey analysts of any type are actually experts–I have not seen any evidence that a team that employs any particular analytical strategy is likely to do better in the standings, win more playoff games or win more Stanley Cups than a team that does analysis the traditional way. If data exist that say otherwise, I’d love to see them. Until then, the ‘analytics turn’ of the NHL is a curiosity to me, and I’ll be interested in seeing how it plays out. I also suspect that many owners will be asking the same question before too long, if they aren’t asking it already.
1. My focus here is on analysis with numbers. Non-numeric analysis is an entirely different animal…