Sometimes I doodle at Faculty meetings…
Here’s an interactive map you can use to look at the average annual changes in income and dwelling value in Hamilton, Ontario from 2000 to 2015. Zoom in with the +/- sign, and navigate around with your mouse. Click on a house icon to see average annual change in dwelling value, and on the area immediately surrounding it to see the average annual change in income. Positive values indicate an increase, and negative values indicate a decrease.
Data are from the census at the dissemination area (DA) level. Missing data are excluded from the map; this includes any DA for which data were missing for any of the years between 2000 and 2015. The numbers are the slopes of the linear trend fitted to the four Census years (2000, 2005, 2010, 2015). In short, they represent the average linear change over the 15-year period. The values have been divided by 5 (the interval between censuses) to give an annual average. All data are in 2015 dollars using Bank of Canada inflation adjustments.
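For readers who want to see the arithmetic, here is a minimal sketch of that calculation for a single DA. It assumes the trend is fitted against the census index (0, 1, 2, 3) and then divided by 5, which is how I read the description above; the function name and the income figures are made up for illustration.

```python
def annual_change(values, interval_years=5):
    """Least-squares slope across equally spaced census points, per year."""
    n = len(values)
    steps = range(n)  # census index: 0, 1, 2, 3
    mean_x = sum(steps) / n
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(steps, values))
    sxx = sum((x - mean_x) ** 2 for x in steps)
    slope_per_census = sxy / sxx  # change per census interval
    return slope_per_census / interval_years  # change per year


# Hypothetical average incomes (2015 dollars) for 2000, 2005, 2010, 2015.
incomes = [50000, 52000, 55000, 56000]
print(annual_change(incomes))  # 420.0
```

A positive result means the fitted trend rises over the period, and a negative result means it falls.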
I use Google Alerts to notify me of English language reporting on cancer clusters. I have been keeping track of some stories about cancer clusters for a few years, and I have noticed a fairly persistent model for how concerns about cancer clusters enter the public sphere. It looks like this:
As an example, we can look at the Auburn ocular melanoma cluster in Alabama. The first main reporting on the cluster was on February 2, 2018. At the time, reporting suggested 5 cases, 3 women and 2 men, all of whom had links to Auburn University over the same period.
Right around when these first media stories came out, a Facebook support group was formed. By February 13th, another media outlet reported the cluster now included 18 people. The Facebook account reported 31 cases as of March 22, 2018. By April 4, Healthline, an online media outlet, reported a cluster of 33 people.
It’s hard to say whether or not this cluster is a real concern. Five cases may or may not be higher than expected by chance (it’s tricky to know for sure), but it is certainly high enough to justify some further investigation. However, I find it very hard to believe that the cluster has 33 confirmed cases in it. If it did, then there is something seriously, seriously wrong at Auburn.
As everyone knows, social media is excellent at connecting people, but it isn’t excellent at sharing correct information, and it is usually not a platform for rigorous analysis and decision making. Online media sources (and some traditional sources) seem increasingly prone to ignoring good journalistic practice and focusing on the sensational–in this case, the large number of cases reported on Facebook–rather than carefully vetting information to confirm its validity.
The problem is that this journalistic failure can have many serious and tangible adverse consequences.
First, it can create unnecessary alarm. I imagine many Auburn alumni are now very concerned about their eye health. This concern has an emotional, financial and physical cost to them and their families. The emotional strain and the medical interventions that could follow may even lead to other new health challenges.
Second, the reputation of Auburn University (and the town it is situated in) may have been damaged. Even if there is a cluster, it’s possible that the cause has nothing to do with the university at all. If an investigation does find fault–either presently, or historically–then someone at the university should be held responsible, but for the moment, there is insufficient evidence to even imply blame.
Third, the outcomes of cancer cluster investigations are rarely satisfying to the communities they affect. The vast majority of the time, these investigations find no evidence of a cluster, or even an elevated risk of cancer. To the people in the community this is often inconceivable–especially once the media has amplified their concerns. The result is dissatisfaction, a loss of faith in the institutions involved–including cancer experts and government–and even rifts in the community.
I don’t mean to imply that the media sources behind some of this reporting are being deliberately dishonest, or that the information shouldn’t make the news. However, given the potential consequences of misinformation, they have a responsibility to be exceptionally careful about how they report the story. Unfortunately, I see few examples of the media (traditional or otherwise) reporting this information with the necessary care or attention to detail.
As I have proposed before, one solution to this problem is to get out ahead of it. Government agencies need to do routine surveillance of cancer and the main environmental determinants of cancer, and then routinely report this information to the public. This openness can build trust, inform the public about what the risks actually are, and provide useful context for media reports that could emerge over time. It also increases the rigour of cancer investigations.
There are many challenges to implementing such surveillance schemes, and chief among them is perhaps the cost of implementation. However, the costs of cancer cluster investigations are not trivial. I am not aware of any analysis of the actual economic costs, but even if we assume that there are only 1000 investigations a year in the US (probably a low estimate) and that each costs $100,000 in salaries, travel, lab costs, etc., then that’s $100 million a year. Routine cancer surveillance does not have to cost much money, as the data are already collected as standard practice in many jurisdictions, and the monitoring for clusters could be done using fairly simple machine learning systems.
Even setting the costs aside, the benefit of a routine surveillance approach is that real clusters are more likely to be detected in a timely manner. Good surveillance systems may be able to identify statistical anomalies earlier on in the process, which could help reduce the risk of future harm.
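To give a sense of what the simplest possible anomaly check might look like, here is a rough sketch that asks how likely it is to see at least a given number of cases if cases follow a Poisson distribution with a known expected count. This is only an illustration, not a description of any agency's system: the function name is made up, and the expected count of 1.0 is purely hypothetical (I do not know the real expected number for any actual cluster).

```python
import math


def poisson_tail(observed, expected):
    """P(X >= observed) when X follows a Poisson(expected) distribution."""
    below = sum(
        math.exp(-expected) * expected ** k / math.factorial(k)
        for k in range(observed)
    )
    return 1.0 - below


# Hypothetical inputs: 5 observed cases where 1.0 would be expected.
p = poisson_tail(5, 1.0)
print(f"{p:.4f}")
```

A small probability flags an area for a closer look and nothing more. A real system would also have to account for the fact that it is scanning many areas and time periods at once, which makes some small probabilities inevitable by chance alone.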
Cancer clusters have been a fraught subject for decades. People affected, statisticians, epidemiologists and physicians all have their own take on the subject, both in general and in specific cases, and sometimes furiously disagree. Unfortunately, some media participation in this subject stirs up controversy and concern. Since social media and dubious online reporting are here to stay, we need to improve cluster surveillance practice to get ahead of the challenge.
About 7 years ago I had a graduate student working on geographic patterns of arson in Toronto. We published one of the chapters from her work, but the other one lingers. It lingers because it was pretty clear to me that, in spite of the fact that our analysis suggested some interesting processes at work, arson is a black swan.
What do I mean by black swan? Well, I mean it in the sense used by Nassim Taleb, a cranky statistical philosopher who authored three important books: Fooled by Randomness (my favourite), The Black Swan and Antifragile. Taleb’s focus is on black swan events–very rare but highly impactful phenomena that are very hard (if not impossible) to predict, but that we often come up with explanations for after the fact. Examples include stock market collapses and major terror attacks. Taleb argues that we give false authority to experts who claim to understand black swans, and recommends that instead of trying to predict or explain these events, we should learn how to build systems that actually benefit from black swans when they do happen.
Arson is a black swan spatial process because the realization of arson frequency in space is made up of a small amount of explainable variation (population, poverty, housing conditions, street permeability) and a whole lot of hard-to-explain variation. The unexplained variation could be due to many processes, known and unknown. For example, the unexplained variation could be driven by serial behaviour; an arsonist sets a large number of fires in a small area in one year, and then nothing in the next. We know that serial criminal behaviour occurs; however, predicting it is hard (if not impossible). Or perhaps it has to do with some unknown process. In either case, our work on this problem strongly suggested that predicting where small clusters of intense arson activity will occur in the long run is a fool’s errand.
What’s the problem?
It is fairly easy to publish research showing only the explained part of a system even if the explained part is a small component of the variation of the system overall. This is because any explanation (even if small) seems to be of some value. If a physician tells you that you need to change your diet to reduce your blood pressure, she’s not making a specific prediction about what will happen to your blood pressure if you don’t change your diet. This is impossible to know. In fact, most variation in blood pressure is not caused by diet. Nevertheless, she’s using information that shows how diet explains some variation in blood pressure in populations as a whole to offer you advice that, on average, is probably helpful.
When we worked on the factors that explain geographic variation in arson, my student came up with a model that explained some of the geographic variation in arson. She was even able to identify which areas of the city had higher and lower arson frequency in the long run. However, year to year black swans (what I suspect are probably clusters of unpredictable serial arson behaviour) made the predictions of arson quite poor, usually leading to major under-predictions. The following figure is illustrative:
The purple dots on this figure are the actual number of arson events in a given year across Toronto neighbourhoods. In some neighbourhoods there are over 35 arson events in a year, but the city average is around 2 or 3. Attempts to model these data using things like population, commercial activity, poverty, street permeability and other factors can’t predict these extreme variations, and we never found a term to put into a model that picks up much of this variation in a training data set and can then predict the variation in other data sets.
(technical note: what’s particularly funny about the above example is that the ‘best’ performing model structure here is the old-fashioned linear model, seemingly because it picks up the possibility of extreme variations better than models that attempt to parameterize it through some specific model link structure or some scaling parameter.)
Given enough data, anyone can model some of the pattern in almost any phenomenon. The fundamental question is how well can your model predict future patterns? For this arson project, the predictions were just not compelling enough; sure, we could predict the relatively higher and lower arson neighbourhoods, and some of the factors that may explain some variation from neighbourhood to neighbourhood. However, the real challenge is being able to predict the extremes–the neighbourhoods which are suddenly targeted by serial arsonists, resulting in a large number of arson events that whip up fear and threaten the safety of a community. This is not a simple task, and we certainly had no luck with it, so the chapter sits unpublished.
This also points to the importance of context; a model that explains a small amount of variation in a system might be very useful if that knowledge can save lives, or save money. In this case, I did not feel that our model was useful for anything–not for arson prevention, policing, urban design, etc., even if it did explain some variation in arson. However, going back to the hypertension example, there is evidence that a little information about diet and hypertension might be useful at a population scale.
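The gap between explaining long-run differences and predicting extremes can be sketched with made-up numbers. The counts below are hypothetical (not the Toronto data), and the long-run mean is only a stand-in for a fitted model, not the model my student built.

```python
# Hypothetical yearly arson counts for three neighbourhoods (not real data).
history = {
    "A": [2, 3, 2, 3],
    "B": [1, 2, 1, 2],
    "C": [3, 2, 3, 2],
}
# Hypothetical following year: neighbourhood C is hit by a burst of fires.
next_year = {"A": 3, "B": 1, "C": 35}


def long_run_mean(counts):
    """Stand-in 'model': predict next year from the historical average."""
    return sum(counts) / len(counts)


predictions = {name: long_run_mean(counts) for name, counts in history.items()}
errors = {name: next_year[name] - predictions[name] for name in history}

for name in history:
    print(name, predictions[name], errors[name])
```

The stand-in model gets the ordinary neighbourhoods roughly right (errors of half an event) and misses the burst in C by more than 30 events.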
I present here a simple idea for breaking down how I typically plan out courses.
I have three considerations: time (T), accessibility (A) and rigour (R). Accessibility is the breadth of audience that I reach; basically, the number of students who will get value from a lecture or class. Rigour is the completeness of the material. Time is the time available to teach.
With this in mind, I propose the following.
1. Time is proportional to the product of accessibility and rigour (T = A*R)
Time increases as rigour and accessibility increase
2. Accessibility is proportional to time divided by rigour (A = T/R)
The idea here is that if infinite time were available, it would be possible to teach any student anything with as much rigour as required.
3. Rigour is proportional to time divided by accessibility (R = T/A)
For a fixed period of time, any increase in accessibility will reduce rigour.
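The trade-off in point 3 can be played with in a few lines. This is a toy sketch only: the function name is made up, and the units for accessibility and rigour are arbitrary, since the relationships above are proportionalities rather than measurements.

```python
def rigour(hours, accessibility):
    """R = T / A: rigour achievable for a given time and accessibility."""
    return hours / accessibility


class_hours = 36  # hypothetical course: 12 weeks x 3 hours

# Same time, broader audience: rigour falls.
print(rigour(class_hours, accessibility=4))  # 9.0
print(rigour(class_hours, accessibility=9))  # 4.0

# Same audience, more time: rigour rises again.
print(rigour(class_hours + 12, accessibility=9))
```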
With this in mind, we get the following visual model to help understand the relationship:
As a university professor I have some control over time, but not much. I do have control over accessibility and rigour. For courses in which I know the material must remain accessible to a broad audience, I generally have to lower rigour. If a course needs to be rigorous, then I expect accessibility to decline.
While I have little control over classroom time, I have discovered that online tools can be useful for increasing the time of instruction. Using readings, online quizzes, and video content, I can increase content without requiring more class time. I use this extra time to delve into details I can’t cover in class–and add rigour.
This is all obvious to experienced instructors; however, my treatment here is a bit more rigorous than what one typically sees in discussions of teaching strategies. Which, unfortunately, means I very likely lost your attention several paragraphs ago.