On Disease Clusters

(an abbreviated version of this article was published in the Hamilton Spectator in February 2017)

Every once in a while a news source will report a local concern about the high frequency of a disease in some community.  This recently happened in Ottawa, where some residents have developed a concern about the apparently large number of multiple sclerosis cases in their community.  Well-known examples include the concerns about cancer clusters in Hinkley, California (the basis for the movie Erin Brockovich), a cluster of premature death near the Bhopal disaster in India, high thyroid cancer incidence in some people near the Chernobyl disaster in Ukraine, and more recently, the fear of cancer in Fort Chipewyan, Alberta, a community situated downstream of the Oil Sands projects in northern Alberta.

Disease clusters become a concern because they may indicate the presence of a local environmental hazard that should be addressed.  The challenge, however, is determining whether a local concern about a disease actually qualifies as a real disease cluster, and if not, convincing the public that their intuitions about the suspected cluster might be wrong.  Academics have struggled over the problem of disease clusters for decades, and while many of the technical and statistical details have been more or less worked out, policy and legal issues remain an important stumbling block.

In this article I want to discuss the nature of disease clusters, why fear of disease clusters can be both natural and unwarranted, and a possible strategy for better (and more publicly satisfying) disease cluster surveillance in the future.

What are disease clusters?

As a working technical definition, I define disease clusters as instances of sickness or death that occur in a relatively small geographic area at a frequency unlikely to have occurred by chance alone.  There are two general classes of disease clusters.  The first is infectious disease clusters–where the immediate cause of illness is a specific pathogenic (‘illness creating’) microorganism–and the second is non-infectious disease clusters.  Infectious disease clusters are typically investigated by first testing samples of the pathogen (extracted from the people who are sick) in a laboratory setting.  If samples from different people who are sick at roughly the same time are linked to the same pathogen strain, then epidemiologists will look at other commonalities among the ill–such as whether or not they ate similar food recently, or spent time in similar locations.  Many infectious disease clusters that make news headlines these days are associated with tainted food and, given the global distribution of food today, may not occur within a small geographic area.  Indeed, some infectious disease ‘clusters’ associated with produce have occurred at national and global scales–such as the European fenugreek outbreak in 2011, and the Dole packaged salad outbreak in North America in 2016.

Both concerns about and investigations of infectious disease clusters are largely led by scientists and epidemiologists.  People in the community may identify themselves as sick, but laboratory clinicians and scientists are usually aware of infectious disease clusters before the public.  This is partly because laboratory work done to diagnose disease often initiates the investigation protocols that allow scientists to quickly investigate the sources of disease outbreaks.  Furthermore, the identification of these clusters is generally not a statistical exercise; laboratory procedures may be able to identify a common genetic or immunological feature in a sample that links a pathogen to other instances of disease with near certainty.  When this happens, the cluster investigation involves linking persons infected by this pathogen to the common source (e.g., tainted food), rather than fiddling around with statistical procedures to determine whether or not the cluster represents a true statistical anomaly.


The second type of disease cluster (non-infectious) is more reliant on statistics and probability for identification.  This is because the precise cause of non-infectious disease is usually unclear.  Indeed, epidemiologists typically use the term ‘risk factor’ in place of the word ‘cause’ when referring to the putative hazards that may result in a non-infectious illness.  There are many sub-types of non-infectious disease clusters, but the most common concern (by far) is cancer clusters.  Some cancer clusters–particularly in occupational settings–have been central to identifying causes of disease–such as in the case of asbestos and mesothelioma (a cancer of the lining of the lungs and other organs).  However, suspected cancer clusters in the community rarely lead to the identification of environmental risk factors of consequence, and more often than not, are associated with a high degree of misdirected public concern.

The problem with cancer clusters

The fear of cancer clusters does not usually originate with the cancer registries or public officials tasked to monitor public health.  Instead, members of the public grow concerned about cancer because of 1) a prior fear of something in the environment (like a nuclear power plant, electricity transmission lines or oil sands mining) and 2) a perception that there is more cancer in the community than there should be.  In order to illustrate the process, let’s use a hypothetical example.

Imagine a Canadian family in which a family member has been recently diagnosed with brain cancer. The diagnosis is devastating, and the family struggles to cope.  As news of the diagnosis spreads, other members of the community disclose their encounters with cancer–another brain tumour, a few cases of breast and prostate cancer, and a rare form of leukemia.  All told, there are 12 recent cases of cancer in a small community of only 1000 people in a particular year.  Compared to the national cancer incidence rate (4 in 1000 per year) this seems very high. Moreover, many people in the community have had fears of a nearby industrial facility for decades.  Together, the cancer and the putative hazard create a local buzz.  Before long, newspapers are involved, and politicians are forced to respond.

Most people have an intuition about randomness, and may think that the problem of false clusters can be easily resolved by determining the ‘statistical significance’ of the 12 cases of cancer.  There are many procedures for doing this, and one of them would suggest that there is roughly a 0.1% chance of seeing 12 or more cases of cancer if there were in fact no real difference in cancer risk between this community and the country as a whole.  With a little bit of statistical education a person in the community could calculate this value for themselves, and take it as stronger evidence of a cancer crisis in the community.
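
To make the 0.1% figure concrete, here is a minimal sketch of one way such a value could be calculated.  It assumes the number of cases in the community follows a Poisson distribution (or, alternatively, a binomial distribution) with the national rate of 4 per 1000 per year; the article does not say which procedure was used, so the choice of distribution and the function names are my own assumptions.

```python
import math

def poisson_tail(k_min: int, mean: float) -> float:
    """P(X >= k_min) for X ~ Poisson(mean), computed as 1 - P(X < k_min)."""
    lower = sum(math.exp(-mean) * mean**k / math.factorial(k) for k in range(k_min))
    return 1.0 - lower

def binomial_tail(k_min: int, n: int, p: float) -> float:
    """P(X >= k_min) for X ~ Binomial(n, p), computed as 1 - P(X < k_min)."""
    lower = sum(math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min))
    return 1.0 - lower

population = 1000        # community size from the example
rate = 4 / 1000          # national incidence per person per year from the example
observed = 12            # cases seen in the community
expected = population * rate  # 4 expected cases if risk matches the country

print(f"Poisson  P(X >= {observed}) = {poisson_tail(observed, expected):.5f}")
print(f"Binomial P(X >= {observed}) = {binomial_tail(observed, population, rate):.5f}")
```

Both versions give a probability a little under 0.001, which is where a figure of roughly 0.1% can come from.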


But there is a problem here.  Statistics only have meaning when we have a definable sampling population with which we can contextualize our observations.  The 0.1% chance above seems like an anomaly when viewed only in the context of that single community.  But Canada is a country of 35 million or so people within which there are many possible communities of 1000 people.  Some communities are within cities, some are groups of small villages, some are stand-alone towns of 1000 people.  In purely theoretical terms, the number of all possible communities of 1000 people that can be made from a population of 35,000,000 people is utterly enormous (and functionally infinite).  But practically speaking we usually think of clusters as geographically contiguous, so the real number of all possible communities is much smaller.  How many identifiable communities of 1000 people there are is unknown to me, but there are very plausibly thousands of them in Canada.

For any single community the 0.1% value seems to be evidence of an anomaly that can’t be explained by statistical randomness, but given that there are thousands of communities in the country, it is also a statistical certainty that some of them would have at least 12 cases of cancer in a given year even if the true risk of cancer was the same in all the communities.  In fact, we expect about 1 in 1000 communities to have 12 or more cases of cancer by chance alone.  If you have difficulty understanding this, consider gambling.  Based on how the roulette table is designed, we know with certainty that in the long run people will lose money playing roulette.  But every once in a while a person gets lucky and wins.  This is not because the person is a roulette genius, but because with tens of thousands of players playing roulette we expect a few runs of really good luck even if the long-run expectation is a loss.
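
The arithmetic behind this can be sketched as follows.  This is only an illustration under simplifying assumptions that are mine, not facts from any registry: every community has exactly 1000 people, case counts are Poisson with a mean of 4, and communities are independent of one another.  The community counts tried below (1000, 5000 and 35,000) are hypothetical.

```python
import math

def poisson_tail(k_min: int, mean: float) -> float:
    """P(X >= k_min) for X ~ Poisson(mean)."""
    lower = sum(math.exp(-mean) * mean**k / math.factorial(k) for k in range(k_min))
    return 1.0 - lower

def chance_clusters(n_communities: int, k_min: int = 12, mean: float = 4.0):
    """Return (expected number of communities with >= k_min cases,
    probability that at least one such community exists),
    assuming identical risk everywhere and independent communities."""
    p = poisson_tail(k_min, mean)
    expected = n_communities * p
    p_at_least_one = 1.0 - (1.0 - p) ** n_communities
    return expected, p_at_least_one

# Hypothetical numbers of communities; the true number is unknown.
for n in (1000, 5000, 35000):
    expected, p_any = chance_clusters(n)
    print(f"{n:>6} communities: expect {expected:5.1f} with 12+ cases, "
          f"P(at least one) = {p_any:.3f}")
```

With only a few thousand communities, at least one apparent ‘cluster’ of 12 or more cases is close to guaranteed even when risk is identical everywhere.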

Similarly, we can be pretty certain that apparent disease clusters will appear anomalous to people living in some communities even though anomalies of equal and greater magnitude would be expected to occur even if the true underlying risk of cancer was the same everywhere.  This is because there are many communities of people, each of which represents a metaphorical ‘spin’ of the roulette wheel.

So can statisticians help us determine once and for all when clusters do and do not occur? Well, yes and no.  Statisticians can provide some sense of the likelihood of a cluster being an anomaly, and provide an essential starting point for disease cluster investigations.  The problem is that it is very hard to establish a simple criterion for determining whether or not a disease cluster is real because we do not really know the sampling distribution; specifically, we do not know the number of communities within which a cancer cluster could exist.  It’s the equivalent challenge of trying to determine if a coin is fair (50% heads) by flipping it a large number of times, but without keeping track of how many times it has been flipped.  So unless the cluster is a very striking anomaly–say 100 cases in a population of 1000–then we are somewhat in the dark about whether or not the cluster should really be a concern.

Is there a solution?

In the real world, public fear about non-infectious disease clusters emerges because people combine their perceptions about risk in their community with bits of evidence about disease in their family, social circle, neighbourhood, etc.  Social media has enabled the communication of perspectives and information in positive ways, but can also feed into concerns that are unjustified based on the evidence.  Once these concerns emerge, there is little use trying to convince people otherwise; indeed, doing so could get you labelled a conspirator in the hazard thought to produce harm.  Tensions rise, demands are made of regulators and government, there are civil legal proceedings on occasion, all of which can contribute to bitterness and cynicism, and a weakening of community cohesiveness.

In a paper I authored a few years ago, I advocated the idea of place-based disease surveillance.  Specifically, I argued that governments have to proactively define the places in which disease clusters might arise, and engage in a transparent and public monitoring system in these places.  Importantly, the defining of ‘place’ should be a consultative exercise–taking into account what people define as their geographical communities–but the analysis of clusters must be formalized, and the domain of statisticians.  In statistical terms, the communities defined in this process (there could end up being tens of thousands of them) would give us an incomplete but still useful sampling distribution which could then be used in routine statistical surveillance for disease clusters, the results of which could then be openly shared with the public.

The idea here is that by establishing the sampling distribution ahead of time, we now have a means of comparing apparent cancer cluster anomalies with a defined range of expectations in a systematic way.  Consider the table below:


If we assume that there are 35,000 places (each with a population size of 1000) that a cancer cluster surveillance program should monitor (perhaps there should be more?), then the expected number of communities with each number of cases (assuming no real geographic variation in risk) would break down as above.  We should expect that most communities have at least one case of cancer per year.  When a community somewhere has more than 14 cases of cancer in a year, then work needs to be done to identify possible hazards in the environment.
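
For readers who want to work through a breakdown of this kind themselves, here is a minimal sketch.  It assumes 35,000 places of 1000 people each, the 4 per 1000 annual rate used in the earlier example, and Poisson-distributed case counts; the Poisson assumption and the function names are mine, so the output should be read as an illustration rather than a reproduction of any published table.

```python
import math

N_PLACES = 35_000   # number of monitored places, from the text
MEAN_CASES = 4.0    # assumption: 4 per 1000 per year in a place of 1000 people

def poisson_pmf(k: int, mean: float) -> float:
    """P(X = k) for X ~ Poisson(mean)."""
    return math.exp(-mean) * mean**k / math.factorial(k)

def expected_places_table(n_places: int, mean: float, max_cases: int):
    """Expected number of places with 0, 1, ..., max_cases-1 cases,
    plus a final row for max_cases or more."""
    rows = [(str(k), n_places * poisson_pmf(k, mean)) for k in range(max_cases)]
    tail = n_places * (1.0 - sum(poisson_pmf(k, mean) for k in range(max_cases)))
    rows.append((f"{max_cases}+", tail))
    return rows

print("cases  expected places")
for cases, places in expected_places_table(N_PLACES, MEAN_CASES, 15):
    print(f"{cases:>5}  {places:10.1f}")
```

Under these assumptions fewer than one place in the whole country is expected to have 15 or more cases in a year by chance, which is consistent with treating ‘more than 14’ as a trigger for investigation.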

This is not a profound statistical innovation; generally speaking, this is what cancer surveillance programs try to do.  However, usually little attention is paid to how the places of observation are defined, and even less to releasing systematic local information to the public.  This lack of public involvement may leave communities to think that they are being ignored, so that when clusters are observed, they are less likely to contextualize them within the bigger statistical picture.  Instead, the apparent anomaly will create concern that may often do more harm than good.


Thanks to Katie Davidman and Anthony Karosas for their insightful comments on earlier drafts of this article.

Problematic statistical maps

A few years ago the Hamilton Spectator published an interactive web site on cancer data in Hamilton as part of its Code Red series.  There were a number of stories published on the subject highlighting the variation in cancer in the city, and its tendency to apparently cluster in lower Hamilton.

Here is a screen capture of one of the maps:


The map is interactive, insofar as you can hover over the census tracts (roughly equivalent to neighbourhoods) to see the local cancer risks, as well as some other information about the tract.

There are four important things wrong with this graphic.

  1. At first glance, everyone should be terrified when they look at this information–these data suggest that the average person over 45 years of age has somewhere between a 12% and 19% chance of getting cancer every year!  The problem is that these incidence rates are very wrong.  The annual age-standardized all-cancer incidence in Ontario is around 5-6 per 1000.  For persons over 45, the rate is around 12-15 per 1000, which corresponds to an annual risk of around 1.5%.  This map suggests a risk roughly 10 times that.  I had some discussions about this issue with the authors of this map a few years ago, and I recommended changing how the information was displayed, but they chose to leave the maps unchanged.
  2. The rates are not age-standardized.  Age standardization is used to correct maps for the effect that geographic variation in age can have on disease and death.  The primary risk factor associated with almost all cancer is age, and the effect of age on cancer risk is so strong that we usually adjust it away so we can focus on causes that are modifiable (like smoking).  This map does not properly correct for geographic variations in age; it only restricts the data to persons over 45 years of age.  This means that some (perhaps even most) of the geographic variation in this map is still due to geographic variation in age.
  3. The incidence rate for the Barton Street East location is reported as 340 per 1000.  This is almost certainly a statistical anomaly due to the small numbers problem.  When population sizes are small, incidence and mortality rates can often appear anomalously high (or low) when a disease is rare (like cancer). Part of the reason I suspect this explains the high rate in this tract is that there are missing census indicators for the pop-up table associated with this community–a problem most typical for small population areas.  At the very least, this rate should be looked into more carefully.
  4. The actual geographic variation in risk in Hamilton is fairly small, but the colour gradient suggests a striking visual contrast that is, in my view, misleading.  If we divide the rate in the highest rate tract (excluding the Barton East anomaly) by the rate in the lowest rate tract, we get the largest rate ratio.  The largest rate ratio measures the largest degree of variation in rates between geographic areas.  In this map, the largest rate ratio is around 1.2.  This means that the highest rate tract has a cancer incidence rate 20% larger than the lowest rate tract.  This is not a huge difference in risk, particularly for cancer, which is fairly rare.  It would mean that at most we see a handful more cases of cancer in the highest risk tract compared to the lowest risk tract, and that’s before accounting for differences due to chance or due to the uncontrolled variations in age between tracts.  The ‘red’ colour compared to the ‘green’ colour is suggestive of a more dire reality–where people in some regions of the city are at significantly greater risk of cancer.
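
To put the rate ratio in point 4 into absolute terms, here is a rough sketch of what a ratio of 1.2 means in cases per year.  The tract population of 1500 people over 45 is a hypothetical number chosen only for illustration, and the baseline rates of 12 and 15 per 1000 are the corrected over-45 rates from point 1, not the rates shown on the map.  The square root of the expected count is included as a rough scale for year-to-year chance variation, assuming Poisson-distributed counts.

```python
import math

def extra_cases(baseline_per_1000: float, rate_ratio: float, population: int):
    """Expected annual cases in the lowest-rate and highest-rate tracts,
    and the difference, for two tracts of equal population."""
    low = baseline_per_1000 / 1000 * population
    high = low * rate_ratio
    return low, high, high - low

TRACT_POP_OVER_45 = 1500   # hypothetical tract size, not taken from the map
RATE_RATIO = 1.2           # largest rate ratio reported in the text

for baseline in (12, 15):  # per 1000 per year, persons over 45
    low, high, diff = extra_cases(baseline, RATE_RATIO, TRACT_POP_OVER_45)
    chance_sd = math.sqrt(low)  # Poisson standard deviation of the low-rate count
    print(f"baseline {baseline}/1000: {low:.1f} vs {high:.1f} cases/year, "
          f"difference {diff:.1f}, chance SD about {chance_sd:.1f}")
```

Under these assumptions the gap between the two extreme tracts is about four cases a year, which is similar in size to the variation expected from chance alone.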

My conclusion

The incidence rates presented on the map are misleading as estimates of absolute risk.  I think I know what the authors have done wrong, but since the methods are not obviously available, it’s hard to know for sure the source of the problem.

Second, while there appear to be some differences in cancer risk in Hamilton, and these differences may warrant some concern, the differences are not large, and the map should better represent the reality, perhaps with less alarming contrasts in colour, or with some more context for interpreting differences in risk.  I am all for sharing data with the public, but it must be done properly, and I think this is an example of the opposite.  These ‘Code Red’ graphics may make for a sensational story, and may have boosted newspaper sales, but they are problematic, and should be interpreted with great care.