Coronavirus data: good news and bad news

Today I want to discuss novel coronavirus (SARS-CoV-2) testing and the widely available data resulting from this testing. My conclusion is that, in spite of the wealth of shared data out there, these data, and much of the analysis based on them, are not useful, because they do not give much indication of the infection level in the population. The corollary, however, is that the most widely shared current case fatality rate estimate (ranging between 1 and 3%) is probably too high, and the true case fatality rate could easily be less than 0.5%, particularly in countries with functioning health care systems.


I’ll start by defining the key measures I use in this analysis: proportion of positive tests (PPT), the proportion of the population testing positive (PPTP), the proportion of the population tested (PT) and the case fatality rate (CFR).

The PPT is the ratio of positive tests to all tests conducted, and tells us what fraction of the people selected for testing have tested positive. In other words, PPT is an estimate of how likely a test is to be positive among the people being tested. PPTP is the ratio of positive tests to the total population, and is an estimate of the fraction of the population known to be infected based on positive test results. PT is the ratio of all tests conducted to the total population, and tells us the fraction of the population that has been tested.

Here is a figure that helps visualize the difference between these concepts:


The large light blue circle is the entire population, the middle blue circle is the number of people tested, and the darkest blue circle is the number of people testing positive. Each of these circles is a known quantity. Population is known from the census, the middle circle is the number of tests conducted by clinicians and health officials, and the smallest circle is the result of these tests (assuming that the tests are themselves accurate, which I think is generally accepted).

The last measure, CFR, is the ratio of all deaths linked to the coronavirus (the purple circle) to the total number of positive cases. In many ways, this is the most important number of all. If CFR were 0.1%, this coronavirus would not have gotten anywhere near the attention it has received; it is the fact that CFR appears to be so high (greater than 1% by most common estimates) that causes so much concern.

With the exception of PT, which is probably fairly accurate, what these measures represent is complicated by the fact that the method of selecting people for testing is not random. If people were randomly selected for testing, then PPT would be a great estimator of the true proportion of the population infected. If we saw changes in this figure over time, we could be confident that they reflected a change in the level of infection in the population. Furthermore, we could fairly accurately estimate the CFR as well: we’d simply divide the deaths in the sample by the total number of positive cases.

As it turns out, the decision to test people is not based on a random selection of the population. For practical reasons, the test is administered to a subset of the population that meets some pre-test criteria for testing. Although the rules vary by jurisdiction, in general, testing appears to be increasingly focussed on high risk/vulnerable populations and health care workers. Earlier in the outbreak, tests were targeted at people with a high pre-test probability of infection (such as travellers to high risk countries with symptoms). Now, the testing decisions may have changed. In some jurisdictions, people who have travelled and have symptoms are simply told to self-isolate untested, under the assumption that they are probably infected. In other jurisdictions, testing is becoming more widespread.

Lots of data!

There are mountains of data available on coronavirus, and seemingly thousands of data scientists creating beautiful maps and graphs online. I’ve made a few of my own (though, maybe they’re not that beautiful). Given that the process for selecting people for testing is not random, what are we to make of the data that underlie all this analysis? Are all these nice maps and graphs useful? Are the numbers right?

This is where things get a bit tricky. If the pre-test screening is very accurate at identifying cases (specifically, includes most or all infected people who will test positive and very few uninfected) then PPT is a poor measure of infection. In fact, it will drastically overestimate levels of infection.

However, if we knew this was the case, we could use PPTP as an estimate of infection rate, since the number of screened cases would be close to the real number of cases. Of course, if the screening process were that accurate, then we wouldn’t need the laboratory confirmed test in the first place. The unfortunate reality is that the screening process is effective at identifying some likely cases, but misses many others, and also flags many people who are not infected (in fact, the vast majority of those selected for testing turn out to be negative).
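The effect of non-random selection can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the prevalence (2%), the chance of showing symptoms (60% if infected, 5% if not), and the rule "test only symptomatic people" are all hypothetical and stand in for whatever screening criteria a jurisdiction actually uses.

```python
import random


def simulate(population=100_000, prevalence=0.02, n_tests=5_000, seed=1):
    """Compare PPT under random testing with PPT and PPTP under symptom-based testing."""
    rng = random.Random(seed)

    # Each person is a pair (infected, symptomatic). All rates are hypothetical.
    people = []
    for _ in range(population):
        infected = rng.random() < prevalence
        p_symptoms = 0.60 if infected else 0.05
        people.append((infected, rng.random() < p_symptoms))

    true_rate = sum(inf for inf, _ in people) / population

    # Random selection: every person is equally likely to be tested.
    random_sample = rng.sample(people, n_tests)
    ppt_random = sum(inf for inf, _ in random_sample) / n_tests

    # Targeted selection: only people with symptoms are eligible for a test.
    symptomatic = [p for p in people if p[1]]
    targeted = rng.sample(symptomatic, min(n_tests, len(symptomatic)))
    positives = sum(inf for inf, _ in targeted)
    ppt_targeted = positives / len(targeted)
    pptp_targeted = positives / population

    return true_rate, ppt_random, ppt_targeted, pptp_targeted


true_rate, ppt_random, ppt_targeted, pptp_targeted = simulate()
print(f"true infection rate : {true_rate:.4f}")
print(f"PPT, random testing : {ppt_random:.4f}")
print(f"PPT, targeted       : {ppt_targeted:.4f}")
print(f"PPTP, targeted      : {pptp_targeted:.4f}")
```

Under these assumptions, PPT from random testing lands close to the true rate, PPT from targeted testing is several times too high, and PPTP from targeted testing is too low because infected people without symptoms are never tested.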

The decision about who to test and not test is influenced by many factors, and unsurprisingly, testing frequency varies considerably around the world. In the US, testing (as measured by the proportion of the population tested, or PT) is lower than in many other countries. Based on data I have found online, it’s around 2.3 per 10,000 people at present. In Canada, PT is around 13.5 per 10,000 people. In South Korea, the number is around 60 per 10,000, and perhaps more. I can’t find any firm data on how many tests have been conducted in Germany, but apparently it’s less than in South Korea and more than in most countries in Europe.

However, as I’ve hopefully made clear above, the number of tests does not necessarily influence the accuracy of our estimates of the proportion of the population infected. It does affect precision, and the ability to drill down into details–more tests mean increasingly local estimates of infection rates are possible. What matters more is the process or protocol for choosing who gets laboratory tests. The more random the process for selecting people, the more likely PPT can be used to accurately estimate the current proportion of infected population. The less random, the more uncertain we become.
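The distinction between precision and accuracy can be sketched with the usual standard error of a proportion. This assumes simple random sampling and a binomial model, which is exactly what current testing is not; the 5% positive rate and the sample sizes are hypothetical.

```python
import math


def standard_error(p, n):
    """Standard error of an estimated proportion p from n independent random tests."""
    return math.sqrt(p * (1 - p) / n)


# Hypothetical positive rate of 5% at increasing numbers of tests.
for n in (100, 1_000, 10_000):
    print(f"n={n:>6}  standard error={standard_error(0.05, n):.4f}")
```

More tests shrink the standard error, but the formula says nothing about bias: if the people tested are not representative, a larger sample only gives a more precise estimate of the wrong quantity.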

So what are we left with? Well, implicitly, people seem to be using the ratio of positive tests to total population (PPTP) as an indicator of the level of infection. This would be fine if we knew that the testing process captured every case, but we don’t know that. In fact, we can be pretty sure that many cases will go undetected, and that current and future case counts will be too low.


People (including me) are excitedly making maps, and sharing all sorts of data on coronavirus infections. I have seen many very beautiful interactive online tools of infection counts that are fun to explore. As amusing as these are to play with, I am not sure many have been useful. The number of cases in a region is a product of the level of infection, the population size, the number of people tested and the process for selecting people for testing. At present, the impact of a non-random testing selection process leaves us uncertain of what the risk actually is, pretty well everywhere.

This means that the level of infection globally has enormous uncertainty to it–no surprise there, really. This is even more true in our specific communities. This is not just because some people are yet to be tested because they do not show symptoms, but because they may never be tested based on the screening process. Moreover, this process may vary from place to place, and even over time, so it will be hard to make anything but broad and general comparisons.

This uncertainty probably amounts to an underestimate of the true number of cases. It could be a small underestimate or a large one. It’s an underestimate because the test selection process in most jurisdictions is biased towards people who are vulnerable, have serious symptoms and/or have travelled; asymptomatic infections and infections among non-travellers are probably being missed.

The good news is that if we are indeed under-counting the number of COVID-19 cases, then we are overestimating CFR. The case fatality rates in Germany and South Korea, where there is more widespread testing, are less than 1%. However, even these countries are still not randomly selecting people for testing, and may not be testing enough to use PPTP as an estimator of the infection level. As a result, it’s still very possible that cases are being under-counted in these countries, and that the true CFR in South Korea and Germany is less than 0.5%, or even less than 0.25%.
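The arithmetic behind this is simple enough to sketch. The function below is hypothetical, and so are the numbers; it assumes that every death is counted and that only some fraction of infections shows up as a confirmed case. That detection fraction is precisely the quantity nobody currently knows.

```python
def adjusted_cfr(deaths, confirmed_cases, detection_fraction):
    """Fatality rate after scaling confirmed cases up by an assumed detection fraction."""
    if not 0 < detection_fraction <= 1:
        raise ValueError("detection_fraction must be in (0, 1]")
    estimated_total_cases = confirmed_cases / detection_fraction
    return deaths / estimated_total_cases


# Hypothetical: 100 deaths among 10,000 confirmed cases (a naive CFR of 1%).
for fraction in (1.0, 0.5, 0.25):
    rate = adjusted_cfr(100, 10_000, fraction)
    print(f"{fraction:.0%} of cases detected -> fatality rate {rate:.2%}")
```

If only half of all cases are being detected, a reported CFR of 1% corresponds to a true rate of 0.5%; if only a quarter are detected, it corresponds to 0.25%.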

None of this changes the real impact of the coronavirus so far: thousands have died worldwide, and these deaths are tragic. Taking aggressive action to curtail the infection, even if the CFR is 0.5% or 0.25%, can still be justified on public health and ethical grounds. Moreover, the collapse of health care systems remains a real threat, and can cause knock-on effects, including deaths from other treatable conditions that go untreated because of health care system failures.


If accurate estimates of the proportion of the population affected are important to us, there are possible solutions. For one, there may be some re-sampling options for selecting quasi-random samples from the tested population. To do this would require information about the test subjects, and coordination between testing facilities. But it’s possible some re-sampling process could construct a synthetic ‘sample’ that is more generally representative of the population, and would give a better sense of the underlying proportion of persons infected.

There may also be some post-stratification options. This would involve weighting the tested populations so that under-represented observations are given greater weights, and over-represented observations are given smaller weights. I am not sure if this is possible, but I assume that someone is looking into it.
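As a rough idea of what post-stratification might look like, here is a minimal sketch. The two age groups, their population shares, and the test counts are all hypothetical, and the approach assumes that within each group the people tested resemble the people not tested, which is a strong assumption given how testing is targeted.

```python
def raw_ppt(strata):
    """Unweighted proportion of positive tests across all strata."""
    positives = sum(s["positive"] for s in strata.values())
    tested = sum(s["tested"] for s in strata.values())
    return positives / tested


def post_stratified_rate(strata):
    """Weight each stratum's positive rate by its share of the population."""
    total_share = sum(s["pop_share"] for s in strata.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("population shares must sum to 1")
    return sum(
        s["pop_share"] * s["positive"] / s["tested"] for s in strata.values()
    )


# Hypothetical strata: the older group is heavily over-represented among tests.
strata = {
    "under 60": {"pop_share": 0.75, "tested": 200, "positive": 4},
    "60 and over": {"pop_share": 0.25, "tested": 800, "positive": 80},
}

print(f"raw PPT             : {raw_ppt(strata):.3f}")
print(f"post-stratified rate: {post_stratified_rate(strata):.3f}")
```

In this made-up example the raw PPT is dominated by the heavily tested group, while the weighted figure gives each group the influence its population share warrants. Weighting cannot fix selection on symptoms within a group, so it would reduce the bias rather than remove it.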

More testing could help, particularly if it reaches a broad cross-section of the population. Low testing in some places around the world is almost certainly causing problems. In some cases it creates a false sense of security that could lead to more infection and more death when health care systems get hit with a spike of cases.

Random sampling of the population would solve the problem lickety-split, but that’s probably not going to happen. It would be expensive, particularly if it targeted regions or local areas. Moreover, how many people would subject themselves to a random coronavirus test by a government official knocking on the door? If the rejection rate was high, test refusal would end up biasing the data again.


Testing for coronavirus is important. It can be used for tracing the origin of cases, identifying people for isolation or quarantine, and determining whether or not the infection is present at all in a population. More testing has value, and as tests get easier and more widespread, the information will improve. Indeed, cheaper and easier testing (like home testing kits) could even be the key to getting control of the pandemic.

However, until testing becomes more representative (or we learn that the existing testing sample is already pretty representative), we should all be wary of much of the data we see and use. The current counts could be close to the mark, could be a small under-estimate, or even a large under-estimate. If they are an under-estimate, it also means that current estimates of the case fatality rate could be greatly inflated.

Postscript: the morning I published this post, I read a post by John Ioannidis (published on March 17th) that raises concerns similar to the ones I express above. Although I don’t draw exactly the same conclusions, I think he raises some important questions, and we both agree that random-sample testing for SARS-CoV-2 in the population could be very useful.