Coronavirus data: good news and bad news

Today I want to discuss coronavirus (SARS-CoV-2) testing and the widely available data resulting from this testing. My conclusion from this analysis is that in spite of the wealth of shared data out there, these data, and much of the analysis done using them, are not useful because they do not give much indication of the infection level in the population. The corollary, however, is that the most widely shared current case fatality rate estimate (ranging between 1 and 3%) is probably too high, and the true case fatality rate could easily be less than 0.5%, particularly in countries with functioning health care systems.


I’ll start by defining the key measures I use in this analysis: proportion of positive tests (PPT), the proportion of the population testing positive (PPTP), the proportion of the population tested (PT) and the case fatality rate (CFR).

The PPT is the ratio of positive tests to all tests conducted, and tells us what fraction of the people selected for testing have tested positive. PPT is an estimate of how likely a test is to be positive among the people who get tested. PPTP is the ratio of positive tests to the total population, and is an estimate of the fraction of the population known to be infected based on positive test results. PT is the ratio of all tests conducted to the total population and tells us the fraction of the population that has been tested.

Here is a figure that helps visualize the difference between these concepts:


The large light blue circle is the entire population, the middle blue circle is the number of people tested, and the darkest blue circle is the number of people testing positive. Each of these circles is a known quantity. Population is known from the census, the middle circle is the number of tests conducted by clinicians and health officials, and the smallest circle is the result of these tests (assuming that the tests are themselves accurate, which I think is generally accepted).

The last measure, CFR, is the ratio of all deaths linked to the coronavirus (the purple circle) to the total number of positive cases. In many ways, this is the most important number of all. If CFR were 0.1%, this coronavirus would not have gotten anywhere near the attention it has received; it is the fact that CFR appears to be so high (greater than 1% by most common estimates) that causes so much concern.

With the exception of PT, which is probably fairly accurate, what the rest of these measures represent is complicated by the fact that the method of selecting people for testing is not random. If people were randomly selected for testing, then PPT would be a great estimator of the true proportion of the population infected. If we saw changes in this figure over time, we could be confident that this change reflected a change in the level of infection in the population. Furthermore, we could fairly accurately estimate the CFR as well–we’d simply divide the deaths in the sample by the total number of positive cases.

As it turns out, the decision to test people is not based on a random selection of the population. For practical reasons, the test is administered to a subset of the population that meets some pre-test criteria for testing. Although the rules vary by jurisdiction, in general, testing appears to be increasingly focussed on high risk/vulnerable populations and health care workers. Earlier in the outbreak, tests were targeted at people with a high pre-test probability of infection (such as travellers to high risk countries with symptoms). Now, the testing decisions may have changed. In some jurisdictions, people who have travelled and have symptoms are simply told to self-isolate untested–under the assumption that they probably are an infected case. In other jurisdictions, the testing is becoming more widespread.

Lots of data!

There are mountains of data available on coronavirus, and seemingly thousands of data scientists creating beautiful maps and graphs online. I’ve made a few of my own (though, maybe they’re not that beautiful). Given that the process for selecting people for testing is not random, what are we to make of the data that underlie all this analysis? Are all these nice maps and graphs useful? Are the numbers right?

This is where things get a bit tricky. If the pre-test screening is very accurate at identifying cases (specifically, includes most or all infected people who will test positive and very few uninfected) then PPT is a poor measure of infection. In fact, it will drastically overestimate levels of infection.

However, if we knew this was the case, we could use PPTP as an estimate of infection rate–since the number of screened cases would be close to the real number of cases. Of course, if the screening process were that accurate, then we wouldn’t need the laboratory confirmed test in the first place. The unfortunate reality is that the screening process is effective at identifying some likely cases, but misses many others, and also includes many false positives (in fact, the vast majority are false positives).
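The effect of imperfect screening on PPT and PPTP can be sketched with some expected-value arithmetic. Everything here is an assumption for illustration: the function name `screened_testing`, the 1% true prevalence, the testing probabilities, and a lab test that is perfectly accurate.

```python
def screened_testing(population, prevalence, p_test_infected, p_test_uninfected):
    """Expected PPT and PPTP when screening decides who is tested.

    Assumes a perfectly accurate lab test, so every tested infected
    person is positive and every tested uninfected person is negative.
    """
    infected = population * prevalence
    uninfected = population - infected
    positives = infected * p_test_infected       # infected people who get tested
    negatives = uninfected * p_test_uninfected   # uninfected people who get tested
    tests = positives + negatives
    return {"PPT": positives / tests, "PPTP": positives / population}


true_prevalence = 0.01  # hypothetical: 1% of the population is infected

# Hypothetical screening: 30% of infected and 3% of uninfected people get tested.
screened = screened_testing(1_000_000, true_prevalence, 0.30, 0.03)

# Random selection: everyone has the same 3% chance of being tested.
random_like = screened_testing(1_000_000, true_prevalence, 0.03, 0.03)

print(f"true prevalence:     {true_prevalence:.2%}")
print(f"PPT, screened:       {screened['PPT']:.2%}")
print(f"PPTP, screened:      {screened['PPTP']:.2%}")
print(f"PPT, random testing: {random_like['PPT']:.2%}")
```

With these made-up inputs, most people who are screened in turn out to be uninfected, PPT lands well above the true prevalence, and PPTP lands well below it. Only the random-selection case recovers the true figure.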

The decision about who to test and not test is influenced by many factors, and unsurprisingly, testing frequency varies considerably around the world. In the US, testing (as measured by the proportion of the population tested, or PT) is lower than in many other countries. Based on data I have found online, it’s around 2.3 per 10,000 people at present. In Canada, PT is around 13.5 per 10,000 people. In South Korea, the number is around 60 per 10,000, and perhaps more. I can’t find any firm data on how many tests have been conducted in Germany, but apparently it’s fewer than in South Korea and more than in most countries in Europe.

However, as I’ve hopefully made clear above, the number of tests does not necessarily influence the accuracy of our estimates of the proportion of the population infected. It does affect precision, and the ability to drill down into details–more tests mean increasingly local estimates of infection rates are possible. What matters more is the process or protocol for choosing who gets laboratory tests. The more random the process for selecting people, the more likely PPT can be used to accurately estimate the current proportion of infected population. The less random, the more uncertain we become.
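The difference between accuracy and precision can be illustrated with the usual margin-of-error formula for a proportion. This is a sketch that only applies to a simple random sample; it uses the normal approximation, which is rough when the proportion is small, and the 1% figure and sample sizes are hypothetical.

```python
import math


def margin_of_error(p_hat, n, z=1.96):
    """Approximate 95% margin of error for a proportion from a simple random sample."""
    return z * math.sqrt(p_hat * (1 - p_hat) / n)


# Hypothetical: 1% of a random sample tests positive, at three sample sizes.
for n in (1_000, 10_000, 100_000):
    moe = margin_of_error(0.01, n)
    print(f"n = {n:>7,}: 1.00% +/- {moe:.2%}")
```

A hundred times more tests shrinks the margin of error tenfold, but none of this helps if the people tested were not chosen at random: the estimate just becomes a more precise measure of the wrong thing.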

So what are we left with? Well, implicitly, people seem to be using the ratio of people testing positive to total population (PPTP) as an indicator of the level of infection. This would be fine if we knew that the testing process captured every case, but we don’t know that. In fact, we can be pretty sure that many cases will go undetected, and that current and future case counts will be too low.


People (including me) are excitedly making maps, and sharing all sorts of data on coronavirus infections. I have seen many very beautiful interactive online tools of infection counts that are fun to explore. As amusing as these are to play with, I am not sure many have been useful. The number of cases in a region is a product of the level of infection, the population size, the number of people tested and the process for selecting people for testing. At present, the impact of a non-random testing selection process leaves us uncertain of what the risk actually is, pretty well everywhere.

This means that the level of infection globally has enormous uncertainty to it–no surprise there, really. This is even more true in our specific communities. This is not just because some people are yet to be tested because they do not show symptoms, but because they may never be tested based on the screening process. Moreover, this process may vary from place to place, and even over time, so it will be hard to make anything but broad and general comparisons.

This uncertainty probably amounts to an underestimate of the true number of cases. It could be a small underestimate or a large one. It’s an underestimate because the test selection process in most jurisdictions is biased towards people who are vulnerable, have serious symptoms and/or have travelled; asymptomatic infections and infections among non-travellers are probably being missed.

The good news is that if we are indeed under-counting the number of Covid-19 cases, then we are over-estimating CFR. The case fatality rates in Germany and South Korea, where there is more widespread testing, are less than 1%. However, even these countries are still not randomly selecting people for testing, and may not be testing enough to use PPTP as an estimator of the infection level. As a result, it’s still very possible that cases are being under-counted in these regions, and that the true CFR in South Korea and Germany is less than 0.5%, or even less than 0.25%.
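The arithmetic behind this is simple enough to sketch. The function `adjusted_cfr` and the 1% starting value are hypothetical, and the sketch assumes that every death is counted while only some fraction of infections shows up as a positive test.

```python
def adjusted_cfr(observed_cfr, detected_fraction):
    """CFR after rescaling the case count for undetected infections.

    Assumes every death is counted, and only a `detected_fraction`
    of all infections shows up as a positive test.
    """
    if not 0 < detected_fraction <= 1:
        raise ValueError("detected_fraction must be in (0, 1]")
    return observed_cfr * detected_fraction


observed = 0.01  # hypothetical observed CFR of 1%

for detected in (1.0, 0.5, 0.25):
    print(f"{detected:.0%} of infections detected -> CFR {adjusted_cfr(observed, detected):.2%}")
```

Under these assumptions, an observed CFR of 1% corresponds to 0.5% if only half of infections are detected, and 0.25% if only a quarter are. If deaths are also being missed, the adjustment would be smaller.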

None of this changes the real impact of the coronavirus so far–thousands have died worldwide, and these deaths are tragic. Taking aggressive action to curtail the infection, even if the CFR is 0.5% or 0.25%, can still be justified on public health and ethical grounds. Moreover, the collapse of health care systems remains a real threat, and can cause knock-on effects, including deaths from other treatable conditions that go untreated because of health care system failures.


If accurate estimates of the proportion of the population affected are important to us, there are possible solutions. For one, there may be some re-sampling options for selecting quasi-random samples from the tested population. To do this would require information about the test subjects, and coordination between testing facilities. But it’s possible some re-sampling process could construct a synthetic ‘sample’ that is more generally representative of the population, and would give a better sense of the underlying proportion of persons infected.

There may also be some post-stratification options. This would involve weighting the tested populations so that under-represented observations are given greater weights, and over-represented observations are given smaller weights. I am not sure if this is possible, but I assume that someone is looking into it.
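Here is a rough sketch of what post-stratification weighting might look like. The two strata, their counts, and the function name `poststratified_rate` are all invented for illustration. The approach also assumes that testing within each stratum is close to random, which is exactly the thing we do not know.

```python
def poststratified_rate(strata):
    """Weight each stratum's positive-test rate by its share of the population."""
    total_population = sum(s["population"] for s in strata)
    estimate = 0.0
    for s in strata:
        weight = s["population"] / total_population  # population share of the stratum
        rate = s["positives"] / s["tested"]           # PPT within the stratum
        estimate += weight * rate
    return estimate


# Hypothetical strata: testing is concentrated on a small high-risk group.
strata = [
    {"name": "high risk", "population": 100_000, "tested": 4_000, "positives": 400},
    {"name": "everyone else", "population": 900_000, "tested": 1_000, "positives": 10},
]

raw_ppt = sum(s["positives"] for s in strata) / sum(s["tested"] for s in strata)
weighted = poststratified_rate(strata)

print(f"raw PPT:         {raw_ppt:.2%}")
print(f"post-stratified: {weighted:.2%}")
```

In this made-up example the raw PPT is 8.2%, because most tests went to the high-risk group, while the weighted estimate is 1.9%. The heavily tested group gets a small weight and the lightly tested majority gets a large one.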

More testing could help, particularly if it reaches a broad cross-section of the population. Low testing in some places around the world is almost certainly causing problems–in some cases, a false sense of security that could lead to more infection and more death when health care systems get hit with a spike of cases.

Random sampling of the population would solve the problem lickety-split, but that’s probably not going to happen. It would be expensive, particularly if it targeted regions or local areas. Moreover, how many people would subject themselves to a random coronavirus test by a government official knocking on the door? If the refusal rate were high, the refusals would end up biasing the data again.


Testing for coronavirus is important. It can be used for tracing the origin of cases, identifying people for isolation or quarantine, and determining whether or not the infection is present at all in a population. More testing has value, and as tests get easier and more widespread, the information will improve. Indeed, cheaper and easier testing (like home testing kits) could even be the key to getting control of the pandemic.

However, until testing becomes more representative (or we learn that the existing testing sample is already pretty representative), we should all be wary of much of the data we see and use. The current counts could be close to the mark, could be a small under-estimate, or even a large under-estimate. If they are an under-estimate, it also means that current estimates of the case fatality rate could be greatly inflated.

Post script note: the morning I published this post, I read a post by John Ioannidis (published on March 17th) that states similar concerns to the ones I express above. Although I don’t draw the exact same conclusions, I think he raises some important questions, and we both agree that random testing for SARS-CoV-2 in the population could be very useful.

Coronavirus epi curves

I did some analysis of epidemiology curves for coronavirus. This particular curve plots out the cumulative proportion of cases over time for a number of countries:

[Figure: coronavirus epi curves, cumulative proportion of cases over time by country]

Each point on the line is a proportion of the total, which is why they all touch at the far right; all countries are at their daily cumulative maximum (1.0) as of March 16th.
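For readers who want to see how this kind of scaling works, here is a small illustrative sketch in Python. It is not the code used to make the graph, and the daily counts are invented for two made-up countries.

```python
def cumulative_proportion(daily_new_cases):
    """Convert daily new case counts into a cumulative curve scaled to end at 1.0."""
    cumulative = []
    running_total = 0
    for new_cases in daily_new_cases:
        running_total += new_cases
        cumulative.append(running_total)
    latest_total = cumulative[-1]
    return [count / latest_total for count in cumulative]


# Hypothetical daily counts for two made-up countries, not real data.
early_outbreak = [50, 40, 5, 3, 2]   # most cases arrive early, then the curve flattens
late_outbreak = [1, 2, 7, 30, 60]    # most cases arrive in the last few days

print(cumulative_proportion(early_outbreak))
print(cumulative_proportion(late_outbreak))
```

Because each curve is divided by its own latest total, the scaling removes differences in outbreak size between countries and leaves only differences in timing and shape.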

The graphs differ across countries in two important ways. First, they are shifted in time. This shows something we already know–that China and south-east Asia got hit with the infection first, and Western Europe and North America more recently.

More interesting is the shape of the curves. Notice that the rate of increase has been flattening out for China for some time. South Korea is seeing a more recent flattening. Countries in Europe and North America are seeing a large increase now.

The most noteworthy line on this graph is Japan. Japan is seeing a slow and steady growth in cases, something that is typically not what infectious disease models predict. Usually growth, and often decline, tends to be nonlinear–a fast rise followed by a fast drop (and then a possible return with a lower amplitude). It’s hard to know what to make of this.

Is it because Japan is under-testing or under-reporting? Or is it that public health interventions were implemented very quickly and effectively in Japan? Only time will tell… Here’s the code for you to see for yourself.

Covid-19 update

Covid-19 top 25!

I have created a daily updating web page with infection rates for the countries in the top 25 of total infections diagnosed, as well as Canadian provinces. Data are from Du and Gardner (see their Lancet publication here) but are ‘scraped’ automatically from their data on GitHub so that I don’t have to update it manually every day. However, this data source is not entirely up to date, so I am adding newer data sources over time, as well as doing some validation work, so bookmark it, sucka!

A halt to the NHL season

Professional sports leagues are shutting down. If the NHL cancels the season altogether, this means that fans of the Edmonton Oilers can confidently say that their team will not not make the playoffs this year! Go Oilers!

The Coronavirus blues

[Figure: Coronavirus blues]

On Losers

What is a loser?

In the traditional sense of the word, a loser is simply someone that has not won. This is a descriptive and sometimes useful definition, but of course ‘loser’ is often used to imply a pathology of failure–someone who never wins, who can’t succeed at anything, and that we as a society don’t value.

I am not satisfied with this definition, so I offer an alternative:

Loser: a person for whom success is uncorrelated or negatively correlated with demonstrable merit.

Using this definition, a person can be a loser in two ways:

  1. A person with great potential that is wasted.
  2. A person with great success due to something other than demonstrated merit.

Both 1 and 2 require a little explanation. First, consider the concept of wasted potential. To waste potential means failing to live up to what one could have done had they tried. Trying and failing doesn’t make someone a loser. A loser is anyone who doesn’t make use of the talents they have due to things like fear of failure, sloth, or sense of entitlement.

Second, consider the concept of merit. Merit is ability, skill or talent (inherited or not) that is relevant to success. A great hockey player that records a best selling album of mediocre country music is a loser in the country music domain. His success in hockey doesn’t demonstrate his merit in other areas. So the hockey player is a loser in one domain and not a loser in the other. For merit to be meaningful, it has to be linked to success.

Why should you accept my definition?

  1. Calling someone a ‘loser’ in the traditional sense is to ignore many important uncontrollable factors that contribute to failure and lack of success–like bad luck. People should not be held responsible for bad luck. Bad luck doesn’t make a person a loser.
  2. Undeserved success is economically inefficient. Merit-less success rewards people based on attributes that are not relevant to the creation of value. Labelling people with unmerited success as ‘losers’ is a way of knocking them off their roost (at least verbally), and perhaps making way for those more deserving of praise.
  3. Fear of being thought of as a ‘loser’ in the traditional sense may discourage risk-taking. The world benefits from some risk-taking; it doesn’t make sense to condemn people for trying and failing. Trying and failing is taking one for the team. By my definition, trying and failing doesn’t make someone a loser.
  4. If you accept my definition, you won’t know whether a person is a loser or not unless you get to know them, and compare their success against their merit. Nobody is a loser by default.

Here are some examples to help illustrate the meaning and usefulness of this term.

At least two of Donald Trump’s children are clearly losers (Jr. and Ivanka). As far as I am aware, they have not been tested in this world, yet they seem to wield great power and reap the rewards of financial success. They did not ‘make it’ in business or politics or anything in a way that demonstrates a competence commensurate with their positions. Losers.

Many elected officials are losers. People do not win elections by demonstrating an ability to craft useful legislation or make good decisions. They win through a mix of spectacle, popularity contests and back stabbing. The higher you go in political rank, the more likely you are a loser. So yah, Justin Trudeau is probably a loser. On the one hand, he did earn his seat in Montreal by hustling door-to-door, shaking hands, kissing babies and so on. On the other hand, as Prime Minister of Canada he is clearly the beneficiary of his father’s name and wealth. Sorry J.T., you’re a loser.

Dictators are always losers; their power is not commensurate with demonstrated abilities in governing. They are usually good at employing violence, bullying and other skills best suited for other kinds of occupations.

Big lottery winners are all losers. Nobody that wins the lottery wins because of merit, just dumb luck. Lottery losers are losers too (in the traditional sense), so it would seem that playing the lottery is for losers.

Athletes are pretty well never losers; sports are great at demonstrating merit. People don’t win at sports by luck alone; any sport that you can win based on luck alone isn’t a sport. People who try and lose at sports aren’t losers either. To paraphrase 90% of the high school gym teachers who ever lived: the only losers are those who don’t try!

On the other hand, super famous actors and musicians are all kind of losers. We may love them, but demonstrating merit in the acting/music industries is pretty tricky. I firmly believe that any rock ‘n’ roll orchestra that is filthy, insanely successful in money and fame should be viewed with suspicion. Sure, Nickelback has sold millions of records, but they are clearly losers.

Mark Zuckerberg, Bill Gates, Steve Jobs, Jeff Bezos: mostly losers. They all have demonstrated some merit, but it is pretty weakly correlated with their success. Check out a recent computer simulation that makes this point beautifully. If they were multi-millionaires, they may not have been losers. But as billionaires, losers they be.

Doctors make a lot of money, and are highly regarded culturally, but most aren’t losers. Doctors are under close professional scrutiny, and it takes a lot of work to enter the profession–most of them have to work their butts off to get into and out of medical school, especially these days. People who are bad at these jobs don’t keep them for very long. Doctors are not losers. Of course, most people in health care aren’t losers. They have tough jobs and neither their pay nor their cultural success is excessive. The same can be said about most school teachers.

However, many University professors are kind of losers. Universities are not good at getting rid of under-performers, and success (in terms of salary and reputation) usually goes up over time irrespective of merit. There is evidence that the research system as a whole is not always great at allocating success (see the same simulation above). I’ll fully acknowledge that I am probably at least 37.5% loser (+/- 20% 19 times out of 20).

Conclusions: we’re mostly not losers

Most regular folks with jobs, and/or families and/or that contribute to their communities are not losers. In fact, 80-90% of the human population are probably not losers. Your neighbour who keeps you up at night with his banjo music and pool parties? Sorry, he’s probably not a loser, just annoying.

Post script

My younger brother is in a hard working rock ‘n’ roll orchestra who slave away for the love of it, so obviously they are not losers. Here is the video from their most recent album, Loser Delusions: