Alcohol and motor-vehicle collisions: the dangers of inexperienced drinkers

In their paper “The contribution of alcohol to serious car crash injuries”, Connor et al. show a fairly strong association between alcohol consumption and the odds of getting into a serious car accident. Specifically, a person who consumes 2 or more drinks has 8 times the odds of getting into a serious-injury car crash in the following 6 hours compared to a person who has not consumed alcohol. This is pretty intuitive, and the study seems fairly well done.

But there’s something strange going on…

One dimension of the findings that is curious (though not really discussed by the authors of the study) is that being a frequent drinker (e.g., drinking 6-7 days a week) seems somewhat protective against the odds of getting into a serious-injury car crash. The following table from their paper is illustrative:


Compared to nondrinkers, persons who drink 6-7 days a week have 0.3 times the odds of experiencing a serious car crash injury. Because such crashes are rare, odds and probabilities are nearly equal, so this is roughly equivalent to saying that frequent drinkers have a ~70% reduction in the probability of getting into a serious car accident after controlling for demographic factors and the consumption of alcohol in the last 6 hours.

The real question is what is the net effect of drinking in the last 6 hours on frequent drinkers and non-drinkers.  I can’t estimate this precisely based on the information presented in the study, but I can get a ballpark figure.

First, consider a simplified equation for the model they used:

log(p/(1-p)) = B0 + B1x1 + B2x2

where B0 is the intercept, B1 is the coefficient for consuming alcohol in the last 6 hours and B2 is the coefficient for being a frequent drinker.  The variables x1 and x2 are dichotomous (1 or 0), indicating whether or not a person consumed alcohol in the last 6 hours (x1) and whether or not a person is a frequent drinker (x2).

I will estimate B0 as 0.001, and the values of B1 and B2 can be found by transforming the odds ratios in the table (taking their natural logarithms). Using these inputs, I can derive model-predicted log odds, convert these into probabilities, and then plot them on a graph:

[Figure: alcohol]

This graph says two things.  First, drinking increases the risk of car accidents, whether or not you are a frequent drinker.  Second (and perhaps more controversially), frequent drinkers who drink are at relatively lower risk of getting injured in a serious car accident when compared to non-drinkers who drink.  Of course, since non-drinkers drink infrequently (or perhaps never if their name means anything), the total public health impact of frequent drinkers is probably greater than non-drinkers.  Nonetheless, these results could suggest that drinking and driving may be a particular concern among persons who are inexperienced as drinkers–perhaps due to lower alcohol tolerance, or less experience in judging their level of impairment.
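
The ballpark numbers behind this comparison can be sketched in a few lines of R. This is only an illustration under stated assumptions: the odds ratios of 8 and 0.3 are the ones quoted above, and the 0.001 is read here as a baseline probability (so the intercept is its logit), which may not be exactly how the original graph was produced. The variable names are made up for the example.

```r
# Odds ratios quoted in the text (from Connor et al.)
or_recent   <- 8    # 2+ drinks in the last 6 hours vs. none
or_frequent <- 0.3  # drinks 6-7 days a week vs. non-drinker

# Assumption: 0.001 is a baseline crash probability, so B0 is its logit
b0 <- qlogis(0.001)
b1 <- log(or_recent)    # coefficient = log of the odds ratio
b2 <- log(or_frequent)

# All four combinations of x1 (drank recently) and x2 (frequent drinker)
pred <- expand.grid(x1 = c(0, 1), x2 = c(0, 1))
pred$log_odds <- b0 + b1 * pred$x1 + b2 * pred$x2
pred$p <- plogis(pred$log_odds)   # inverse logit: log-odds -> probability
print(pred)
```

Because this simplified model has no interaction term, the odds ratio for frequent drinking is the same whether or not alcohol was consumed in the last 6 hours; only the absolute probabilities differ.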

There are a number of important caveats to consider here–including the fact that I do not have access to the original data. However, the results are not entirely unbelievable either; it seems plausible that, all else being equal, a frequent drinker may be less impaired by a small quantity of alcohol than an inexperienced drinker. This doesn’t mean drinking alcohol and operating a motor vehicle is safe–it clearly isn’t–but just that the risks are complex.


Connor, Jennie, et al. “The contribution of alcohol to serious car crash injuries.” Epidemiology 15.3 (2004): 337-344.

The 2 principles of ‘N principles’

I was watching a documentary recently in which the content was organized into ’10 principles’. I am not sure if those principles came from the producers or the subject of the documentary, but it led me to ask whether this number ’10’ was meaningful, or just a number selected with little to no correspondence to the actual number of principles on the topic. We see ‘N principles’ all the time, such as the “three principles of cell theory”, the “five principles of humanity” and the “eight principles of pilates”, but is the number of principles really meaningful?

If everyone who made up such ‘principles’ did so purely guided by the actual number of principles defined by the problem they are trying to categorize, we might expect that the frequency distribution of principles is completely uniform–in other words, there should be as many sets of ‘4 principles’ as ‘343,212 principles’.  This is because the number of available subjects to categorize in the universe is infinite, and the categorization process is probably complex enough to defy any universal rule or ‘principle’ of principle making.

However, in practice, we humans prefer simplicity (that’s why we come up with principles in the first place!). Our brains are finite, and we like to reduce the complexities of the world into as few variables, parameters and categories as possible. So in fact the frequency of ‘N principles’ is most likely inversely proportional to N:

[Figure: theoretical graph of principle frequencies]
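
For the curious, a curve of this shape can be drawn with a minimal R sketch. It simply assumes that frequency is proportional to 1/N over the range 2 to 20 (the range used in the search below); the normalization to relative frequencies is my own choice for the illustration.

```r
# Hypothetical model: frequency of 'N principles' proportional to 1/N
n <- 2:20
rel_freq <- (1 / n) / sum(1 / n)   # normalize so the frequencies sum to 1

plot(n, rel_freq, type = "b",
     xlab = "N", ylab = "Relative frequency of 'N principles'")
```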

Now, is this true in practice? To find out I used the Google search engine to return search results on the phrase “The N principles”, where N ranges from 2 to 20 (in both numeric and written form). The number of search results is a rough proxy for the amount of content on the internet for a given search term–larger values mean more content for a given value of N. Here is a plot of the log of the search results by N:

[Figure: search results by principles]

This seems roughly consistent with the general idea that we prefer fewer principles over many, but note that there are anomalies. The numbers 7, 10, 12, 17 and 20 all seem to be over-represented. It is possible that these findings are partly the result of some outlier content. For example, it seems very likely that many of the “7 principles” search results come from one book, “The Seven Principles for Making Marriage Work”. The same seems to be true for the “17 principles”. However, I wonder if the 10, 12 and 20 principle anomalies may reflect an inclination in favour of certain numbers because they are more memorable, or seem more authoritative in some way. If you identify ‘9 principles’ in a system, maybe you would split one of the principles into two to get a nice even ’10 principles’. This seems feasible given that on the graph ‘9 principles’ seems to be an anomaly in the other direction (less content than expected).

Naturally one might ask whether or not this model is useful. Well, it probably isn’t. Nevertheless, I used Google Trends to explore whether or not Google searches of ‘N principles’ correlate with the volume of content on the internet as seen in the figure above. This allows me to identify any gaps between what people are searching for and what is actually available on the internet. Here is the graph:

[Figure: searches by principles]

The data are from Google Trends based on Google searches from 2004 to 2016. Searches for ‘N principles’ with N greater than 14 did not result in enough data for Google Trends to estimate relative search frequency. However, we can see that 5, 7, 8, 10, 12 and 14 principles seem the most popular. Of these, 8 and 14 seem the most interesting since neither was an outlier in the web content search above. It’s also interesting how infrequently people search for ‘2 principles’. Perhaps this means that ‘2 principles’ is an oversimplification much of the time?


I have two conclusions (aka principles):

1.  People who categorize problems/systems/practices into ‘principles’ seem to overly favour numbers like 10 and 12. As such, there are probably times when the actual number of useful principles is more or less than that. This could be handy to know when someone is telling you about the ’10 principles for training a cat’ at a dinner party (‘excuse me, but are you sure there aren’t really nine?’).

2. If you want to capitalize on the gap between supply (the number of sets of principles on the internet) and demand (the number of sets of principles searched for) then you should pick 8 or 14 principles, and whatever you do, don’t pick two principles.

Logistic and linear regression

Logistic regression is frequently used to estimate the parameters of a model when a dependent variable is dichotomous (e.g., yes/no or 1/0 or case/control).  It is perhaps most often used in health research since the outcomes of most common concern–disease and death–can be classified reasonably well into a yes/no category.  Logistic regression is preferred to linear regression because while the dependent variable is still predicted by a linear combination of independent variables and their coefficients, the predictions from linear regression are not constrained to the limits of the dichotomous classes.  So in some cases, a linear model can make predictions that are invalid (such as negative probabilities or probabilities above 1). Logistic regression doesn’t have this problem.
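
To make the out-of-range problem concrete, here is a small simulated example in R. The data are entirely made up (the intercept of -2 and slope of 2 are arbitrary choices for illustration), so treat it as a sketch rather than a general result.

```r
set.seed(1)

# Simulated data: a dichotomous outcome driven by one predictor
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-2 + 2 * x))   # hypothetical true model

# Fitted values from a linear model and from a logistic model
lm_fit  <- fitted(lm(y ~ x))
glm_fit <- fitted(glm(y ~ x, family = binomial))

range(lm_fit)    # linear predictions are not bounded by 0 and 1
range(glm_fit)   # logistic predictions always lie between 0 and 1
```

With a predictor this strong, the linear fit produces negative "probabilities" for low values of x, while the logistic fit cannot.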

Many years ago a statistical mentor of mine told me that linear regression is usually sufficient in practice, and that logistic regression is often an unnecessary bother. In fact, he said he never came across an instance where he would interpret his data differently based on whether he used linear or logistic regression to estimate the parameters of his model. Furthermore, the linear regression coefficient is easier to interpret; it is simply the change in the dependent variable (here, the probability of the outcome) given a 1 unit change in the independent variable. Logistic regression requires one to interpret a change in log-odds (or an odds ratio), neither of which is as intuitive.

To examine this idea, I wrote a little simulation in R to see how often the signs of regression coefficients change based on the precise modelling method (sorry for the loops…):

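# Compare the sign of the slope from logistic vs. linear regression
# across a range of true effect sizes (j), with a rare outcome.
results_list <- list()

for (j in seq(from = 0.01, to = 1, by = 0.01)) {
  for (i in 1:100) {
    X  <- rnorm(1000)
    mu <- -4.5 + j * X            # true log-odds
    p  <- 1 / (1 + exp(-mu))      # true probability
    y  <- rbinom(1000, 1, p)
    sim <- data.frame(y = y, X = X)
    glm_out <- coefficients(glm(y ~ X, data = sim, family = "binomial"))
    lm_out  <- coefficients(lm(y ~ X, data = sim))
    # One row per simulation: effect size, iteration, event count, both slopes
    results_list[[length(results_list) + 1]] <- c(
      j = j, iteration = i, events = sum(y),
      glm_slope = glm_out[["X"]], lm_slope = lm_out[["X"]]
    )
  }
}

df <- as.data.frame(do.call(rbind, results_list))
# 1 if both methods agree on the sign of the slope, 0 otherwise
df$match <- ifelse(df$glm_slope * df$lm_slope > 0, 1, 0)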

After half an hour of fiddling with the intercept and coefficient I could not create data that would result in systematically differently signed coefficients for the two parameter estimation methods. In other words, when one’s intent is simply to interpret the signs of coefficients, it doesn’t seem to matter a whole lot which model one uses.

The other (and perhaps more important) problem with using linear regression on these occasions is that the statistical tests associated with the linear model are not appropriate when the dependent variable is dichotomous (Hellevik 2009).  However, for those of us working with ‘Big Data’, statistical significance is often not our primary concern; we often have enough data to achieve ‘significance’ of almost anything.  What matters more is often effect size and elegance, which can benefit from the simplicity and parsimony of the linear regression model.
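
The point about ‘significance’ with large samples can be illustrated with a quick simulation in R. Everything here is hypothetical: a million observations and a deliberately tiny true effect (a log-odds slope of 0.03) chosen just for the example. Note too that the p-value comes from the linear model, whose tests are the very ones questioned above, so it is shown only to illustrate scale.

```r
set.seed(1)

# Hypothetical 'Big Data' setting: one million rows, very weak effect
n <- 1e6
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-2 + 0.03 * x))

fit <- summary(lm(y ~ x))
slope   <- fit$coefficients["x", "Estimate"]   # change in P(y = 1) per unit x
p_value <- fit$coefficients["x", "Pr(>|t|)"]

slope     # well under one percentage point per unit of x
p_value   # nevertheless far below conventional thresholds
```

The effect is practically negligible yet ‘significant’, which is why the size of the coefficient tends to be the more useful thing to look at in this setting.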

I generally follow the conventional wisdom and do what statisticians recommend, so I will probably still use logistic regression when the dependent variable demands it. However, it is good to know that the reliable old linear model is very often good enough, and in that way, can remain a workhorse of multivariate analysis…

For more reading on the subject, I highly recommend

Hellevik O (2009) Linear versus logistic regression when the dependent variable is a dichotomy. Quality and Quantity 43:59-74.

Data repository

I am going to be posting geographical and other data on the site fairly regularly, and have created a page for doing so (look at the ‘Open data sets’ menu above, or click here to go there directly). The data are free to use under the Creative Commons license. Most will be data that someone else made and that I have repackaged and reformatted, but in some cases they will be my own.