Logistic and linear regression

Logistic regression is frequently used to estimate the parameters of a model when the dependent variable is dichotomous (e.g., yes/no, 1/0, or case/control). It is perhaps most often used in health research, since the outcomes of most common concern (disease and death) can be classified reasonably well into a yes/no category. Logistic regression is preferred to linear regression because, although both models predict the outcome from a linear combination of the independent variables and their coefficients, the predictions from linear regression are not constrained to the limits of the dichotomous classes. So in some cases a linear model can make invalid predictions, such as negative probabilities or probabilities above 1; logistic regression doesn’t have this problem.
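
As a quick illustration of that last point, here is a minimal sketch (not part of the original analysis; the names x, y, lm_fit, and glm_fit are my own) that simulates a rare binary outcome and fits both models:

# Minimal sketch: compare fitted values from a linear vs. a logistic model
set.seed(1)
x <- rnorm(500)
p <- 1 / (1 + exp(-(-3 + 2 * x)))   # true probabilities from a logistic model
y <- rbinom(500, 1, p)              # dichotomous outcome

lm_fit  <- lm(y ~ x)                        # linear probability model
glm_fit <- glm(y ~ x, family = "binomial")  # logistic regression

range(fitted(lm_fit))    # can fall below 0 (or, for other data, rise above 1)
range(fitted(glm_fit))   # always stays strictly between 0 and 1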

Many years ago a statistical mentor of mine told me that linear regression is usually sufficient in practice, and that logistic regression is often an unnecessary bother.  In fact, he said he had never come across an instance where he would have interpreted his data differently depending on whether he used linear or logistic regression to estimate the parameters of his model.  Furthermore, the linear regression coefficient is easier to interpret: it is simply the change in the dependent variable (here, the predicted probability) associated with a one-unit change in the independent variable.  Logistic regression requires one to interpret a change in log-odds (or an odds ratio), neither of which is as intuitive.
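
To make the interpretive difference concrete, here is a short sketch (my own addition, reusing the lm_fit and glm_fit objects from the example above):

coef(lm_fit)["x"]         # change in the predicted probability per 1-unit change in x
coef(glm_fit)["x"]        # change in the log-odds per 1-unit change in x
exp(coef(glm_fit)["x"])   # the corresponding odds ratio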

To examine this idea, I wrote a little simulation in R to see how often the sign of the estimated coefficient differs between the two modelling methods (sorry for the loops…):

set.seed(0)
results_final <- NULL

# For each true slope j, simulate 100 data sets (n = 1000) with a rare binary
# outcome, fit both a logistic and a linear model, and keep the two slope estimates.
for (j in seq(from = 0.01, to = 1, by = 0.01)) {
  for (i in 1:100) {
    X  <- rnorm(1000)
    mu <- -4.5 + j * X          # linear predictor; the low intercept keeps events rare
    p  <- 1 / (1 + exp(-mu))    # true probabilities
    y  <- rbinom(1000, 1, p)    # dichotomous outcome

    sim <- data.frame(y, X)
    glm_coef <- coefficients(glm(y ~ X, data = sim, family = "binomial"))[[2]]
    lm_coef  <- coefficients(lm(y ~ X, data = sim))[[2]]

    results_final <- rbind(results_final,
                           data.frame(j, iteration = i, events = sum(sim$y),
                                      glm_coef, lm_coef))
  }
}

df <- results_final
df$match <- ifelse(df$glm_coef * df$lm_coef > 0, 1, 0)  # 1 when the two slopes share a sign
summary(df$match)
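
A small optional extension (not in the original script) is to tabulate the agreement rate by effect size, which makes it easy to see whether sign disagreement concentrates at small values of j:

# Proportion of simulations in which the two slopes share a sign, by true slope j
agreement_by_j <- aggregate(match ~ j, data = df, FUN = mean)
head(agreement_by_j)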

After half an hour of fiddling with the intercept and the coefficient, I could not create data that resulted in systematically different coefficient signs for the two parameter estimation methods.  In other words, when one’s intent is simply to interpret the signs of the coefficients, it doesn’t seem to matter a whole lot which model one uses.

The other (and perhaps more important) problem with using linear regression in these situations is that the statistical tests associated with the linear model are not appropriate when the dependent variable is dichotomous, since the errors are then neither normally distributed nor constant in variance (Hellevik 2009).  However, for those of us working with ‘Big Data’, statistical significance is often not our primary concern; we usually have enough data to make almost anything ‘significant’.  What often matters more is effect size and elegance, which benefit from the simplicity and parsimony of the linear regression model.
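
As a rough illustration of why the usual OLS inference is questionable here (my own sketch, reusing the lm_fit object from the first example): with a binary outcome, the residuals at any given fitted value can only take two values, so they are neither normally distributed nor constant in variance.

# Residual diagnostics for the linear probability model fitted earlier
plot(fitted(lm_fit), residuals(lm_fit))   # residuals fall into two bands (y = 0 vs. y = 1),
                                          # and their spread changes with the fitted value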

I generally follow the conventional wisdom and do what statisticians recommend, so I will probably still use logistic regression when the dependent variable demands it. However, it is good to know that the reliable old linear model is very often good enough, and in that way, can remain a workhorse of multivariate analysis…

For more reading on the subject, I highly recommend:

Hellevik O (2009) Linear versus logistic regression when the dependent variable is a dichotomy. Quality & Quantity 43:59-74.