Take a look at virtually any paper in the social sciences (including epidemiology and business) that uses regression modelling in its analysis and you are likely to find, somewhere, a measure of ‘model fit’. The most common measure is the R² (or the similar ‘adjusted R²’ and ‘pseudo R²’). The technical differences between these matter, but I won’t discuss them here.
The R² is commonly interpreted as indicating the proportion of variance of a variable that is explained by a model comprised of one or more other variables. The highest possible R² is 1, and the lowest is 0; the former says the model explains all variation, and the latter says the model explains no variation.
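For reference, here is the usual textbook definition for a least-squares regression. The notation is my own shorthand: y_i are the observed values, \hat{y}_i are the values the model produces for the same observations, and \bar{y} is the mean of the observed values.

```latex
R^2 \;=\; 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2}
```

The numerator is the variation the model leaves unexplained, and the denominator is the total variation around the mean. The 0-to-1 range holds when the model includes an intercept and the R² is computed on the same data used to fit it.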
As an example, let’s consider the relationship between infant mortality and the human development index (HDI). Here is a scatter plot of this relationship (with infant mortality on the log scale). Each dot is a country plotted on these two axes.
If we wanted to quantify this visual relationship in mathematical form, we use a regression model. Using the data in the graph and a trivial bit of math we can find the equation of a line that approximates the visual pattern in these data:
Log(Infant mortality rate) ~ 8 – 7.5*HDI
and a measure of how well that line fits the data using the R² (0.60). This would typically be interpreted to mean that HDI explains 60% of the variation in the log of infant mortality between countries. Assuming the model is generalizable to the future, and if I knew the future trajectory of HDI for countries in the world, I could use this model to make a prediction about where the future country-specific infant mortality rates will be.
It is important to understand that these kinds of models are not usually used to explore relationships as an historical curiosity, but are used to claim an understanding of relationships at a more fundamental level, and at least implicitly, about what we should expect in the future. In this example, I wouldn’t care about the relationship between HDI and infant mortality in the past if I knew that the relationship between these measures was entirely different today. When we see these graphs (and the corresponding numerical models that summarize them), we are assuming that the relationships have some persistence and meaning now and in the future. Accordingly, the value of this model, and indeed of any model, is proportional to how well it can predict the future. A good model informs me about the future; a bad model does not.
Numerical models are used by engineers, applied physicists and other scientists all the time. They can use data to come up with equations that describe how a change in a series of inputs (e.g., weather conditions, speed, vehicle weight) changes some output (e.g., motor-vehicle braking distance) with a high degree of regularity. Models work in these fields because the fundamental laws are fairly stable: at human scales of observation, the laws of physics do not seem to change over time, so a model that predicts the behaviour of a physical system accurately today is usually just as good at predicting that system in the future.
The problem with regression in the social sciences is that the human systems we study are not understandable using the same persistent laws observed in the applied physical sciences. To be clear, I am not suggesting that humans aren’t governed by physical laws, but simply that our understanding of the physical laws concerning humans is very incomplete. In short, the vagaries of human behaviour make models involving people, particularly at the level of individuals, less useful than most models used in the natural sciences.
I don’t think much of the above is controversial, and in fact, much has been written about the limits of numerical models in the social sciences. The specific problem that concerns me is how R² in particular is often misused to over-state the value of models in the social sciences by failing to account for the fundamental limitations of social science research in general.
R² is usually derived endogenously. This means that it measures the fit of a model to the data that were used to come up with the model. So the R² does not tell me anything about the ability of a model to actually predict anything in the future (or even in another context or setting), but simply how well a model ‘predicts’ the data used to generate the model. The smaller the ‘prediction’ error, the larger the R², and the better the model fit. This endogenous measure of fit is probably appropriate if the fundamental nature of the relationships described by the model does not change over time, which again is probably true of many natural sciences. However, for models of human systems, which are both temporally and geographically dynamic, it is almost always going to be an over-estimate of a model’s actual predictive value.
So in the example above, while the R² is around 0.60, the actual information the model is telling us about this relationship is almost certainly less than what the R² implies. If I wanted to predict infant mortality in 2017 based on future HDI predictions combined with the model above, the predictions would almost certainly contain more error than the error associated with the same model ‘predicting’ data in 2013. Perhaps even considerably more error.
The solution to this problem is not a rejection of quantitative social science. Numerical models may still have value in the social sciences, but simply need to be interpreted more honestly.
One solution is to use exogenous data to measure the fit of the model. This is typically done in data mining fields where a ‘test’ data set is used to measure the accuracy of a model built from a ‘training’ data set. The specific process involves finding data from other contexts (at future time points, and perhaps at different places) and testing the ability of the model to predict these data. This does not fully address the problem, since the predictions might still be wrong at some other time in the future or in other contexts, but it is a more honest representation of a model’s value.
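To make the training/test idea concrete, here is a minimal sketch in Python. It uses made-up data, not the HDI and infant mortality data from the example above. The 'training' period follows a line similar to the one fitted earlier, while the 'later' period follows a different line that I have simply assumed in order to mimic a relationship that drifts over time. The helper name r_squared, the noise level and the drift are all illustrative assumptions.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Return 1 - SS_res / SS_tot, with SS_tot taken around the mean of y_true."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Made-up 'training' period: the outcome falls linearly with x, plus noise.
x_train = rng.uniform(0.3, 0.95, size=150)
y_train = 8.0 - 7.5 * x_train + rng.normal(0.0, 1.0, size=150)

# Made-up 'later' period: same noise level, but the relationship has drifted
# (the new intercept and slope are assumptions chosen only for illustration).
x_test = rng.uniform(0.3, 0.95, size=150)
y_test = 7.0 - 5.0 * x_test + rng.normal(0.0, 1.0, size=150)

# Fit a straight line using the training data only.
slope, intercept = np.polyfit(x_train, y_train, deg=1)

# Endogenous fit: the model 'predicts' the data it was built from.
r2_in = r_squared(y_train, intercept + slope * x_train)

# Exogenous fit: the same fitted line is asked to predict the later data.
r2_out = r_squared(y_test, intercept + slope * x_test)

print(f"R² on the training data: {r2_in:.2f}")
print(f"R² on the later data:    {r2_out:.2f}")
```

In this sketch the R² on the later data comes out lower than the R² on the training data, because the fitted line no longer describes the relationship that generated the later data. Note that an R² computed on exogenous data is not bounded below by 0: it goes negative whenever the model predicts worse than simply using the mean of the new data.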
Another option is to not report R². This is becoming my favoured approach. It’s not to say that reporting R² has no value, but it’s too easy to abuse, and it is of questionable value in most applications in the social sciences. Not only are human systems complex and hard to predict, we don’t even fully understand why they are so hard to predict. So we probably can’t make informed corrections to the R² to account for this unpredictability.
The R² reported in the social sciences rarely means what is implied in an absolute or relative sense; the R² tells us very little about how good a model is at predicting the future, and without prediction, the value of a model is very hard to judge. At best, the R² is an exaggeration of a model’s value, and at worst it is downright misleading. There may be some exceptions–I can’t claim to know all applications of numerical models in the social sciences–but my feeling is that the R² is of low marginal value, particularly in the era of big data. What’s better is for researchers to make predictions with their models, assess the predictions, and acknowledge the factors that could influence the predictive usefulness of these models in the future.