7. Linear Models
I’m lying, yes, but why do you force me to give a linear explanation; linear
explanations are almost always lies.
—Elena Ferrante
Often models posit specific functional relationships between variables.
That relationship could be linear, concave, convex, or S-shaped, or it could
include threshold effects. Of these, linear models are the simplest, the most
widely used, and the focus of this chapter. The effects of education on
income, of exercise on life expectancy, and of income on voter turnout can
all be measured using linear models.
We begin the chapter with a refresher on linear functions with a single
variable. We then show how regression fits data to a linear function,
revealing the sign, magnitude, and significance of effects. We also discuss
why errors, noise, and heterogeneity mean that data do not fall exactly on
the regression line. We then expand the linear model to allow for more
variables and discuss how to fit multivariable linear models. To build
intuition for multiple-variable models, we describe a model of success as a
linear function of skill and luck. The chapter concludes with an observation
of how relying on data and regressions to guide action limits mistakes but
can also produce marginal, conservative actions. This big-coefficient
thinking can stifle innovation. To identify more innovative options, we
might consider constructing other, more speculative models.
Linear Models
In a linear relationship, the amount of change in one variable due to a
change in a second variable does not depend on the value of the second
variable. If the height of a tree is linear with the tree’s age, the tree grows
by the same amount each year. If the value of a house increases linearly in
its square footage, a 200-square-foot addition increases a house’s value by
double that of a 100-square-foot addition. A 400-square-foot addition
increases the house’s value by four times as much.
Linear Models
In a linear model changes in the independent variable, x, result in linear
changes in a dependent variable, y, as follows:
y = mx + b
where m equals the slope of the line and b equals the intercept, the value of
the dependent variable when the independent variable equals zero.
A linear regression model finds the line that minimizes the total squared
vertical distance to the data points. Linear regression can explain variation in crime, washing
machine sales, and even wine prices.1 Suppose that we have data for adults
ranging in age from twenty to sixty and the distances they walk each week
and find the following regression equation:
Miles_{Person i} = -0.1 · age_i + 12 + ε_i
This regression equation tells us the sign of the effect (distance decreases
with age) and the magnitude of that effect (each year of age reduces
distance by one-tenth of a mile). In this example, the intercept has no
relevance because it lies outside our data range; that is, the data include no
one with an age near zero. Based on the equation, we expect a forty-year-old
to walk eight miles per week and a fifty-year-old to walk seven. The
data used to produce a regression will not fall exactly on the regression
line. Figure 7.1 shows hypothetical data used to produce our regression
line. The person represented by the gray circle, Bobbi, is age forty and
walks eleven miles per week. She exceeds the model’s estimate by three
miles. To make the data consistent with the model, the equation includes an
error term for each data point. The error term, denoted by ε, equals the
actual value of the dependent variable minus the value the model estimates.
Bobbi’s ε term equals +3 miles.
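The arithmetic in this example can be checked with a short sketch. The function name `predicted_miles` is an illustrative choice rather than anything from the text, and distances are taken to be miles per week, as in the regression equation.

```python
def predicted_miles(age):
    """Weekly walking distance estimated by the regression: -0.1 * age + 12."""
    return -0.1 * age + 12

# Estimates for the two ages discussed in the text.
estimate_40 = predicted_miles(40)   # about eight miles per week
estimate_50 = predicted_miles(50)   # about seven miles per week

# Bobbi is forty and actually walks eleven miles.
bobbi_actual = 11
# Error term: actual value minus the model's estimate.
bobbi_error = bobbi_actual - predicted_miles(40)

print(estimate_40, estimate_50, bobbi_error)
```

A positive error term means the person walks more than the line predicts; a negative one means less.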
In social and biological contexts, we do not expect perfect linear fits.
Outcomes depend on many variables, and a single-variable regression, by
definition, includes only one variable. Predicted values can deviate from
the actual values because of these omitted variables. Bobbi may walk more
than expected because, as a botany professor, she takes her students out for
walks in the woods. The model does not include profession as a variable,
which is one reason the data in figure 7.1 do not lie on the line. The ε term
could also result from measurement error. Fitness data collected by
smartphones will contain errors if people forget to carry their smartphone
or loan their phones to others. Error can also arise from environmental
noise—people may earn extra distance for bumpy car rides to work.2
Figure 7.1: A Scatterplot and Regression Line
The closer the regression line lies to the data, the more of the data the
model explains and the larger the model’s R-squared (the percentage of
variation explained). If all data lie exactly on the regression line, the
R-squared equals 100%. All else equal, we prefer models with higher
R-squared values.
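To make the fitting step and R-squared concrete, here is a minimal sketch in plain Python. It assumes the standard least-squares criterion (squared vertical distances), and the ages and distances are invented for illustration; they are not the data behind figure 7.1.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one variable: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(xs, ys, slope, intercept):
    """Share of the variation in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_resid = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return 1 - ss_resid / ss_total

# Hypothetical ages and weekly miles (invented for illustration).
ages = [20, 30, 40, 40, 50, 60]
miles = [10.5, 8.0, 11.0, 7.5, 6.0, 6.5]

m, b = fit_line(ages, miles)
print(round(m, 3), round(b, 3), round(r_squared(ages, miles, m, b), 3))

# Data lying exactly on a line give an R-squared of 1 (that is, 100%).
exact = [-0.1 * a + 12 for a in ages]
m2, b2 = fit_line(ages, exact)
print(round(r_squared(ages, exact, m2, b2), 6))
```

The noisy data produce an R-squared well below 1, while the second data set, built to sit on the line, is explained completely.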
Sign, Significance, and Magnitude
Linear regression tells us the following about coefficients of independent
variables:
Sign: The correlation, positive or negative, between the independent
variable and the dependent variable.
Significance (p-value): The probability of seeing a coefficient at least this
far from zero if the true coefficient were zero.
Magnitude: The best estimate of the coefficient of the independent
variable.
In a single-variable regression, the closer the fit to the line and the more
data, the more confidence we can place in the sign and magnitude of the
coefficient. Statisticians characterize the significance of a coefficient using
its p-value, which equals the probability of obtaining an estimate at least
this far from zero if the true coefficient were zero. A p-value of 5% means
that a process in which the coefficient equals zero would generate data like
these only one time in twenty. The
standard thresholds for significance are 5% (denoted by *) and 1%
(denoted by **). Significance is not all we care about. A coefficient can be
significant yet of small magnitude. If so, we can be confident of the
correlation but the variable has little effect. Or a coefficient can be large
though not significant. This often occurs with noisy data or data with many
omitted variables.
To see how to use regressions to guide action, imagine a company that
ships spices. This company offers over a hundred types of spices.
Customers buy packages of six, twelve, or twenty-four spices, which
employees pack and ship. A regression estimating the number of orders
shipped per eight-hour shift as a function of the number of years an
employee has worked produces the following:
# Orders Filled = 200 + 20∗∗ · Years
The coefficient on years, 20, is significant at the 1% level. We can be
confident it is positive. If the relationship is causal (see below), the model
can be used to predict the number of orders that each employee can fill per
shift as a function of years of work, and we can use the model to project
how many orders the current employees will fill next year. Here we have an
instance of a model both making a prediction and guiding an action.
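The projection described here amounts to a few lines of arithmetic. In the sketch below, the list of employee tenures is hypothetical; only the equation comes from the regression above.

```python
def orders_per_shift(years):
    """Estimated orders filled per eight-hour shift: 200 + 20 * years."""
    return 200 + 20 * years

# Hypothetical tenure, in years, for four current employees.
current_years = [0, 2, 5, 10]

this_year = sum(orders_per_shift(y) for y in current_years)
# Next year every current employee has one more year of experience.
next_year = sum(orders_per_shift(y + 1) for y in current_years)

print(this_year, next_year, next_year - this_year)
```

Because the model is linear, each employee adds the same twenty orders per shift with one more year of experience, regardless of current tenure.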
Correlation vs. Causation
Regression only reveals correlation among variables, not causality.3 If we
first construct a model and then use regression to test if the model’s results
are supported by data, we do not prove causality either. However, writing
models first is far better than running regressions in search of a significant
correlate, a technique known as data mining. Data mining runs the risk of
identifying a variable that correlates with other causal variables. For
example, data mining might find a significant positive correlation between
vitamin D levels and general health. People absorb vitamin D from
sunlight, so the effect could be due to the fact that people with active
lifestyles spend more time outdoors and have better health. Or a regression
might find that a school’s academic performance correlates strongly with
the number of students on its equestrian team. Equestrian teams likely have
no direct causal effect but they correlate with family income and school
funding levels which do.
Data mining can also result in spurious correlations, where just by chance
two variables are correlated. We might find that companies with longer
names earn higher profit or that people who live near pizza restaurants are
more likely to get the flu. With a 5% significance threshold, about one in
every twenty unrelated variables we test will appear significant by chance.
So, if we try enough variables, we will almost surely find significant (and
spurious) correlations.
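A small simulation illustrates the one-in-twenty logic. Everything here is an assumption made for illustration: the outcome and all candidate variables are pure noise, and significance is judged with a large-sample approximation (|r| greater than 1.96 divided by the square root of n) rather than an exact test.

```python
import math
import random

def correlation(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(7)          # fixed seed so the run is repeatable
n_obs, n_vars = 200, 1000

# The outcome and every candidate variable are independent noise.
outcome = [rng.gauss(0, 1) for _ in range(n_obs)]

# Large-sample approximation: with no true relationship, |r| exceeds
# 1.96 / sqrt(n) about 5% of the time.
threshold = 1.96 / math.sqrt(n_obs)

false_hits = 0
for _ in range(n_vars):
    candidate = [rng.gauss(0, 1) for _ in range(n_obs)]
    if abs(correlation(candidate, outcome)) > threshold:
        false_hits += 1

share_significant = false_hits / n_vars
print(share_significant)
```

The printed share should land near 0.05 even though none of the candidate variables has any relationship to the outcome.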
We can avoid reporting spurious correlations by creating training sets and
testing sets. A correlation found on the training set that also holds on the
testing set is far more likely to be true. We still have no guarantee of a
causal relationship, however. To prove causality, we need to run an
experiment where we manipulate the independent variable and see if the
dependent variable changes. Or we look for a natural experiment where this
has happened by chance.
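The training-set and testing-set idea can be sketched as follows. The data are simulated noise, so every correlation found is spurious by construction, and the significance check is a rough large-sample approximation rather than an exact test. A stricter version would also require the correlation to have the same sign on both halves.

```python
import math
import random

def is_significant(xs, ys):
    """Rough 5% test: |Pearson r| > 1.96 / sqrt(n) (large-sample approximation)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    r = sxy / math.sqrt(sxx * syy)
    return abs(r) > 1.96 / math.sqrt(n)

rng = random.Random(11)         # fixed seed so the run is repeatable
n_obs, n_vars = 400, 400
half = n_obs // 2

outcome = [rng.gauss(0, 1) for _ in range(n_obs)]
candidates = [[rng.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_vars)]

# Step 1: mine the training half for significant correlates.
found_on_training = [
    c for c in candidates if is_significant(c[:half], outcome[:half])
]

# Step 2: keep only those that are also significant on the testing half.
confirmed_on_testing = [
    c for c in found_on_training if is_significant(c[half:], outcome[half:])
]

print(len(found_on_training), len(confirmed_on_testing))
```

Most of the variables that look significant on the training half fail on the testing half, which is the filtering effect the paragraph describes.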
Multivariable Linear Models
Most phenomena have multiple causal and correlative variables. A person’s
happiness can be attributed to health, marital status, offspring, religious
affiliation, and wealth. The value of a house depends on square footage, lot
size, the number of bathrooms, the number of bedrooms, the type of
construction, and the quality of local schools. All of these variables can be
included in a regression to explain housing values. We must keep in mind,
though, that as we add more variables, we need more data to obtain
significant coefficients.
Before discussing multiple-variable regression, we build intuition for
multiple-variable equations by introducing Mauboussin’s skill-luck
equation.4 The equation writes success, be it in work, sports, or games, as a
weighted linear function of skill and luck.
The Success Equation
Success = a · Skill + (1 − a) · Luck
where a in [0, 1] equals the relative weight on skill.
If we can assign relative weights to skill and luck, perhaps by using a
regression if we had data, we could use the model to predict outcomes. If
the manager of a team of recreational vehicle salespeople finds that
success, measured in sales, has a large luck component, he would expect
regression to the mean: salespeople who did well this month would be
likely to be about average the next month. The manager could then use the
model to guide action. He might not want to match a higher salary offer
from a competitor for a salesperson who had two good months in a row. If
instead the regressions showed that luck played almost no role,
performance in two months would be a good predictor of performance in
future months. In this case, the manager would want to match an outside
offer for the best salesperson.
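Regression to the mean under the success equation can be illustrated with a simulation. The sketch assumes that each person's skill is fixed, that luck is drawn fresh every month, and that both follow a standard normal distribution; none of these choices comes from the text.

```python
import random

def simulate_two_months(a, n_people=2000, seed=3):
    """Return (month1, month2) scores under Success = a*Skill + (1-a)*Luck."""
    rng = random.Random(seed)
    skills = [rng.gauss(0, 1) for _ in range(n_people)]            # fixed per person
    month1 = [a * s + (1 - a) * rng.gauss(0, 1) for s in skills]   # fresh luck
    month2 = [a * s + (1 - a) * rng.gauss(0, 1) for s in skills]   # fresh luck
    return month1, month2

def top_decile_next_month(a):
    """Average month-1 and month-2 scores of the top 10% in month 1."""
    month1, month2 = simulate_two_months(a)
    ranked = sorted(range(len(month1)), key=lambda i: month1[i], reverse=True)
    top = ranked[: len(ranked) // 10]
    avg_m1 = sum(month1[i] for i in top) / len(top)
    avg_m2 = sum(month2[i] for i in top) / len(top)
    return avg_m1, avg_m2

luck_heavy = top_decile_next_month(a=0.2)    # mostly luck
skill_heavy = top_decile_next_month(a=0.9)   # mostly skill

print(luck_heavy, skill_heavy)
```

When luck dominates, the month-1 stars fall back toward the average (zero in this setup) in month 2; when skill dominates, they stay close to their month-1 level.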
The same insight applies to CEO pay. A board of directors should not pay
bonuses to CEOs who work in industries where luck determines success.
An oil company’s profits depend on the market price of crude oil, a variable
that lies outside the company’s control. An oil company’s board should
therefore be reluctant to reward a CEO for a good year. An advertising
company would be wise to do the opposite—to award a large bonus to a
CEO if the company performs well. In brief, pay for skill; do not pay for
luck. Better-run corporations do in fact pay less for luck.5
Even the simplest of models, such as this one, produce subtle insights. By
thinking about the equation, we see that even in a context that depends
almost entirely on skill, such as running, biking, swimming, chess, or
tennis, if skill differences are small, luck largely determines who wins. We
might expect that in the most competitive environments, like the Olympics,
skill differences are small, and thus luck matters. Mauboussin calls this the
paradox of skill. Michael Phelps, the greatest swimmer in history, has been
on both ends of the paradox. In the 2008 Olympic Games, Phelps trailed
Milorad Cavic at the end of the 100-meter butterfly. Yet by a stroke of luck,
Phelps touched the wall first. In the 2012 Olympic Games, Phelps led Chad
le Clos at the finish, but le Clos touched first. Yes, Phelps has incredible
skill, but that one win and that one loss were the products of luck.
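The paradox of skill can also be simulated. In this sketch, two competitors are scored with the success equation using a weight of 0.9 on skill, and luck is drawn from a standard normal distribution. The weight, the distribution, and the skill gaps are all illustrative assumptions.

```python
import random

def better_player_win_rate(skill_gap, a=0.9, trials=20000, seed=5):
    """Share of contests won by the more skilled of two competitors.

    Each score follows Success = a * Skill + (1 - a) * Luck, with luck
    drawn fresh from a standard normal distribution in every contest.
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        stronger = a * skill_gap + (1 - a) * rng.gauss(0, 1)
        weaker = a * 0.0 + (1 - a) * rng.gauss(0, 1)
        if stronger > weaker:
            wins += 1
    return wins / trials

large_gap = better_player_win_rate(skill_gap=1.0)    # clear skill difference
small_gap = better_player_win_rate(skill_gap=0.01)   # elite field, tiny difference

print(large_gap, small_gap)
```

Even with 90% of the weight on skill, the better competitor wins only slightly more than half the time once the skill gap becomes tiny.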
Multiple-Variable Regression
Multiple-variable linear regression models fit linear equations with many
variables and also minimize the total distance to the data. These equations
include coefficients for each independent variable. The equation below
shows a hypothetical regression output for student performance on a math
test as a function of hours studied (HRS), family socioeconomic status
(SES), and the number of accelerated classes (AC).
Math Score = 21.1 + 9.2∗∗ · HRS + 0.8 · SES + 6.9∗ · AC
According to the regression, a student’s score increases by 9.2 points for
every extra hour spent studying. The coefficient has two *’s, so it is
significantly different from zero at a 1% level. This implies strong
correlation, though not causality. The equation also shows that a student
scores almost seven points higher for each accelerated class. That
coefficient is significant as well, but only at the 5% level. Family
socioeconomic status (SES), a variable that takes on values from 1 (low) to
5 (high), has a coefficient that is positive but not significantly different
from zero, so we can assume it probably has little causal effect.
With this or any regression output, we can predict outcomes. The model
predicts that a student who spends seven hours studying and takes one
accelerated class should score in the 90s. The model can also guide actions,
though we must be cautious, as we cannot infer causality. The data show
that students who study and take accelerated classes perform better. One
reason studying more or taking those classes may not help is selection bias.
It might be that the students who study more and those who take
accelerated classes are better at math.
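The prediction above can be reproduced directly from the regression equation. The function name `math_score` is illustrative, and the sketch evaluates every SES value from 1 to 5 because the text does not fix the student's SES.

```python
def math_score(hrs, ses, ac):
    """Predicted score: 21.1 + 9.2 * HRS + 0.8 * SES + 6.9 * AC."""
    return 21.1 + 9.2 * hrs + 0.8 * ses + 6.9 * ac

# Seven hours of study and one accelerated class, across SES values 1 to 5.
predictions = {ses: math_score(hrs=7, ses=ses, ac=1) for ses in range(1, 6)}

for ses, score in predictions.items():
    print(ses, round(score, 1))
```

Whatever the SES value, the predicted score falls in the 90s, since SES shifts the prediction by less than one point per step.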
Even though regressions cannot prove what causes patterns in data, they
can rule out explanations. Take the large wealth disparity by race in the
United States: in 2016, the average wealth of white families (approximately
$110,000) was more than ten times that of African American and Latino
families. Any number of causes might explain that gap, including
institutional factors, differences in income, savings behavior, or marriage
rates. Regressions support some explanations and rule out others. For
example, regressions reveal no significant relationship between marital
status and wealth among African Americans, so marital status cannot be a
cause. Income differences, though substantial, also prove insufficient to
explain the gap.6
The Big Coefficient and New Realities
As already stated, linear regression models play prominent roles in
scientific research, policy analysis, and strategic decision-making, in part
because they are easy to estimate and interpret. With the increased
availability of data, they have become even more widely used. The phrase
“In God we trust. Everyone else must bring data” is often heard in business
and in the halls of government. A reliance on data—and that often means
linear regression models—can steer us toward marginal actions and away
from big new ideas. A business, government, or foundation that gathers
data, fits a linear regression model, and finds the variable with the largest
statistically significant coefficient almost cannot stop itself from adjusting
that variable and taking the marginal gain.
When taking an action, it is better to choose the variable with the big
coefficient than a variable with a small coefficient. At the same time,
big-coefficient thinking builds in conservatism. It focuses attention on certain
modest improvements and pulls attention away from novel policies. A
second problem with big-coefficient thinking is that the magnitude of the
big coefficient corresponds to the marginal effect given existing data.
Often, as we see in the next chapter, effect sizes diminish as we increase the
value of a variable. If so, the big coefficient becomes smaller as we try to
exploit it.
The Big Coefficient vs. the New Reality
Linear regressions reveal the magnitude of correlations of independent
variables with the variable of interest. If that correlation is causal, changes
to the variable with a big coefficient will have large effects. Policies based
on big coefficients guarantee improvements but rule out new realities that
involve more fundamental changes.
The alternative to big-coefficient thinking is new-reality thinking.
Big-coefficient thinking widens roads and builds high-occupancy vehicle lanes
to reduce traffic. New-reality thinking builds train and bus systems.
Big-coefficient thinking subsidizes computers for low-income students.
New-reality thinking gives everyone a computer and reduces mail delivery to
three days a week. Big-coefficient thinking changes the width of airline
seats. New-reality thinking creates an airplane interior that can be filled
with interchangeable pods. Big coefficients are good. Evidence-based
action is wise, but we must also keep our eyes open to big new ideas.
When we encounter them, we can use models to explore whether they
might work. A regression on teenage traffic accidents may find that age has
the largest coefficient, implying that states might want to raise the driving
age. That may work, but so too might more novel policies such as curfews
that prohibit nighttime driving, automated monitoring of teenage drivers
through smartphones, or limits on the number of passengers in teenagers’
cars. These new-reality policies might produce larger effect sizes than
riding the big coefficient.
Summary
To summarize, linear models posit constant effect sizes. Linear regression
offers a powerful tool for taking a first cut at data, enabling us to identify
the sign, magnitude, and significance of variables. If we want to know the
health effects of coffee, alcohol, or soda consumption, we can run
regressions. We may find that coffee consumption reduces the risk of
cardiovascular disease and that modest levels of alcohol consumption do as
well. That said, we should be skeptical of extrapolating linear
effects too far outside of the existing data range. We should not infer that
thirty cups of coffee, much less six glasses of wine, would be a good idea.
Nor should we make linear projections too far ahead in time. California’s
population grew at a rate of about 45% per decade from 1880 to 1960. Had we
made a linear projection, we would have pegged California’s population in
2018 at 100 million people, more than double its actual level.
Keep in mind we are just getting started. Most phenomena of interest are
not linear. For that reason, regression models often include nonlinear terms
such as age squared, the square root of age, or even the log of age. To
account for nonlinearities, we can also arrange linear models end to end.
These concatenated linear models can approximate a curve in much the
same way as we can use straight-edged bricks to construct a curved path.
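The brick analogy can be made concrete with a sketch that joins straight segments end to end to approximate a curve. It uses the square root of age over ages 20 to 60 as the curve, and the helper names and segment counts are illustrative choices.

```python
import math

def piecewise_linear(f, lo, hi, segments):
    """Return a function that joins `segments` straight lines end to end,
    matching f at evenly spaced breakpoints between lo and hi."""
    width = (hi - lo) / segments
    knots = [lo + i * width for i in range(segments + 1)]
    values = [f(k) for k in knots]

    def approx(x):
        # Find the segment containing x, then interpolate linearly within it.
        i = min(int((x - lo) / width), segments - 1)
        slope = (values[i + 1] - values[i]) / width
        return values[i] + slope * (x - knots[i])

    return approx

def max_error(f, approx, lo, hi, samples=1000):
    """Largest gap between f and its approximation on an evenly spaced grid."""
    step = (hi - lo) / samples
    return max(abs(f(lo + j * step) - approx(lo + j * step))
               for j in range(samples + 1))

# Approximate the square root of age for ages 20 to 60.
coarse = piecewise_linear(math.sqrt, 20, 60, segments=2)
fine = piecewise_linear(math.sqrt, 20, 60, segments=8)

coarse_error = max_error(math.sqrt, coarse, 20, 60)
fine_error = max_error(math.sqrt, fine, 20, 60)
print(coarse_error, fine_error)
```

Adding more, shorter segments shrinks the largest gap between the curve and its straight-line approximation.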
Though linearity may be a strong, and unrealistic, assumption, it offers a
good place to start. If given data, we can use linear models to test our
intuitions. We can then construct more elaborate models in which the effect
of a variable dampens as it increases (diminishing returns) or becomes
more powerful (positive returns). These nonlinear models are the focus of
the next chapter.
Binary Classifications of Data
In an era of Big Data, organizations use algorithms informed by models to
classify their data. A political party might want to learn who votes, an
airline might want to learn the attributes of their frequent flyers, and an
event organizer might want to learn about the event’s attendees. In each
case, the organization classifies people into two sets: those who buy,
contribute, or enroll are labeled as positives (+’s) and those who do not are
labeled as negatives (-’s).
Classification models apply algorithms to partition the people into
categories based on attributes such as a person’s age, income, education
level, or hours spent on the internet. Different algorithms imply different
underlying models of the relationship between attributes and outcomes.
Applying multiple algorithms—using many models—will produce an even
better classification.
Linear classifications: In figure M1, positives (+) represent voters and
negatives (-) represent nonvoters. A linear function of a person’s age and
education level can be used to classify whether or not a person votes. The data
show that more educated people are more likely to vote and that older
people are more likely to vote. In this example, a straight line classifies
nearly perfectly.7
Figure M1: Using a Linear Model to Classify Voting Behavior
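A linear classifier of this kind can be sketched in a few lines. The weights, the cutoff, and the six people below are hypothetical; they are chosen so that older and more educated people are classified as voters and are not estimated from the data in the figure.

```python
def votes(age, years_of_education, w_age=1.0, w_edu=3.0, cutoff=85.0):
    """Linear classifier: predict a voter (+) when the weighted sum of age
    and education clears the cutoff. Weights and cutoff are hypothetical."""
    return w_age * age + w_edu * years_of_education > cutoff

# Hypothetical people: (age, years of education, actually voted?)
people = [
    (22, 12, False),
    (30, 16, False),
    (35, 12, False),
    (45, 16, True),
    (60, 12, True),
    (70, 18, True),
]

correct = sum(votes(age, edu) == voted for age, edu, voted in people)
print(correct, "of", len(people), "classified correctly")
```

The boundary where the weighted sum equals the cutoff is a straight line in the age and education plane, which is what makes the classification linear.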
Nonlinear classifications: In figure M2, positives (+) represent frequent
flyers, consumers who fly more than 10,000 miles per year, and negatives
(-) represent all other customers of an airline. People of middle age and
higher income are more likely to fly. To classify these data requires a
nonlinear model, which could be estimated using deep-learning algorithms,
such as neural networks. Neural networks include many more parameters than
a linear model, so they can fit almost any curve.
Figure M2: Using a Nonlinear Model to Classify Frequent Flyers
Forests of decision trees: In figure M3, positives (+) represent people who
attended a science fiction convention, plotted by their age and the hours per
week they spend on the internet. Here we classify the data using three
decision trees. Decision trees make classifications based on sets of
conditions on the attributes. The figure shows three trees:
Tree 1: If (age < 30) and (internet hours per week in [15, 25])
Tree 2: If (age in [20, 45]) and (internet hours per week > 30)
Tree 3: If (age > 40) and (internet hours per week < 20)
Figure M3: A Forest of Decision Trees Classifying Conference Attendees
The collection of trees is called a forest. Machine learning algorithms
create trees randomly on a training set and then keep those that also
classify accurately on a testing set.
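The three trees listed above translate directly into code. One assumption is needed: the text gives each tree's conditions but not how the trees are combined, so this sketch labels a person a positive (+) if any one tree's conditions are met.

```python
def tree_1(age, hours):
    """Tree 1: age < 30 and internet hours per week in [15, 25]."""
    return age < 30 and 15 <= hours <= 25

def tree_2(age, hours):
    """Tree 2: age in [20, 45] and internet hours per week > 30."""
    return 20 <= age <= 45 and hours > 30

def tree_3(age, hours):
    """Tree 3: age > 40 and internet hours per week < 20."""
    return age > 40 and hours < 20

FOREST = [tree_1, tree_2, tree_3]

def attends(age, hours):
    """Classify as a positive (+) if any tree's conditions are met.
    Combining the trees with 'any' is an assumption of this sketch."""
    return any(tree(age, hours) for tree in FOREST)

print(attends(25, 20))   # meets Tree 1's conditions
print(attends(35, 40))   # meets Tree 2's conditions
print(attends(50, 10))   # meets Tree 3's conditions
print(attends(35, 10))   # meets none of the three
```

Each tree carves out one rectangle in the age and internet-hours plane, and the forest covers the union of those rectangles.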