3. The Science of Many Models
Nothing is less real than realism. Details are confusing. It is only by
selection, by elimination, by emphasis that we get to the real meaning of
things.
—Georgia O’Keeffe
In this chapter, we take a scientific approach to motivate the many-model
approach. We begin with the Condorcet jury theorem and the diversity
prediction theorem, which make quantifiable cases for the value of many
models in helping us act, predict, and explain. These theorems may
overstate the case for many models. To show why, we introduce
categorization models, which partition the world into boxes. Using
categorization models shows us that constructing many models may be
harder than we expect. We then apply this same class of model to discuss
model granularity—how specific our models should be—and to help us
decide whether to use one big model or many small models. The choice
will depend on the use. When predicting, we often want to go big. When
explaining, smaller is better.
The conclusion addresses a lingering concern. Many-model thinking might
seem to require learning a lot of models. While we must learn some
models, we need not learn as many as you might think. We do not need to
master a hundred models, or even fifty, because models possess a one-to-many property. We can apply any one model to many cases by reassigning
names and identifiers and modifying assumptions. This property of models
offers a counterpoise to the demands of many-model thinking. Applying a
model in a new domain requires creativity, an openness of mind, and
skepticism. We must recognize that not every model will be appropriate to
every task. If a model cannot explain, predict, or help us reason, we must
set it aside.
The skills required to excel at one-to-many differ from the mathematical
and analytic talents many people think of as necessary for being a good
modeler. The process of one-to-many involves creativity. It asks: How
many uses can I think of for a random walk? To provide a hint of the forms
that creativity takes, at the end of the chapter we apply the geometric
formula for area and volume as a model and use it to explain the size of
supertankers, to criticize the body mass index, to predict the scaling of
metabolisms, and to explain why we see so few women CEOs.
Many Models as Independent Lies
We now turn to formal models that help reveal the benefits of many-model
thinking. Within those models, we describe two theorems: the Condorcet
jury theorem and the diversity prediction theorem. The Condorcet jury
theorem is derived from a model constructed to explain the advantages of
majority rule. In the model, jurors make binary decisions of guilt or
innocence. Each juror is correct more often than not. In order to apply the
theorem to collections of models instead of jurors, we interpret each juror’s
decision as a classification by a model. These classifications could be
actions (buy or sell) or predictions (Democratic or Republican winner). The
theorem then tells us that by constructing multiple models and using
majority rule we will be more accurate than if we used one of the
constituent models. The model relies on the concept of a state of the world,
a full description of all relevant information. For a jury, the state of the
world consists of the evidence presented at trial. For models that measure
the social contribution of a charitable project, the state of the world might
correspond to the project’s team, the organizational structure, the
operational plan, and the characteristics of the problem or situation the
project would address.
Condorcet Jury Theorem
Each of an odd number of people (models) classifies an unknown state of
the world as either true or false. Each person (model) classifies correctly
with a probability p > 1/2, and the probability that any person (model)
classifies correctly is statistically independent of the correctness of any
other person (model).
Condorcet jury theorem: A majority vote classifies correctly with higher
probability than any person (model), and as the number of people (models)
becomes large, the accuracy of the majority vote approaches 100%.
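The theorem's claim can be checked directly. Here is a minimal sketch in Python (the function name and the binomial computation are my own, not from the text) that computes the exact probability that a majority of n independent classifiers votes correctly:

```python
from math import comb

def majority_accuracy(p, n):
    """Exact probability that a majority of n independent classifiers,
    each correct with probability p, classifies correctly (n odd)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# With p = 0.6, accuracy rises from 60% toward 100% as models are added.
for n in (1, 3, 11, 101):
    print(n, round(majority_accuracy(0.6, n), 3))
```

Note that the proof of the theorem relies on the same binomial sum: a majority is correct whenever more than half of the independent classifiers are correct.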
Ecologist Richard Levins elaborates on how the logic of the theorem
applies to the many-model approach: “Therefore, we attempt to treat the
same problem with several alternative models each with different
simplifications but with a common biological assumption. Then, if these
models, despite their different assumptions, lead to similar results, we have
what we can call a robust theorem, which is relatively free of the details of
the model. Hence our truth is the intersection of independent lies.”1 Note
that here he aspires to a unanimity of classification. When many models
make a common classification, our confidence should soar.
Our next theorem, the diversity prediction theorem, applies to models that
make numerical predictions or valuations. It quantifies the contributions of
model accuracy and model diversity to the accuracy of the average of those
models.2
Diversity Prediction Theorem
Many-Model Error = Average-Model Error − Diversity of Model Predictions

(M̄ − V)² = (1/n) Σᵢ (Mᵢ − V)² − (1/n) Σᵢ (Mᵢ − M̄)²

where Mᵢ equals model i's prediction, M̄ equals the average of the models' predictions, and V equals the true value.
The diversity prediction theorem describes a mathematical identity. We
need not test it. It always holds. Here is an example. Two models predict
the number of Oscars a film will be awarded. One model predicts two
Oscars, and the other predicts eight. The average of the two models’
predictions—the many-model prediction—equals five. If, as it turns out,
the film wins four Oscars, the first model’s error equals 4 (2 squared), the
second model’s error equals 16 (4 squared), and the many-model error
equals 1. The diversity of the models’ predictions equals 9 because each
differs from the mean prediction by 3. The diversity prediction theorem can
then be expressed as follows: 1 (the many-model error) = 10 (the average-model error) − 9 (the diversity of the predictive models).
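Because the theorem is an identity, we can confirm it on the Oscars example with a few lines of Python (a sketch; the function name is mine):

```python
def diversity_prediction(predictions, truth):
    """Return (many-model error, average-model error, diversity)."""
    n = len(predictions)
    mean = sum(predictions) / n
    many_model_error = (mean - truth) ** 2
    average_model_error = sum((m - truth) ** 2 for m in predictions) / n
    diversity = sum((m - mean) ** 2 for m in predictions) / n
    return many_model_error, average_model_error, diversity

# Oscars example from the text: predictions of 2 and 8, true value 4.
print(diversity_prediction([2, 8], 4))  # (1.0, 10.0, 9.0): 1 = 10 - 9
```

Any list of predictions and any true value will satisfy the identity; the Oscars numbers merely make it concrete.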
The logic of the theorem relies on opposite types of errors (pluses and
minuses) canceling each other out. If one model predicts a value that is too
high and another model predicts a value that is too low, then the models
exhibit predictive diversity. The two errors cancel, and the average of the
models will be more accurate than either model by itself. Even if both
predict values that are too high, the error of the average of those predictions
will still not be worse than the average error of the two high predictions.
The theorem does not imply that any collection of diverse models will be
accurate. If all of the models share a common bias, their average will also
contain that bias. The theorem does imply that any collection of diverse
models (or people) will be more accurate than its average member, a
phenomenon referred to as the wisdom of crowds. That mathematical fact
explains the success of ensemble methods in computer science that average
multiple classifications as well as evidence that individuals who think using
multiple models and frameworks predict with higher accuracy than people
who use single models. Any single way of looking at the world leaves out
details and makes us prone to blind spots. Single-model thinkers are less
likely to anticipate large events, such as market collapses or the Arab
Spring of 2011.3
These two theorems make a compelling case for using many models, at
least in the context of prediction. The case may be too compelling,
however. The Condorcet jury theorem implies that with enough models, we
would almost never make a mistake. The diversity prediction theorem
implies that if we could construct a diverse set of moderately accurate
predictive models, we could reduce our many-model error to near zero. As
we see next, our ability to construct many diverse models has limits.
Categorization Models
To demonstrate why the two theorems may overstate the case, we rely on
categorization models. These models provide micro-foundations for the
Condorcet jury theorem. Categorization models partition the states of the
world into disjoint boxes. Such models date to antiquity. In The Categories,
Aristotle defined ten attributes that could be used to partition the world.
These included substance, quantity, location, and positioning. Each
combination of attributes would create a distinct category.
We use categories any time we use a common noun. “Pants” is a category;
so are “dogs,” “spoons,” “fireplaces,” and “summer vacations.” We use
categories to guide actions. We categorize restaurants by ethnicity—Italian,
French, Turkish, or Korean—to decide where to have lunch. We categorize
stocks by their price-to-earnings ratios and sell stocks with low price-to-earnings ratios. We use categories to explain, as when we claim that
Arizona’s population has grown because the state has good weather. We
also use categories to predict: we might forecast that a candidate for
political office with military experience has an increased chance of
winning.
We can interpret the contributions of categorization models within the
wisdom hierarchy. The objects constitute the data. Binning the objects into
categories creates information. The assigning of valuations to categories
requires knowledge. To critique the Condorcet jury theorem, we rely on a
binary categorization model that partitions the objects or states into two
categories, one labeled “guilty” and one “innocent.” The key insight will be
that the number of relevant attributes constrains the number of distinct
categorizations, and therefore the number of useful models.
Categorization Models
There exists a set of objects or states of the world, each defined by a set of
attributes and each with a value. A categorization model, M, partitions
these objects or states into a finite set of categories {S₁, S₂, …, Sₙ} based on the objects' attributes and assigns valuations {M₁, M₂, …, Mₙ} to those categories.
Imagine we have one hundred student loan applications, half of which were
paid back and half of which were defaulted. We know two pieces of
information for each loan: whether the loan amount exceeded $50,000, and
whether the recipient majored in engineering or the liberal arts. These are
the two attributes. With two attributes we can distinguish between four
types of loans: large loans to engineers, small loans to engineers, large
loans to liberal arts majors, and small loans to liberal arts majors.
A binary categorization model classifies each of these four types as either
repaid or defaulted. One model might classify small loans as repaid and
large loans as defaulted. Another model might classify loans to engineers as
repaid and loans to liberal arts majors as defaulted. It seems plausible that
each of these models could be correct more than half the time, and that the
two models might be approximately independent of each other. A problem
arises when we try to construct more models. There exist only sixteen
unique models that map four categories into two outcomes. Two of those
models classify all loans as repaid or defaulted. Each of the remaining
fourteen has an exact opposite. Whenever the model classifies correctly, its
opposite model classifies incorrectly. Thus, of the fourteen possible models,
at most seven can be correct more than half the time. And if any model
happens to be correct exactly half of the time, then so must its opposite.
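The counting argument can be made concrete in code. The sketch below (with hypothetical repayment counts of my own invention, not from the text) enumerates all sixteen binary categorization models over the four loan types and checks that each model and its opposite have accuracies summing to one:

```python
from itertools import product

# Hypothetical counts of (repaid, defaulted) loans among 100 applications,
# for the four types: Large/Engineer, Small/Engineer, Large/Liberal arts,
# Small/Liberal arts. These numbers are illustrative.
counts = {'LE': (15, 10), 'SE': (20, 5), 'LL': (10, 15), 'SL': (5, 20)}
types = list(counts)

def accuracy(model):
    """Fraction of loans classified correctly; model maps each loan type
    to 1 (predict repaid) or 0 (predict defaulted)."""
    correct = sum(repaid if model[t] else defaulted
                  for t, (repaid, defaulted) in counts.items())
    return correct / 100

models = [dict(zip(types, bits)) for bits in product([0, 1], repeat=4)]
print(len(models))  # 16 possible models

# Each model's opposite flips every classification, so their accuracies
# sum to 1: at most half of the models can beat 50%.
for m in models:
    opposite = {t: 1 - v for t, v in m.items()}
    assert abs(accuracy(m) + accuracy(opposite) - 1.0) < 1e-9
```

Whatever counts we plug in, the pairing of each model with its opposite caps the number of better-than-chance models.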
The dimensionality of our data limits the number of models we can
produce. At most we can have seven models. We cannot construct eleven
independent models, much less seventy-seven. Even if we had higher-dimensional data—say, if we knew the recipient's age, grade point average, income, marital status, and address—the categorizations that relied on those attributes would still have to yield accurate predictions. Each subset of attributes
would have to be relevant to whether the loan was repaid and be
uncorrelated with the other attributes. Both are strong assumptions. For
example, if address, marital status, and income are correlated, then models
that swap those attributes will be correlated as well.4 In the stark
probabilistic model, independence seemed reasonable: different models
make independent mistakes. When we unpack that logic with
categorization models, we see the difficulty of constructing multiple
independent models.
Attempts to construct a collection of diverse, accurate models encounter a
similar problem. Suppose that we want to build an ensemble of
categorization models that predict unemployment rates across five hundred
mid-size cities. An accurate model must partition cities into categories such
that within a category the cities have similar unemployment rates. The
model must also predict unemployment accurately for each category. For
two models to make diverse predictions, they must categorize cities
differently, predict differently, or do both. Those two criteria, though not in
contradiction, can be difficult to satisfy. If one categorization relies on
average education level and a second relies on average income, they may
categorize similarly. If so, the two models will be accurate but not diverse.
Creating twenty-six categories using the first letter of each city’s name will
create a diverse categorization but probably not an accurate model. Here as
well, the takeaway is that in practice “many” may be closer to five than
fifty.
Empirical studies of prediction align with that inference. While adding
models improves accuracy (it must, given the theorems), the marginal
contribution of each model falls off after a handful of models. Google
found that using one interviewer to evaluate job candidates (instead of
picking at random) increases the probability of an above-average hire from
50% to 74%, adding a second interviewer increases the probability to 81%,
adding a third raises it to 84%, and using a fourth lifts it to 86%. Using
twenty interviewers only increases the probability to a little over 90%. That
evidence suggests a limit to the number of relevant ways of looking at a
potential hire.
A similar finding holds for an evaluation of tens of thousands of forecasts
by economists regarding unemployment, growth, and inflation. In this case,
we should think of the economists as models. Adding a second economist
improves the accuracy of the prediction by about 8%, two more increase it
by 12%, and three more by 15%. Ten economists improve the accuracy by
about 19%. Incidentally, the best economist is only about 9% better than
average—assuming you knew which economist was best. So three random
economists perform better than the best one.5 Another reason for averaging
many and not relying on the economist who has been best historically is
that the world changes. The economist who performs at the top today may
be middling tomorrow. That same logic explains why the US Federal
Reserve relies on an ensemble of economic models rather than just one: the
average of many models will typically be better than the best model.
The lesson should be clear: if we can construct multiple diverse, accurate
models, then we can make very accurate predictions and valuations and
choose good actions. The theorems validate the logic of many-model
thinking. What the theorems do not do, and cannot do, is construct the
many models that meet their assumptions. In practice, we may find that we
can construct three or maybe five good models. If so, that would be great.
We need only read back one paragraph: adding a second model yields an 8% improvement, and a few more models get us to 15%. Keep in mind, these
second and third models need not be better than the first model. They could
be worse. If they are a little less accurate, but categorically (in the literal
sense) different, they should be added to the mix.
One Big Model and the Granularity Question
Many models work in theory and in practice. That does not mean that they
are always the correct approach. Sometimes we are better off constructing a
single large model. In this section, we put some thought into when we
should use each approach and along the way take up the granularity
question of how finely we should partition our data.
To take on the first question, of whether to use one big model or many
small ones, recall the uses of models: to reason, explain, design,
communicate, act, predict, and explore. Four of these uses—to reason,
explain, communicate, and explore—require simplification. By
simplifying, we can apply logic allowing us to explain phenomena,
communicate our ideas, and explore possibilities.
Think back to the Condorcet jury theorem. Within it, we could unpack the logic, explain why an approach that uses many models is more likely to
produce a correct result, and communicate our findings. Had we
constructed a model of jurors with personality types and described the
evidence as vectors of words, we would have been lost in a mangle of
detail. Borges makes this point in his short story “On Exactitude in Science.” He describes
mapmakers who make ever more elaborate maps: “The Cartographers
Guilds struck a Map of the Empire whose size was that of the Empire, and
which coincided point for point with it. The following Generations, who
were not so fond of the Study of Cartography as their Forebears had been,
saw that this vast Map was useless.”
The three other uses of models—to predict, design, and act—can benefit
from high-fidelity models. If we have BIG data, we should use it. As a rule
of thumb, the more data we have, the more granular we should make our
model. This can be shown by using categorization models to structure our
thinking. Suppose first that we want to construct a model to explain
variation in a data set. To provide context, suppose that we have an
enormous data set from a chain of grocery stores detailing monthly
spending on food for several million households. These households differ
in the amount they spend, which we measure as variation: the sum of the
squared differences between what each family spends and average spending
across all households. If average spending is $500 a month and a given
family spends $520, that family contributes 400, or 20 squared, to the total
variation. Statisticians call the proportion of the variation that a model
explains the model’s R².
If the data had a total variation of 1 billion and a model explains 800
million of that variation, then the model has an R² of 0.8. The amount of
variation explained corresponds to how much the model improves on the
mean estimate. If the model estimates that a household will spend $600 and
the household in fact spent $600, then the model explains all 10,000 that
the household contributes to total variation. If the household spent $800
and the model says $700, then what had been a contribution of 90,000 to total variation ((800 − 500)²) is now only a 10,000 contribution ((800 − 700)²). The model explains 8/9 of the variation.
R²: Percentage of Variance Explained

R² = 1 − Σₓ (V(x) − M(x))² / Σₓ (V(x) − V̄)²

where V(x) equals the value of x in X, V̄ equals the average value, and M(x) equals the model's valuation.
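A sketch of the computation (the function name and the toy spending figures are mine, chosen for illustration):

```python
def r_squared(true_values, model_values):
    """Share of variation around the mean that the model explains."""
    mean = sum(true_values) / len(true_values)
    total = sum((v - mean) ** 2 for v in true_values)
    unexplained = sum((v - m) ** 2 for v, m in zip(true_values, model_values))
    return 1 - unexplained / total

# Hypothetical monthly grocery spending and a model's estimates.
spending = [400, 450, 500, 550, 600]
estimates = [420, 440, 500, 560, 580]
print(round(r_squared(spending, estimates), 3))  # 0.96
```

A model that returns the true value for every household scores 1; a model that returns the mean for every household scores 0.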
In this context, a categorization model would partition the households into
categories and estimate a value for each category. A more granular model
would create more categories. This may require considering more attributes
of the households to create those categories. As we add more categories, we
can explain more of the variation, but we can go too far. If we follow the
example of Borges’s mapmakers and place each household in its own
category, we can explain all of the variation. That explanation, like the life-sized map, would not be of much use.
Creating too many categories overfits the data, and overfitting undermines prediction of future events. Suppose that we want to use last month's data
on grocery purchases to predict this month’s data. Households vary in their
monthly spending. A model that places each household in its own category
would predict that each household spends the same as in the previous
month. That would not be a good predictor given monthly fluctuations in
spending. By placing the household into a category with other similar
households, we can use the average spending on groceries for similar
households to create a more accurate predictor.
To do this, we think of each household’s monthly purchases as a draw from
a distribution (we will cover distributions in Chapter 5). That distribution
has a mean and a variance. The objective in creating a categorization model
is to construct categories based on attributes so that the households within
the same category have similar means. If we can do that, one household’s
spending in the first month tells us about the other households’ spending in
the second month. No categorization will be perfect. The means of
households within each category will differ by a little. We call this
categorization error.
As we make larger categories, we increase categorization error, as we are
more likely to clump households with different means into the same
category. However, these larger categories rely on more data, so our
estimates of the means in each category will be more accurate (see the
square root rules in Chapter 5). The error from misestimating the mean is
called the valuation error. Valuation error decreases as we make categories
larger. One or even ten households per category will not give an accurate
estimate of the mean if households vary substantially in their monthly
spending. A thousand households will.
We now have the key intuition: increasing the number of categories
decreases the categorization error from binning households with different
means into the same category. Statisticians call this model bias. However,
making more categories increases the error from estimating the mean
within each category. Statisticians refer to this as increasing the variance of
the mean. The trade-off in how many categories to create can be expressed
formally in the model error decomposition theorem. Statisticians refer to
the result as the bias-variance trade-off.
Model Error Decomposition Theorem
The Bias-Variance Trade-off
Model Error = Categorization Error + Valuation Error
Σ_{x∈X} (M(x) − V(x))² = Σᵢ Σ_{x∈Sᵢ} (V(x) − Vᵢ)² + Σᵢ |Sᵢ| (Mᵢ − Vᵢ)²

where M(x) and Mᵢ denote the model's values for data point x and category Sᵢ, and V(x) and Vᵢ denote their true values, with Vᵢ the average true value within category Sᵢ.6
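Because the decomposition is an identity (when each category's true value is the average of the true values within it), it can be verified numerically. The data points and model valuations below are illustrative, my own invention:

```python
# Hypothetical (category, true value) data points and a model that
# assigns one valuation per category.
data = [('A', 10), ('A', 14), ('B', 20), ('B', 26), ('B', 23)]
model = {'A': 11, 'B': 24}

# Average true value within each category.
categories = {c for c, _ in data}
cat_mean = {c: sum(v for cc, v in data if cc == c) /
               sum(1 for cc, _ in data if cc == c)
            for c in categories}

model_error = sum((model[c] - v) ** 2 for c, v in data)
categorization_error = sum((cat_mean[c] - v) ** 2 for c, v in data)
valuation_error = sum((model[c] - cat_mean[c]) ** 2 for c, _ in data)

print(model_error, categorization_error, valuation_error)  # 31 26.0 5.0
assert abs(model_error - (categorization_error + valuation_error)) < 1e-9
```

Coarser categories shift error into the categorization term; finer categories shift it into the valuation term, which is the bias-variance trade-off in miniature.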
One-to-Many
Learning models takes time, effort, and breadth. To reduce those demands,
we take a one-to-many approach. We advocate mastering a modest number
of flexible models and applying them creatively. We use a model from
epidemiology to understand the diffusion of seed corn, Facebook, crime,
and pop stars. We apply a model of signaling to advertising, marriage,
peacock feathers, and insurance premiums. And we apply a rugged-landscape model of evolutionary adaptation to explain why humans lack
blowholes. Of course, we cannot take any model and apply it to any
context, but most models are flexible. We gain even when we fail because
attempts at creative uses of models reveal their limits. And it is fun.
The one-to-many approach is relatively new. In the past, models belonged
to specific disciplines. Economists had models of supply and demand,
monopolistic competition, and economic growth; political scientists had
models of electoral competition; ecologists had models of speciation and
replication; and physicists had models describing laws of motion. All of
these models were developed with specific purposes in mind. One would
not apply a model from physics to the economy or a model from economics
to the brain any more than one would use a sewing machine to repair a
leaky pipe.
Taking models out of their disciplinary silos and practicing one-to-many
has produced notable successes. Paul Samuelson reinterpreted models from
physics to explain how markets attain equilibria. Anthony Downs applied a
model of ice cream vendors competing on a beach to explain the
positioning of political candidates competing in ideological space. Social
scientists have applied models of interacting particles to explain poverty
traps, variation in crime rates, and even economic growth across countries.
And economists have taken models of self-control based on economic
principles to understand the functioning of the brain.7
One-to-Many: Higher Powers (Xᴺ)
Creatively applying models requires practice. To provide a preview of the
potential of the one-to-many principle, we take the familiar formula of a variable raised to a power, Xᴺ, and apply it as a model. When the power equals 2, the formula gives the area of a square; when the power equals 3, it gives the volume of a cube. When raised to higher powers, it captures
geometric expansion or decay.
Supertankers: Our first application considers a cubic supertanker whose
length is eight times its depth and width, which we denote by S. As shown
in figure 3.1, the supertanker has a surface area of 34S² and a volume of 8S³. The cost of building a supertanker depends primarily on its surface
area, which determines the amount of steel used. The amount of revenue a
supertanker generates depends on its volume. Computing the ratio of volume to surface area, 8S³/34S² = 4S/17, reveals a linear gain in profitability from increasing size.

Figure 3.1: A Cubic Supertanker: Surface Area = 34S², Volume = 8S³
Shipping magnate Stavros Niarchos, who knew this ratio, built the first
modern supertankers and made billions during the period of rebuilding that
followed World War II. To give some sense of scale: the T2 oil tanker used
during World War II measured 500 feet long, 25 feet deep, and 50 feet
wide. Modern supertankers such as the Knock Nevis measure 1,500 feet
long, 80 feet deep, and 180 feet wide. Imagine tipping the Willis (Sears)
Tower in Chicago on its side and floating it in Lake Michigan. The Knock
Nevis resembles a T2 oil tanker scaled up by a factor of a little over three.
The Knock Nevis has about ten times the surface area of a T2 oil tanker and
over thirty times the volume. A question arises as to why supertankers are
not even larger. The short answer is that tankers must pass through the Suez
Canal; the Knock Nevis squeezes through with a gap of a few feet on each
side.8
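The scaling claims are easy to check with a box model of a hull (the dimensions come from the text; the function is a sketch of my own):

```python
def box_surface_volume(length, depth, width):
    """Surface area and volume of a rectangular-box hull."""
    surface = 2 * (length * depth + length * width + depth * width)
    return surface, length * depth * width

# The cubic supertanker of figure 3.1: an 8S x S x S box with S = 1.
assert box_surface_volume(8, 1, 1) == (34, 8)

t2 = box_surface_volume(500, 25, 50)       # T2 tanker, in feet
knock = box_surface_volume(1500, 80, 180)  # Knock Nevis, in feet
print(knock[0] / t2[0])  # surface area ratio: about 10
print(knock[1] / t2[1])  # volume ratio: about 35
```

Volume (revenue) grows faster than surface area (cost), which is the linear gain in profitability from size.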
Body mass index: Body mass index (BMI) is used by the medical
profession to define weight categories. Developed in England, BMI equals
the ratio of a person’s weight (in kilograms) to her height in meters
squared.9 Holding height constant, BMI increases linearly with weight. If
one person weighs 20% more than another person of the same height, the
first person’s BMI will be 20% higher.
We first apply our model to approximate a person as a perfect cube made
up of some mixture of fat, muscle, and bone. Let M denote the weight of
one cubic meter of our cubic person. The human cube’s weight equals its
volume times the weight per cubic meter, or H³ · M. Our cube’s BMI equals
H · M. Our model reveals two flaws: BMI increases linearly with height,
and given that muscle weighs more than fat, fit people have higher M and
therefore higher BMIs. Height should be unrelated to obesity, and
muscularity is the opposite of fatness. These flaws remain if we make the
model more realistic. If we make a person’s depth (thickness front to back)
and width proportional to height using parameters d and w, then BMI can be written as follows: BMI = d · w · M · H. The BMIs of many NBA stars and other
athletes place them in the overweight category (BMI > 25), along with
many of the world’s top male decathletes.10 Given that even moderately
tall, physically fit people will likely have high BMIs, we should not be
surprised that a meta-analysis of nearly a hundred studies with a combined
sample size in the millions found that slightly overweight people live
longest.11
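The model's first flaw, that BMI grows linearly with height, can be illustrated with a short sketch. The body proportions and density below are illustrative values I chose, not measurements:

```python
def bmi(height, depth_frac, width_frac, density):
    """BMI of a box-shaped person: depth = depth_frac * height and
    width = width_frac * height; density in kg per cubic meter.
    Weight = height * depth * width * density; BMI = weight / height^2."""
    weight = height * (depth_frac * height) * (width_frac * height) * density
    return weight / height ** 2  # simplifies to depth_frac * width_frac * density * height

# Two people with identical build (d = 0.11, w = 0.12) and composition.
print(round(bmi(1.70, 0.11, 0.12, 1000), 1))  # about 22: "normal" range
print(round(bmi(2.00, 0.11, 0.12, 1000), 1))  # about 26: "overweight"
```

Same build, same composition, yet the taller person crosses the overweight line purely because BMI scales with height.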
Metabolic rates: We now apply our model to predict an inverse
relationship between an animal’s size and its metabolic rate. Every living
entity has a metabolism, a repeated sequence of chemical reactions that
breaks down organic matter and transforms it into energy. An organism’s
metabolic rate, measured in calories, equals the amount of energy needed to
remain alive. If we construct cubic models of a mouse and an elephant,
figure 3.2 shows that the smaller cube has a much larger ratio of surface
area to volume.
image
Figure 3.2: The Exploding Elephant
We can model the mouse and the elephant as composed of cells 1 cubic
inch in volume, each with a metabolism. Those metabolic reactions
produce heat that must dissipate through the surface of the animal. Our
mouse has a surface area of 14 square inches and a volume of 3 cubic
inches, a surface-to-volume ratio of roughly 5:1.12 For each cubic-inch cell
in its volume, the mouse has five square inches of surface area through
which it can dissipate heat. Each heat-producing cell in the elephant has
only one-fifteenth of a square inch of surface area. The mouse can dissipate
heat at seventy-five times the rate of the elephant.
For both animals to maintain the same internal temperature, the elephant
must have a slower metabolism. It does. An elephant with a mouse’s
metabolism would require 15,000 pounds of food per day. The elephant’s
cells would also produce too much heat to be dissipated through its skin. As
a result, elephants would smolder and then explode. The reason elephants
do not blow up is that they have a metabolism roughly twenty times lower
than that of mice. The model does not predict the rate at which metabolism
scales with size, only the direction. More elaborate models can explain the
scaling laws.13
Women CEOs: For our last application, we increase the exponent in the
formula and use the model to explain why so few women become CEOs. In
2016, fewer than 5% of Fortune 500 companies had women CEOs. To
become a CEO a person must receive multiple promotions. We can model
those promotion opportunities as probabilistic events: a person has some
probability of receiving a promotion. We further assume that to become
CEO, a person must be promoted at each opportunity.
We assume fifteen promotion opportunities as a benchmark, as that
corresponds to a promotion every two years on a thirty-year path to CEO.
The weight of evidence reveals modest biases in favor of men, which we
can model as men having a higher probability of being promoted.14 We
model this as a man’s probability of promotion, PM, being slightly larger
than a woman’s, PW. If we benchmark these probabilities at 50% and 40%,
respectively, then a man is nearly thirty times more likely than a woman to
become CEO.15 The model reveals how modest biases accumulate. A 10%
difference in promotion rates becomes a 30-fold bias at the top. This same
model provides a novel explanation for why a much larger percentage
(about 25%) of college and university presidents are women. Colleges and
universities have fewer administrative layers than Fortune 500 companies.
A professor can become president in as few as three promotions:
department chair, dean, and then president. Less bias accumulates over
three levels. Thus, the larger proportion of women presidents need not
imply that educational institutions are more egalitarian than corporations.
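The compounding is a one-line calculation. The sketch below uses the benchmark probabilities from the text (the function name is mine):

```python
def top_job_odds_ratio(p_m, p_w, levels):
    """Ratio of the probability that a man wins every one of `levels`
    promotions to the probability that a woman does."""
    return (p_m / p_w) ** levels

print(round(top_job_odds_ratio(0.5, 0.4, 15), 1))  # about 28: nearly thirty
print(round(top_job_odds_ratio(0.5, 0.4, 3), 1))   # about 2 over three levels
```

Fifteen layers turn a modest per-promotion bias into a nearly thirty-fold bias at the top; three layers compound it far less.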
Summary
We began the chapter by laying logical foundations for the many-model approach using the Condorcet jury theorem and the diversity prediction
theorem. We then used categorization models to show the limits of model
diversity. We saw how many models can improve our abilities to predict,
act, design, and so on. We also saw that it is not easy to come up with many
diverse models. If we could, then we could predict with near perfect
accuracy, which we know we cannot. Nevertheless, our goal will be to
construct as many useful, diverse models as possible.
In the chapters that follow, we describe a core set of models. Those models
make salient different parts of the world. They make different assumptions
about causal interactions. Through their diversity they create the potential
for productive many-model thinking. By emphasizing distinct parts of more
complex wholes, each model contributes on its own. Each also can be part
of an even more powerful ensemble of models.
As noted earlier, many-model thinking does require that we know more
than one model. However, we need not know a huge number of models, so
long as we can apply each model that we do know in multiple domains.
That will not always be easy. Successful one-to-many thinking depends on
creatively tweaking assumptions and constructing novel analogies in order
to apply a model developed for one purpose in a new context. Thus,
becoming a many-model thinker demands more than mathematical
competence; it requires creativity, as was evident in our many applications of our model of a cube.
Bagging and Many Models
Often we fit a model to a sample from an existing data set and then test that
same model against the remainder of the data. Other times we fit a model to
existing data and use that model to predict future data. This type of
modeling creates a tension: the more parameters we include in our model,
the better we can fit data and the more we risk overfitting. Good fit does
not imply a good model. Physicist Freeman Dyson tells of Enrico Fermi’s
reaction to a piece of Dyson’s research that had exceptional model fit. “In
desperation I asked Fermi whether he was not impressed by the agreement
between our calculated numbers and his measured numbers. He replied,
‘How many arbitrary parameters did you use for your calculations?’ I
thought for a moment about our cut-off procedures and said, ‘Four.’ He
said, ‘I remember my friend Johnny von Neumann used to say, with four
parameters I can fit an elephant, and with five I can make him wiggle his
trunk.’ With that, the conversation was over.”16
The estimates used to “wiggle the trunk” often include higher-order terms:
squares, cubes, and fourth powers. This introduces a risk of large errors,
because higher-order terms amplify. While 10 is twice as large as 5, 10⁴ is 16 times as large as 5⁴. The figure below shows an example of overfitting.
Overfitting and Out-of-Sample Error
The graph on the left shows (hypothetical) sales data from a company that
manufactures industrial 3-D printers as a function of the number of site
visits made (on average) per month by their sales team. It also shows a nonlinear best fit that includes terms up to the fifth
power. The graph on the right shows that the model predicts sales of 100
printers if sales visits reach 30. That cannot be correct if customers buy at
most one 3-D printer. By overfitting, the model makes a huge error out of
the sample.
To prevent overfitting, we could avoid higher-order terms. A more
sophisticated solution known as bootstrap aggregation or bagging
constructs many models. To bootstrap a data set, we create multiple data
sets of equal size by randomly drawing data points from the original data.
The points are drawn with replacement—after we draw a data point, we put
it back in the “bag” so that we might draw it again. This technique
produces a collection of data sets of equal size, each of which contains
multiple copies of some data points and no copies of others.
We then fit (nonlinear) models to each data set, resulting in multiple
models.17 We can then plot all the models on the same set of axes, creating
a spaghetti graph (see below). The dark line shows the average of the
different models.
Bootstrapping and a Spaghetti Graph
Bagging will capture robust nonlinear effects, as they will be evident in
multiple random samples of the data, while avoiding fitting idiosyncratic
patterns in any single data set. By building diversity through random
samples and then averaging the many models, bagging applies the logic
that underpins the diversity prediction theorem. It creates diverse models,
and as we know, the average of those models will be more accurate than the
models themselves.
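The procedure can be sketched in a few lines. For brevity this sketch uses a simple least-squares line as the base model rather than the fifth-degree polynomial in the figure; the function names are mine:

```python
import random

def fit_line(points):
    """Ordinary least-squares line through (x, y) points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    slope = sxy / sxx
    return lambda x: my + slope * (x - mx)

def bagged_model(data, n_models=50, seed=0):
    """Bootstrap aggregation: fit one model per bootstrap sample (drawn
    with replacement, same size as the data) and average their predictions."""
    rng = random.Random(seed)
    models = [fit_line([rng.choice(data) for _ in data])
              for _ in range(n_models)]
    return lambda x: sum(m(x) for m in models) / n_models

# Toy linear data: the bagged average recovers the underlying line.
data = [(x, 2 * x + 1) for x in range(20)]
model = bagged_model(data)
print(model(5))  # close to 11
```

Each bootstrap sample yields a slightly different model; averaging the collection is exactly the logic of the diversity prediction theorem applied to constructed, rather than found, models.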