27. Multi-Armed Bandit Problems
There’s one thing I’m really good at, and that’s hitting the ball over a net,
in a box. I’m excellent.
—Serena Williams
In this chapter, we add uncertainty to the problem of learning the best
alternative to create a class of models known as multi-armed bandit
problems. In bandit problems, rewards from alternatives are distributions
rather than fixed amounts. Bandit problems apply to a wide variety of realworld situations. Any choice among actions that has an uncertain payoff—
pharmaceutical drug trials, choice of where to place advertisements, choice
among technologies, decisions as to whether to allow laptops in the
classroom—can be modeled as bandit problems; so too can the problem of
choosing a profession at which we can excel.1
A person facing a bandit problem must experiment with alternatives to
learn the payoff distributions. This feature of bandit problems creates a
trade-off between exploration (searching for the best alternative) and
exploitation (choosing the alternative that has performed best so far).
Finding an optimal balance in the explore-exploit trade-off requires
sophisticated rules and behaviors.2
The chapter consists of two parts followed by a discussion on the value of
applying models. In the first part of the chapter, we describe a special class
of Bernoulli bandit problems, in which each alternative is a Bernoulli urn
with unknown proportions of gray and white balls. We describe and
compare heuristic solutions, and then show how these solutions can
improve comparison tests of drug treatments, advertising plans, and
teaching strategies. In the second part, we describe a more general model,
where the reward distributions can take any form and the decision-maker
has a prior distribution over their types. In that part, we also show how to
solve for the Gittins index, which determines the optimal choice.
Bernoulli Bandit Problems
We begin with a subclass of bandit problems in which each alternative has
a fixed probability of producing a successful outcome. This first class of
bandit problems is equivalent to deciding among a set of Bernoulli urns,
each containing different proportions of gray and white balls. Therefore, we
refer to these as Bernoulli bandit problems. These are also called
frequentist problems because the decision-maker knows nothing about the
distributions. As the decision-maker tries alternatives (explores), she learns
about those distributions.
Bernoulli Bandit Problems
Each of a collection of alternatives {A, B, C, D,…, N} has an unknown
probability of producing a successful outcome, {pA, pB, pC, pD,…, pN}. In
each period, the decision-maker chooses an alternative, K, and receives a
successful outcome with probability pK.
For example, suppose a chimney cleaning company has a list of phone
numbers of recent home buyers. The company tests three sales pitches: the
scheduled appointment approach (“Hello, I’m calling to arrange a time for
your annual chimney cleaning”), the concerned questioner approach
(“Hello, did you know a dirty chimney can be a fire hazard?”), and the
personal touch approach (“Hello, my name is Hildy, and I started this
chimney sweeping company with my father fourteen years ago”).
Each sales pitch has an unknown probability of success. Suppose that the
company first tries the scheduled appointment approach, and it fails. It then
tries the concerned questioner approach and gains a client. That approach
also works on the next call but then fails on the next three calls. After that,
the company tries the third approach, which works on the first call but fails
on the next four. After ten calls, the second approach has the highest
success rate, but the first approach was only tried one time. The decisionmaker faces a choice between exploiting (choosing the alternative that has
worked best) or exploring (returning to the other two alternatives to get
more information). This same problem is faced by a hospital selecting
among surgical procedures and a pharmaceutical company testing various
drug protocols. Each protocol has an unknown probability of success.
To gain insight into the explore-exploit trade-off, we compare two
heuristics. The first, sample-then-greedy, tries each alternative a fixed
number of times, M, and thereafter chooses the alternative with the highest
average payoff. To determine the size of M, we can refer back to the
Bernoulli urn model and the square root rules. The standard deviation of
the mean proportion is bounded above by image. If each alternative is
tested 100 times, the standard deviation of the mean proportion will equal
5%. If we apply a two-standard-deviation rule to identify a significant
difference, we can confidently distinguish between proportions that differ
by 10%. If one alternative produced successful outcomes 70% of the time
and another produced successes 55% of the time, we could place more than
95% confidence on the first being better.
The second heuristic, an adaptive exploration rate heuristic, allocates ten
initial trials to each alternative. The next twenty trials are allocated in
proportions corresponding to the success rates. If in the first ten trials one
alternative produced six successes and the other produced only two, then
the first alternative would receive three-fourths of the next twenty trials.
The second set of twenty trials could also be allocated according to the
ratio of the squared success probabilities. If successes continued in the
same proportions, the better alternative would then receive image, or
90%, of the third set of twenty trials. For each successive set of twenty
trials the exponent of the probabilities could be increased at some rate. By
increasing the rate of exploitation over time, the second algorithm improves
on the first. If one alternative had a much higher probability of success than
another, say 80% to 10%, the algorithm would not waste a hundred trials on
the second alternative. On the other hand, if the two probabilities of success
were close, the algorithm would continue to experiment.3
Adherence to the sample-then-greedy heuristic not only is inefficient, it can
even be unethical. When Robert Bartlett tested an artificial lung, its success
rate far surpassed those of the other alternatives. Continuing to test the
other alternatives when the artificial lung performed best would have
resulted in unnecessary deaths. Bartlett stopped experimenting with the
other alternatives. Everyone was given the artificial lung. In fact, that can
be shown to be an optimal rule: if an alternative is always successful, keep
choosing that alternative. Experimentation can have no value because no
other alternative could perform better.
Bayesian Multi-Armed Bandit Problems
In a Bayesian bandit problem, the decision-maker has prior beliefs over the
reward distributions of the alternatives. Given these prior beliefs, a decision
maker can quantify the trade-off between exploration and exploitation and
(in theory) make optimal decisions in each period. However, except for the
simplest of bandit problems, determining the optimal action requires rather
tedious calculations. In real-world applications, these exact calculations
may be impractical, obliging decision makers to rely on approximations.
Bayesian Multi-Armed Bandit Problems
A collection of alternatives {A, B, C, D,…, N} have associated reward
distributions {f (A), f (B), f (C), f (D),…, f (N)}. The decision-maker has
prior beliefs over each distribution. In each period, the decision-maker
chooses an alternative, receives a reward, and calculates new beliefs based
on the reward.
Determining the optimal action relies on a four-step process. First, we
calculate the expected immediate reward from each alternative. Second, for
each alternative, we update our beliefs about the reward distribution. Third,
based on our new beliefs about reward distributions, we determine the best
possible actions in all subsequent periods based on what we know. Last, we
add the expected reward from the action in the next period to the expected
rewards from the optimal future actions. That sum is known as the Gittins
index. In each period, the optimal action has the largest Gittins index.
Notice that the calculation of the index quantifies the value of exploration.
If we try an alternative, the Gittins index does not equal the expected
reward. It equals the sum of all future rewards assuming we take optimal
actions given what we have learned. Computing a Gittins index is difficult.
For a (relatively) simple example, suppose there exists a safe alternative
that is certain to pay $500 and a risky alternative that with a 10%
probability will always pay $1,000. The remaining 90% of the time, the
risky alternative pays nothing.
To calculate the Gittins index for the risky alternative, we first ask what
could happen: either it always pays $1,000 or it always pays nothing. We
then think through how each outcome would influence our beliefs. If we
knew the risky alternative paid $1,000, we would always choose it. If we
knew that the risky arm paid nothing, we would always choose the safe arm
in the future.
It follows that the Gittins index for the risky arm corresponds to a 10%
probability of a reward of $1,000 in each period and a 90% probability of a
reward of $500 in every period but the first. For a situation in which we get
to choose an alternative many times, this averages out to approximately
$550 per period. The risky alternative is therefore the better choice.4
The Gittins Index: Example
To show how to compute Gittins indices, we consider the following
example with two alternatives. Alternative A produces a certain reward in
{0, 80} with 0 and 80 equally likely. Alternative B produces a certain
reward in {0, 60, 120} with each equally likely. We assume that the
decision-maker wants to maximize reward over ten periods.
Alternative A: With probability image the reward equals 0, so alternative
B, which has an expected reward of 60, will be chosen in all remaining
periods. This produces an expected reward of 540 (9 times 60). With
probability image the reward equals 80. The optimal choice in the second
period even with this outcome is to choose alternative B in the second
period. With probability image, B produces a reward of 120, so the total
payoff equals 1,160 (80 plus 9 times 120). With probability image, B
produces a reward of 60. In that case, alternative A is optimal choice in all
subsequent periods generating a total payoff equal to 780 (60 plus 9 times
80). Finally, with probability image, B produces a reward of 0. In this
case as well, alternative A is optimal choice in all subsequent periods. The
total payoff will 720 (9 times 80).
Combing all three possibilities, it follows that the Gittins index in period
one for alternative A equals the following:
image
Alternative B: With probability image the reward equals 120. If that
occurs, the optimal choice in all future periods will also be B. Over ten
periods the total reward will equal 1,200. If the reward equals zero, then the
optimal choice in all future periods will be alternative A, which has an
expected reward of 40. The expected total reward will equal 360 (9 times
40). If the reward equals 60, then the decision-maker could choose
alternative B in all future periods, for a total return of 600. However, if she
chooses A in the second period, half of the time it will always produce a
reward of 80, for a total return of 780 (60 plus 9 times 80). The other half
of the time it produces a reward of zero, and the optimal choice in all
subsequent periods will be B, which produces a reward of 60, resulting in a
total reward of 540 (9 times 60). It follows that the expected reward from
making optimal choices after choosing A in the second period equals 660 (
image).
Combing all the possibilities, it follows that the Gittins index in period one
for alternative B equals the following:
image
Given these calculations, alternative B is the optimal first period choice.
The optimal long run choice depends upon what is learned in the first
period. If alternative B produces an outcome of 120, we stick with B
forever.
The analysis shows that when taking an action, we care more about the
probability that an alternative will be the best than about its expected
reward. Moreover, if an alternative produces a very high reward, we should
be more likely to select it in the future. In contrast, if it produces an average
reward, even a reward above the expected reward of another alternative, we
may be less likely to stick with the alternative. That is particularly true in
early periods, where we want to look for high-reward alternatives. These
insights hold across the many applications discussed. Provided there are not
risks or high costs associated with actions, the model tells us to explore
potentially high-reward actions even if they are low probability.
Summary
A key takeaway from this book is that by learning models a person can
make better decisions. We can see that in stark relief by comparing what
people should do in the bandit problem with what people actually do. Most
people do not try to estimate a Gittins index when confronted with a bandit
problem. They fail to do so, in part, because they do not keep data. Only
recently, for example, have doctors begun to keep data on the efficacy of
the many procedures—the efficacy of the various types of artificial joints
or, say, the advantages of stents. Without that data, a doctor cannot
determine which action has the highest expected payoff.
Doctors, and everyone else, need data to apply the lessons produced by the
model. So if you wanted to learn whether taking a walk before dinner or
after dinner resulted in better sleep patterns, you would need to keep track
of how well you had slept, and by using a sophisticated heuristic, you could
learn which probably works best. That may seem like a lot of effort. And it
is, but less so now. New technology enables us to gather data on sleep
patterns, pulse rate, weight, and even mood.
Most of us will not gather the necessary data and compute Gittins indices
for life choices like when to exercise. The point is only that we could, and
if we did, we would see improvement in life choices—in our sleep patterns
and general health. Psychologist Seth Roberts performed self-study for
twelve years and found that standing at least eight hours a day improved his
sleep (though he slept less) and that standing along with getting morning
light reduced his upper respiratory infections.5 We may lack his dedication
to self-experimentation. By not keeping data and comparing outcomes, we
may go through life skipping breakfast when we would have been better off
having grapefruit.
On high-stakes business, policy, and medical decisions where data are
easier to gather, applying bandit models is common practice. Businesses,
governments, and nonprofits experiment with alternatives and then exploit
those that perform best. In practice, the alternatives may not remain fixed.
A government mailer to increase participation in a farm subsidy program
may be altered from year to year—say, swapping out a picture of a man
with a picture of a woman.6 This type of continued experimentation can be
captured by the models we take up in the next chapter: rugged-landscape
models.
Presidential Elections
We now apply three models to analyze outcomes in presidential elections:
the spatial model, the category model, and the multi-armed bandit model.
Spatial model: To attract voters, candidates compete in an ideological
issue space. Thus, we should expect candidates to tend toward moderate
positions, elections to be close, and the winning sequence of parties to be
random. Presidential elections are, with a few exceptions, close. To test if
the sequence of winners is random, we first construct a time series of the
thirty-eight winning parties from 1868 through 2016.
RRRRDRDRRRRDDRRRDDDDDRRDDRRDRRRDDRRDDR
We can then measure the block entropy of subsequences of various lengths.
Subsequences of length 1 have entropy 0.98. Subsequences of length 4
have entropy of 3.61. Statistical tests show that we cannot reject that the
sequence is random. For comparison, a random sequence of length 38 had
block 1 entropy of 1.0 and block 4 entropy of 3.58.
Category model: If we think of each state as a category and assume
heterogeneity across states, the spatial model implies that once the
candidates choose initial positions, some states will not be competitive. The
model predicts fierce competition in a handful of moderate states. In 2012,
Obama and Romney spent over 96% of their television advertising budgets
in ten states. Each spent nearly half of their advertising budgets in three
moderate states: Florida, Virginia, and Ohio. In 2016, Clinton and Trump
also spent more than half of their television dollars in three moderate states:
Florida, Ohio, and North Carolina.7
Multi-armed bandit model (retrospective voting): Voters will be more
likely to reelect a party that produces good outcomes. Voting for effective
parties corresponds to pulls of a lever that generate a high payoff. A strong
economy should benefit the incumbent party. Evidence shows that voters
are more likely to reelect the party in power when the economy performs
well. The effect is larger for the incumbent candidate than for a
nonincumbent candidate from the party in power.8