6. Power-Law Distributions: Long Tails
Every fundamental law has exceptions. But you still need the law or else all
you have is observations that don’t make sense. And that’s not science.
That’s just taking notes.
—Geoffrey West
In this chapter, we cover power-law distributions. Often described as longor heavy-tailed distributions, when graphed these distributions produce a
long tail running along the horizontal axis corresponding to large events.
The distributions of city populations, species extinctions, the number of
links on the World Wide Web, and firm sizes all have long tails, as do the
distributions of videos downloaded, books sold, academic citations, war
casualties, and floods and earthquakes. In other words, all of these
distributions include large events: Tokyo has 33 million residents, J. K.
Rowling’s Harry Potter books have sold in the neighborhood of half a
billion copies, and the great Mississippi flood of 1927 covered an area
larger than the state of West Virginia under thirty feet of water.1
Contemplating a power-law distribution of human heights reveals how
much power-law distributions differ from normal distributions. If human
heights were distributed by a power law similar to that of city populations,
and if we calibrate the mean height at 5 feet 9 inches, then the United States
would include one person the height of the Empire State Building, over
10,000 people taller than giraffes, and 180 million people less than 7 inches
tall.2
To produce a long-tailed distribution requires non-independence, often in
the form of positive feedbacks.3 Book sales, forest fires, and city
populations, unlike trips to the grocery store, are not independent. When
one person buys a Harry Potter book, she induces others to buy it. When a
single tree catches on fire, that fire can spread to neighboring trees. When a
city increases in population, it adds amenities and job opportunities,
making it more attractive to others. The sociologist Robert Merton referred
to the tendency for those who have more to also receive more as the
Matthew effect: “For unto every one that hath shall be given, and he shall
have abundance: but from him that hath not shall be taken even that which
he hath” (Matthew 25:29).
Given the variety of domains in which we find power-law distributions, it
would be remarkable if a single mechanism could explain them all, and
none does. It would be even more remarkable if each instance of a powerlaw distribution had a unique explanation. That is also not true. Instead, we
possess a collection of distinct models that produce power laws, each
capable of explaining different phenomena.
In this chapter, we focus on two models: the preferential attachment model,
which explains city sizes, book sales, and web links, and the self-organized
criticality model, which explains traffic jams and war deaths, as well as
earthquake, fire, and avalanche sizes. In Chapter 12 when we cover
entropy, we learn a third model in which a power-law maximizes
uncertainty given a fixed mean. And in Chapter 13, we show that return
times in a random walk model also satisfy a power law. Still other models
show that power laws result from optimal encodings, random stopping
rules, and combining distributions.4 The remainder of the chapter covers
the structure, logic, and functions of power-law distributions, followed by a
discussion. The discussion reconsiders the implications of large events and
describes the limits of our ability to prevent and plan for them.
Power Laws: Structure
In a power-law distribution, the probability of an event is proportional to its
size raised to a negative exponent. So for example, the familiar function
image describes a power law. In a power-law distribution, the probability
of an event is inversely related to its size: the larger the event, the less
likely it occurs. Power-law distributions, therefore, have many more small
events than large ones.
Power-Law Distributions
A power-law distribution5 defined over the interval [xmin, ∞) can be
written as follows:
p(x) = Cx-a
where the exponent a > 1 determines the length of the tail, and the
constant term image ensures the distribution has a total probability of
one.
The size of the power law’s exponent determines the likelihood and size of
large events. When the exponent equals 2, the probability of an event is
proportional to the square of its size. An event of size 100 occurs with
probability proportional to image, or 1 in 10,000. When the exponent
increases to 3, the probability of that same event is proportional to image.
For exponents of 2 or less, a power-law distribution lacks a well-defined
mean. The mean of data drawn from a power-law distribution with an
exponent of 1.5 never converges. It increases without limit.
Figure 6.1 shows an approximate graph of the distribution of the number of
links to webpages on the World Wide Web.
image
Figure 6.1: Approximate Power-Law Distribution of Webpage Links
The potential for large events distinguishes power-law distributions from
normal distributions, from which we practically never see large events. For
a long-tailed distribution, though rare, they occur at sufficient frequency to
merit attention and preparation. Even one-in-a-million events are worth
considering. For example, earthquake sizes approximately satisfy a power
law with exponent near two. Suppose that for a region an earthquake larger
than size 9.0 on the Richter scale, the size of an earthquake that topples
buildings and changes the local topography, occurs each day with a
probability of one-in-a-million. Within a century, an earthquake of that size
would occur with probability 3.5%.6
To see the difference between the probabilities of one-in-a-million events in
normal and long-tailed distributions, we can use the distribution of deaths
due to terrorist attacks, which follow a power-law distribution with an
exponent of 2.7 A one-in-a-million event consists of nearly 800 deaths. If
deaths due to terrorist attacks followed a normal distribution with mean 20
and a standard deviation of 5, a one-in-a-million event would involve fewer
than 50 deaths.
A power-law distribution has a precise definition. Not all long-tailed
distributions are power laws. Plotting a distribution on a log-log scale
creates a crude test of whether the distribution is a power law. A log-log
plot transforms event sizes and their probabilities to their logged values and
transforms a power-law distribution into a straight line.8
image
Figure 6.2: Power Law (Black) vs. Lognormal (Gray) on Log-Log Scale
In other words, a straight line on a log-log plot is evidence of a power law,
while an initially straight line that gradually falls off is consistent with a
lognormal (or an exponential) distribution. The rate at which a lognormal
distribution curves downward depends on the variation of the variables that
produce the distribution.9 As we increase the variance in a lognormal
distribution, the tail increases, making it closer to linear on a log-log plot.10
The special case of power laws with exponents equal to 2 are known as Zipf
distributions. For power laws with exponents of two, an event’s rank times
its probability will equal a constant, a regularity known as Zipf’s Law.
Words satisfy Zipf’s Law. The most common English word, the, occurs 7%
of the time. The second most common word, of, occurs 3.5% of the time.
Notice that its rank, 2, times its frequency of 3.5% equals 7%.11
Zipf’s Law
For power-law distributions with an exponent of 2 (a = 2), the rank of an
event times its size equals a constant.
Event Rank · Event Size = Constant
The populations of cities in many countries, including the United States,
are distributed approximately in this way. Using 2016 city population data,
each city’s rank multiplied by its population produces a value near 8
million.
image
Models That Produce Power Laws: Logic
We now turn to models that produce power laws. Lacking models, powerlaw distributions remain unexplained patterns.
Our first model, the preferential attachment model, assumes entities that
grow at rates relative to their proportions. It captures Merton’s Matthew
effect: more begets more. The model considers a population that grows
through arrivals. A new arrival either joins an existing entity or creates a
new one. If the latter, the probability of joining an existing entity is
proportional to the size of that entity.
Preferential Attachment Model
A sequence of objects (people) arrive one after another. The first arrival
creates an entity. Each subsequent arrival applies the following rule: With
probability p (small), the arrival forms a new entity. With probability (1−p),
the arrival joins an existing entity. The probability of joining a particular
entity equals its size divided by the number of arrivals to date.
image
Imagine students coming onto a college campus. The first student creates a
new club. With some small probability, the second student creates her own
club. More likely, she joins the first student’s club. The first ten students
might create three clubs: one with seven members, one with two members,
and one containing a single member. The eleventh arrival will, with small
probability, create a fourth club. If not, she will join an existing club. When
joining an existing club, she chooses the club with seven students 70% of
the time, the club with two students 20% of the time, and the club with one
student 10% of the time.
The preferential attachment model helps explain why the distributions of
links on the World Wide Web, city sizes, firm sizes, book sales, and
academic citations are power laws. In each setting, an action (say, a person
buys a book) increases the likelihood others will do the same. If the
probability of buying from a firm is proportional to its current market
share, and if new firms enter at a low rate, then the model predicts that the
distribution of firm sizes will be a power law. The same logic applies to
book sales, music downloads, and city growth.
Our second model, the self-organized criticality model, produces a powerlaw distribution through a process that builds interdependencies in a system
until the systems reaches a critical state. A variety of self-organized
criticality models exist. The sand pile model assumes that someone drops
grains of sand onto a table from a spot several feet above. As the grains
accumulate, a pile forms. Eventually, the pile attains a critical state where
additional grains can cause avalanches. At this critical state, additional
grains often have no effect or cause at most a few grains to fall. These are
the many small events in a power-law distribution. Sometimes the addition
of a single grain results in a large avalanche. These are the large events.
A second model, the forest fire model, assumes a two-dimensional grid on
which trees can grow. Trees can also be hit by random lightning strikes.
When the density of the trees is low, any fires caused by lightning will be
small, affecting at most a few cells. When the density of trees is high, a fire
started by a lightning strike will spread across much of the grid.
Self-Organized Criticality: Forest Fire Model
The forest consists initially of an empty N by N grid. Each period a random
site on the grid is chosen. If empty, with probability g the site grows a tree.
If the site contains a tree, with probability (1 − g) lightning hits the site. If
the site contains a tree, the tree catches fire, and the fire spreads to all
connected sites with trees.
Notice that in the forest fire model, the probability of a lightning strike
equals one minus the probability of the growth rate. This construction
allows us to vary the relative rate of growth and lightning. It is a
simplification that reduces the number of parameters in our model.
Experimenting with the growth rate of trees, we find that for growth rates
close to one, the density of trees increases to a critical state: a relatively
dense forest of trees, where lightning strikes can wipe out a huge swath of
forest. At this critical state, the distribution of the sizes of patches in the
forest, and therefore the size of fires, satisfies a power-law distribution.
Moreover, the forest naturally tends to this density level. If it is less dense,
density increases because fires are small. If density exceeds the threshold,
any fire will wipe out the entire forest. Therefore, the tree density selforganizes to a critical state.12
In both the sand pile model and the forest fire model, a macro-level
variable—the height of the pile or the density of the forest—has a critical
value. That macro-level variable’s value decreases when events occur
(avalanches and fires). Variants of this model can explain the distributions
of solar flares, earthquakes, and traffic jams. An increasing macro-level
variable that decreases when events occur, though necessary, is not
sufficient for self-organized criticality. Equilibrium systems also have that
property. Water flows into and out of lakes through streams, yet because
outflows are smooth, lake levels change gradually. The key assumptions for
self-organization to critical states is that pressure increases smoothly, like
water flowing into the lake, and that pressure decreases in bursts, including
possibly large events.
The Implications of Long Tails
We cover three implications of long-tailed distributions: their effect on
equity, catastrophes, and volatility. By definition, a long tail means a few
big winners (large collapses, earthquakes, fires, and traffic jams) and many
losers as compared to a normal distribution, which is symmetric about a
mean. Long-tailed distributions can also contribute to volatility, as random
fluctuations in larger entities will have larger effects.
Equity
A person who writes a better book, catchier song, or better academic paper
than another should garner more sales and credit. It is not equitable if a
person who performs only a little better or who happens to be lucky earns a
lot more. As we saw in the preferential attachment model, positive
feedbacks create big winners due to the Matthew effect. For positive
feedbacks to occur in a market, people must know what others buy, and
people must be able to buy the product. For weightless information goods,
such as smartphone applications, the latter assumption makes perfect sense.
For an iPhone application, no production constraints slow the positive
feedbacks as they do for, say, trucks. Ford can only increase production of
F-150 trucks by so much. In contrast, Intuit can sell as many copies of
TurboTax as people are willing to download.
Empirical studies show that social effects create bigger winners. In the
music lab experiments, college students could sample and download songs.
In the first treatment, subjects did not know what songs others downloaded,
and the distributions of downloads had a shorter tail—no song received
more than two hundred downloads and only one song received fewer than
thirty. In a second treatment, students knew what others downloaded. The
tail of the distribution grew: one song received more than three hundred
downloads. Perhaps more telling, over half received fewer than thirty. The
tail became longer. Social influence increased inequality. This inequality is
not a concern if social influence leads people to download better songs.
However, correlations between downloads in the two treatments were not
strong. If we interpret the number of downloads of a song in the first
treatment as a proxy for the song’s quality, social influence did not result in
people downloading better songs. The big winners were not random, but
they were not the best.13
We must be careful not to draw too strong an inference from a single study.
We can, though, infer that while an author who sells 50 million books or an
academic whose work receives 200,000 citations deserves accolades, such
extreme success suggests that the central limit theorem is not holding.
People are not buying books or making citations independently. Amazing
success probably implies positive feedbacks, and perhaps a bit of luck. We
return to these ideas when discussing the causes of income inequality in the
book’s final chapter.14
Catastrophes
Long-tailed distributions include catastrophic events: earthquakes, fires,
financial collapses, and traffic jams. Even though the models cannot predict
earthquakes, they provide insight into why their distribution satisfies a
power law. That knowledge tells us the likelihood of earthquakes of various
sizes. We know what to expect, if not when.15
The forest fire model does guide action. We can prevent large fires by
selectively harvesting trees in a forest to lower the density of trees. Or we
might build fire-breaks. One could argue that we do not need a model to tell
us to thin a forest or build firebreaks. That is surely true. The model makes
us aware that there exists a critical density. That density may vary by forest.
It could depend on the type of tree, the prevailing wind speeds, and the
topography. The model explains why forests may self-organize to critical
states.
We can also use the model as an analogy. Recall that in Chapter 1 we
discussed the failures of financial institutions across networks. We can
apply the forest fire model to that setting by representing banks and other
financial institutions as trees on a checkerboard and allowing adjacencies to
correspond to outstanding loans. In that model, a bank failure would be
equivalent to a tree catching on fire. That failure could then spread to
neighboring banks.
This naive application of the forest fire model to banks would portend
large-scale failures as banks become more connected. As we explore that
analogy, we see four shortcomings. First, the financial network is not
embedded in physical space. Banks can differ in their number of
connections. One bank may have dozens of financial obligations while
another may have a mere one or two. Second, trees in a forest cannot take
actions to reduce the probability of fire spreading. Banks can. They can
increase their level of reserves. Third, the more connected a bank is, the
less likely that its failure spreads as its losses will be dispersed across more
banks. For example, if a bank defaults on a $100,000 loan borrowed from a
single other bank, that second bank may well go under. If the first bank
borrowed the money from a consortium of twenty-five other banks, no
single bank takes a large hit. The systems may well absorb the default
without collapsing.16 Last, the spread of a failure from one bank to another
depends on the banks’ portfolios. If two connected banks hold similar
portfolios, then if one fails the other probably is likely already weak. The
worst-case scenario occurs if all of the banks in the network hold identical
portfolios. In this case, when one bank fails, widespread failure would be
likely.17 If, though, each bank holds a distinct portfolio, poor performance
by one need not imply poor performance of another. Bank failures may not
spread. A useful model must therefore take into account the assets in the
various portfolios. Without this information, knowing which banks have
obligations to other banks will be insufficient to predict or prevent failures,
and the net effect of greater connectedness of banks will not be clear.
Volatility
Last, we consider a more subtle implication of long-tailed distributions. If
the entities that make up a power-law distribution fluctuate in size, then the
exponent of the power law becomes a proxy for system-level volatility. It
follows that the firm size distribution should influence market volatility.
For this exercise, think of a country’s gross domestic product (GDP) as the
aggregate production of thousands of firms. If production levels are
independent and have finite variation, then, by the central limit theorem,
the distribution of GDP will have a normal distribution. It also follows that
the greater the variation in production levels across firms, the greater the
aggregate volatility. If a longer-tailed distribution of firm sizes produces
greater variation in production levels, then it will also correlate with greater
aggregate volatility.
An examination of volatility patterns in the United States shows that
volatility rose in the 1970s and 1980s and then fell for the next two decades
in what some call the Great Moderation.18 Beginning around 2000,
volatility again increased. It is possible to explain these volatility patterns
by changes in the distribution of firm sizes.19 As the distribution of firm
sizes becomes longer- (shorter-)tailed, the largest firms have a
disproportionally larger (smaller) effect on volatility. In other words,
aggregate volatility increases (decreases) as the firm size distribution
becomes longer-(shorter-)tailed. In 1995, when volatility was low, Walmart
had revenues of $90 billion, which corresponded to 1.2% of GDP. By 2016,
Walmart’s revenues had increased to $480 billion, or 2.6% of GDP.
Walmart’s share of GDP more than doubled. In 2016, an increase or
decrease in Walmart’s revenue would contribute twice as much to
aggregate volatility.
No one refutes the logic of this argument. The relevant question becomes
whether a calibrated model produces effects with magnitudes that
correspond to actual volatility levels. The calibrated fit proves quite close.
Firm size distributions correlate nicely with the historical evidence of the
Great Moderation. That correlation does not prove that it is changes in firm
size distribution (instead of effective government management of the
economy or better inventory control) that caused the moderation, but it
does prevent us from rejecting the model.20 The evidence also provides
reason to keep this model in our quiver when we evaluate fluctuations in
the future.
Contemplating a Long-Tailed World
In long-tailed distributions, large events occur with sufficient probability to
be of concern. In the models we covered, long-tailed distributions arise
because of feedbacks and interdependencies. We should pay heed to that
observation. As our world becomes more interconnected and feedbacks
increase, we should see more long tails. And the current long tails that we
see may get stretched even further. Inequities may increase, catastrophes
grow larger, and volatility become more pronounced. None of these is
desirable.
So far, we have discussed these possibilities at macro levels. They also
occur at smaller scales. Boston’s “Big Dig,” a three-and-a-half-mile tunnel
through the center of the city, provides an example of a moderate-scale
catastrophe. The project cost taxpayers $14 billion, more than three times
the original estimate, and it became the most expensive highway project in
the history of the United States. Model thinking frames the Big Dig not as a
single project but as an aggregate of subprojects: digging a trench, pouring
a concrete tunnel, engineering a drainage system, and building walls and a
roof. The project’s total cost equals the sum of the subprojects’ costs.
If the costs of each subproject had been additive, then the distribution of
costs for the project would have been normally distributed.21 However, the
subprojects’ costs were connected. When the epoxy used to glue the roof
into place proved inadequate, it was replaced with a costlier, stronger epoxy
and, therefore, raised the cost of the project. The failure of the first epoxy
created additional costs associated with removing and replacing the
collapsed roof. Those efforts in turn required redoing several other parts of
the project. Overall costs more than doubled because each project had to be
undone and then redone. Interdependencies led to a large, and costly, event.
The potential for large events makes planning difficult. The distribution of
natural disasters such as earthquakes satisfy a power law. Thus, most events
will be small, but some will be large. If catastrophic events follow a powerlaw distribution with an exponent near 2, then governments need to keep a
very large amount of money in reserve or at least at the ready. They need to
prepare for a very rainy day. If governments do so by maintaining huge
surpluses in an emergency fund, they may be able to stop themselves from
spending that money or cutting taxes if no large event occurs.
Search and Opportunity
We can apply our knowledge of distributions within a class of search
models to explain why the number of opportunities a person receives may
correlate strongly with success. We embed one class of models, our
distribution models, within a second class, search models. When we search,
whether for a new pair of shoes, a career, or a vacation spot, we do not
know our choice’s value until we try it, though we may know something
about the distribution of values, such as the mean, standard deviation, and
whether the distribution is normal or has a long tail.
Here, we model choice of profession as a search process. Given a
profession, a person tries a career path, which we model as a draw from a
distribution. We assume that she can either stick with that career or try
again. Trying again corresponds to another draw from the distribution.
Consider, for example, the choice of profession for a talented young
scientist. She could go to medical school or do research in quantum
computing. Medical school offers the safer path. Choosing to work on
quantum computing involves becoming an entrepreneur and taking on more
risk. To account for these differences, we represent the salary distribution
for doctors as a normal distribution with mean $250,000 and a standard
deviation of $25,000, and the salary distribution for the entrepreneurial
career as a power law with an exponent of 3 with an expected salary of
$200,000.22
Within each profession, our scientist can try multiple careers. She can
search. A doctor can switch from oncology to radiology. A failed
entrepreneur can pick up the pieces from her start-up and try anew. Each
career switch entails a cost. For a doctor, it means more training. For an
entrepreneur in quantum computing, it means more long nights of work
with little to no compensation.
We assume that our young scientist finds the two professions equally
stimulating and makes her choice based on salary. Our model reveals that
the better choice depends on how many times she can afford to try new
careers. If she must stick with her first career choice, becoming a doctor
offers the higher expected salary. If she has sufficient resources to continue
trying to be an entrepreneur, eventually she will get a high-paying draw
from the long tail. The figure below shows the average largest salary across
twenty trials assuming one, two, five, and ten career searches within each
profession. If she has the opportunity to try her hand at quantum computing
start-ups ten times, her salary will be nearly double what she would earn
had she chosen medical school and experimented with ten careers.
image
Average Income as a Function of Number of Opportunities
If wealth and family support correlate with the number of opportunities a
person has to try new careers, our model predicts that wealthier people will
choose riskier professions.23 Evidence on patents aligns with the model.
The probability that someone writes a patent correlates with that person’s
mathematical abilities. People in the top 1% of math ability are far more
likely to hold a patent. Among the top 1%, those from families in the top
10% of the income distribution are even more likely to hold a patent.24 At
least two models could explain the disparity. One model could assume that
poorer talented students never attend college. They may be working routine
jobs and never have the choice between medical school or quantum
computing. Or, perhaps poorer students choose safer careers.
The logic that an increase in opportunities creates an incentive for risk
applies widely. Venture capitalists take risks because they make multiple
investments. An early investment in a single unicorn, a billion-dollar
company, more than compensates for many losers. Pharmaceutical research
laboratories also take risks, spending billions on drug research. We can
apply the same logic when deciding where to eat lunch. When driving
cross-country and stopping in an unfamiliar town, we may want to eat at a
chain restaurant. If moving to that town, we should experiment.