12. Entropy: Modeling Uncertainty

Information is the resolution of uncertainty.
—Claude Shannon

In this chapter, we introduce entropy, a formal measure of uncertainty. With it, we can show an equivalence between uncertainty, information content, and surprise. Low entropy corresponds to low uncertainty and little information being revealed. When an outcome occurs in a low-entropy system, such as the sun rising in the east, we experience little surprise. In high-entropy systems, say the drawing of numbers in a lottery, the outcomes are uncertain, and when they are realized, they reveal information. We experience surprise.

Using entropy, we can compare disparate phenomena. We can say whether election outcomes in New Zealand are more uncertain than outcomes of United Nations votes on censure. We can compare the uncertainty of stock prices to the uncertainty of outcomes of sporting events. We can also use entropy to distinguish between the four classes of outcomes: equilibrium, periodicity, complexity, and randomness. We can distinguish complex patterns that appear random from true randomness and discern whether what appears to be a pattern is, in fact, random.

We can also use entropy to characterize distributions. In the absence of a controlling or regulating force, some populations may drift toward maximal entropy. Given constraints, such as a fixed mean or variance, we can solve for maximum entropy distributions. Maximal entropy results can also guide our modeling choices by justifying some distributions over others.

The chapter has five parts. In the first part, we provide intuition for and define information entropy. In the second part, we describe Shannon's axiomatic foundations for a general class of entropy measures. In the third part, we discuss how to use entropy to distinguish equilibrium, order, randomness, and complexity. In the fourth part, we investigate systems that produce maximal entropy given constraints. We conclude by discussing why, sometimes, we prefer complexity to equilibrium.

Information Entropy

Entropy measures the uncertainty associated with a probability distribution over outcomes. It therefore also measures surprise. Entropy differs from variance, which measures the dispersion of a set or distribution of numerical values. Uncertainty correlates with dispersion, but the two differ. Distributions with high uncertainty place nontrivial probabilities on many outcomes. Those outcomes need not have numerical values. Distributions with high dispersion take on extreme numerical values.

The distinction can be seen in stark relief by comparing a distribution that has maximal entropy with one that has maximal variance. Given outcomes that take values 1 through 8, the distribution that maximizes entropy places equal weight on each outcome.1 The distribution that maximizes variance takes value 1 with probability $\frac{1}{2}$ and value 8 with probability $\frac{1}{2}$, as shown in figure 12.1.

[Figure 12.1: Maximal Entropy and Maximal Variance]

Entropy is defined over probability distributions. It can therefore be applied to distributions over nonnumerical data such as the species of birds in a forest or the market shares of flavors of jam. The formal expression for entropy is written as minus the sum of products of probabilities and their logarithms. That sounds complicated, but it will become intuitive. We begin with the special case of information entropy, which measures uncertainty in terms of the number of random flips of a fair coin.
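Before turning to that special case, here is a minimal sketch in Python (our own illustration, not from the text) of the entropy-variance contrast in figure 12.1: it computes both quantities for the two distributions over the values 1 through 8.

```python
import math

def entropy_bits(probs):
    """Information entropy in bits: -sum of p * log2(p), skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def variance(values, probs):
    """Variance of a discrete distribution over numerical values."""
    mean = sum(v * p for v, p in zip(values, probs))
    return sum(p * (v - mean) ** 2 for v, p in zip(values, probs))

values = list(range(1, 9))                            # outcomes take values 1 through 8
max_entropy_dist = [1 / 8] * 8                        # equal weight on each value
max_variance_dist = [1 / 2, 0, 0, 0, 0, 0, 0, 1 / 2]  # all weight on the two extremes

print(entropy_bits(max_entropy_dist), variance(values, max_entropy_dist))    # 3.0 bits, variance 5.25
print(entropy_bits(max_variance_dist), variance(values, max_variance_dist))  # 1.0 bit, variance 12.25
```

The equal-weight distribution has the higher entropy (3 bits versus 1) but the lower variance (5.25 versus 12.25), which is the distinction the figure illustrates.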
Suppose that every family has exactly two children and that boys and girls are equally likely. The sexes of a family's children (listed by birth order) are equivalent to two coin flips. The distribution over outcomes therefore has an information entropy of 2 because it corresponds to 2 random events. The information content also equals 2 because we could learn the outcomes by asking 2 yes-or-no questions. Similarly, the sexes of the children in families of size three are equivalent to 3 coin flips. To learn about a family's children, we would need to ask 3 questions. The same logic applies for any number of children. In the general case, to learn the sexes of N children, we would need to ask N questions. Notice that those N questions distinguish among $2^N$ possible birth orders. That mathematical relationship is the key to understanding the entropy measure: N binary random events produce $2^N$ possible outcome sequences, and, equivalently, we could learn the outcome sequence by asking N questions. For this reason, information entropy assigns an uncertainty level (and an information content) of N to an equal distribution over $2^N$ outcomes.

To capture that relationship in formal mathematics, we first note that each of the outcome sequences has a probability of $\frac{1}{2^N}$. To convert this to N requires the rather complicated expression $-\log_2\left(\frac{1}{2^N}\right) = N$.2 We can generalize this construction to arbitrary probabilities. If an outcome sequence arises with probability p, then we assign it an uncertainty of $-\log_2(p)$, which approximates the number of yes-or-no questions required to identify the sequence. To compute the information entropy of a distribution, we average the expected number of questions across all outcomes, or, as in the example, sequences of outcomes.

Information Entropy

Given a probability distribution $(p_1, p_2, \ldots, p_N)$, the information entropy, $H_2$, equals:

$$H_2(p_1, p_2, \ldots, p_N) = -\sum_{i=1}^{N} p_i \log_2(p_i)$$

Note: the subscript 2 denotes the use of the base 2 logarithm.

At first, the mathematical representation complicates more than it clarifies. Working through an example makes the formula more intuitive. Imagine that families who first have a girl stop having children, and that families who first have a boy have two more children. Half of all families will have a single girl. The other half will be split evenly among four outcomes: three boys, two boys followed by a girl, a boy followed by two girls, and a boy followed by a girl followed by another boy. Each of those four outcomes occurs with probability $\frac{1}{8}$.

Information entropy equals the expected number of questions we must ask to learn the family's children. We would first ask if the first child is a girl. With probability $\frac{1}{2}$ the answer is yes, and we need not ask more questions. Thus, half of the time, we ask one question. We can write this as $\frac{1}{2} \times 1 = -\frac{1}{2}\log_2\left(\frac{1}{2}\right)$. If the answer is no, we must ask two more questions for a total of three questions. Each of those four cases occurs with probability $\frac{1}{8}$, so each contributes $\frac{1}{8} \times 3$ to information entropy. We write each as $-\frac{1}{8}\log_2\left(\frac{1}{8}\right) = \frac{3}{8}$. Information entropy equals 2, the sum of the five terms.3

Notation and logarithms aside, the intuition should be clear: information entropy corresponds to the expected number of yes-or-no questions. If we have to ask a lot of questions, the distribution is uncertain. Knowing the outcome reveals information.
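The family calculation is easy to verify numerically. The short sketch below (our own illustration, not from the text) applies the information entropy formula to the distribution from the example: a single girl with probability 1/2 and each of the four boy-first sequences with probability 1/8.

```python
import math

def information_entropy(probs):
    """H2 = -sum of p * log2(p), in bits, skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# G with probability 1/2; BBB, BBG, BGG, BGB each with probability 1/8
family_distribution = [1 / 2, 1 / 8, 1 / 8, 1 / 8, 1 / 8]
print(information_entropy(family_distribution))  # 2.0, the expected number of yes-or-no questions
```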
Axiomatic Foundations of Entropy

Axiomatic Foundations: Entropy

$$H(p_1, p_2, \ldots, p_N) = -A \sum_{i=1}^{N} p_i \log_2(p_i), \qquad A > 0$$

The above class of entropy measures uniquely satisfies the following four axioms:

Symmetric, continuous function: $H(\sigma(p_1, \ldots, p_N)) = H(p_1, \ldots, p_N)$ for any $\sigma$ that permutes the probabilities.

Maximization: $H(p_1, \ldots, p_N)$ is maximized at $p_i = \frac{1}{N}$ for all N.

Zero Property: $H(1, 0, 0, \ldots, 0) = 0$.

Decomposability: If $p = (p_{11}, p_{12}, \ldots, p_{NM})$ where $p_{ij} = q_i \cdot r_j$ for distributions $q = (q_1, \ldots, q_N)$ and $r = (r_1, \ldots, r_M)$, then $H(p) = H(q) + H(r)$.

To arrive at a general expression for entropy, we take an axiomatic approach. Claude Shannon imposed four conditions on his measure. The first three are easy to understand. It needed to be continuous and symmetric, maximized when outcomes occur with equal probability, and equal to zero when the outcome is certain. The fourth condition (decomposability) requires that the entropy of a probability distribution defined over n categories each with m subcategories equals the entropy of the distribution over the categories plus the sum of the entropies of each of the subcategories. This is a natural assumption for products of distributions. For example, in the case where outcomes are the product of two independent events, the assumption implies that the information content of the joint event equals the sum of the information contents of each event separately. Shannon then proved that a general class of entropy measures uniquely satisfies those axioms.

As was the case for the axioms that characterize Shapley values, the contribution of these axioms resides less in their existence than in their reasonableness. A clever mathematician can always construct axioms that uniquely define a function. The first two axioms are difficult to question. We might quibble with the arbitrariness of setting the uncertainty of a known distribution at zero, but it is an appropriate benchmark. Another possibility would be to assign 1 as the uncertainty of a known distribution.4 The decomposability axiom, though complicated to explain, is also difficult to challenge. The uncertainty of two combined independent random events should equal the sum of the uncertainties of each event. Overall, the axioms are more than defensible. They are, in fact, hard to dispute.

Using Entropy to Distinguish Classes of Outcomes

We now show how the entropy measure can help us to categorize empirical data and model output within Wolfram's four classes: equilibrium, cyclic (periodic), random, and complex.5 In Wolfram's classification, a pencil resting on a desk is in equilibrium. The planets orbiting the sun are in a cycle. A sequence of coin flips is random; so are (approximately) stock prices on the New York Stock Exchange, as we shall learn in the next chapter. Finally, the neuronal firings in a person's brain are complex: neurons do not fire randomly, nor do they fire in a fixed pattern. Figure 12.2 represents these four categories graphically. Equilibrium outcomes have no uncertainty and therefore have an entropy equal to zero. Cyclic (or periodic) processes have low entropy that does not change with time, and perfectly random processes have maximal entropy. Complexity has intermediate entropy; it lies between ordered and random. While entropy gives us a definitive answer in the two extreme cases, equilibrium and random, it does not for cyclic and complex outcomes. We will have to use other measures to distinguish those cases.

[Figure 12.2: Wolfram's Four Classes]

To classify a time series of data, we calculate the information entropy across subsequences of different lengths.
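A minimal sketch of that calculation appears below (our own code, not from the text; the two example series and the random seed are arbitrary). It estimates the entropy, in bits, of the length-k subsequences of a binary series, comparing a random series with a periodic one; the example that follows applies the same idea.

```python
import math
import random
from collections import Counter

def subsequence_entropy(series, k):
    """Empirical entropy (in bits) of the distribution of length-k subsequences."""
    windows = [tuple(series[i:i + k]) for i in range(len(series) - k + 1)]
    counts = Counter(windows)
    total = len(windows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

random.seed(0)
random_series = [random.choice("AB") for _ in range(365)]  # independent 50/50 draws
cyclic_series = (["A", "A", "B"] * 122)[:365]              # a period-3 cycle

for k in range(1, 7):
    print(k,
          round(subsequence_entropy(random_series, k), 2),  # grows by roughly 1 bit per extra symbol
          round(subsequence_entropy(cyclic_series, k), 2))  # levels off near log2(3), about 1.58, once k >= 2
```

For the random series the subsequence entropy keeps growing with k (up to sampling error), while for the periodic series it flattens out, which is the signature we exploit in the classification below.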
Suppose that a man keeps track of the type of hat he wears each day: either a beret (B) or a fedora (F). His choices over a year create a binary time series of 365 events. We can first calculate the entropy of sequences of length 1, that is, the entropy of the distribution over the two types of hat. If we find that he is equally likely to wear each type of hat, the entropy over sequences of length 1 equals 1. We can therefore rule out equilibrium, as he changes his choices, but any of the other three categories remains possible. To determine the category, we next compute the entropy of sequences of length 2 through 6. If all have maximal entropy, the series is likely random, and we can rule out a simple cycle. Suppose instead that as we consider longer sequences the entropy increases slowly until it reaches a maximum of 8. In other words, no matter how long the subsequence, the entropy never exceeds 8. An entropy of 8 is equivalent to an equal distribution across 256 outcomes. That cannot be a simple cycle. It is more representative of a complex sequence containing structure and patterns. We cannot say for sure that the time series is complex. It might be that the person is trying to be random, yet fails.

Maximal Entropy and Distributional Assumptions

Many of the situations that we model include uncertainty, and, as modelers, we must make assumptions about the relevant distributions. As a rule, we want to avoid making ad hoc assumptions. It may be that we have some understanding of the process that produces the distribution. If so, we can often derive the statistical structure produced by that process using our logic-structure-function approach. For example, suppose that we want to make an assumption about the distribution of the total value of the items up for auction at an estate sale. The total value equals the sum of the values of the individual items. We can therefore invoke the central limit theorem and assume a normal distribution. We might also assume a normal distribution for the possible values of a house, as the house's value depends on the sum of the values of its attributes: the number of bedrooms and bathrooms and the size of the lot.

A normal distribution may not make sense for the possible values for a piece of art or a rare manuscript. In those cases, we may have little understanding of the process that determines value. One approach is to assume a distribution with maximal uncertainty, that is, the maximal entropy distribution. The shape of the maximal entropy distribution depends on the constraints. As we have already seen, if we assume a minimal and maximal value, the uniform distribution maximizes entropy. Many social science models in textbooks and journals assume uniform distributions. We might question that assumption on the grounds that few distributions in the real world are uniform. However, a principle of indifference, applied when we know nothing other than the range or set of possibilities, can justify the uniform distribution.

In some cases, we may know the mean of the distribution and also know that all values must be positive. Given those constraints, the maximal entropy distribution must have a long tail, and as we spread the distribution across more values, we must balance high values with many low-value outcomes. It can be shown that the entropy-maximizing distribution will be an exponential distribution. Thus, if we are writing a model that assumes a distribution of website hits or market shares, then in the absence of data an exponential distribution is a natural assumption.
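That property of the exponential distribution can be checked numerically. The sketch below (our own illustration, not from the text) uses scipy.stats to compare the differential entropies, reported in nats rather than bits, of a few positive-support distributions that share the same mean; the particular comparison distributions are arbitrary choices.

```python
from scipy import stats

mu = 4.0  # a common mean for all three candidate distributions

candidates = {
    "exponential, mean mu": stats.expon(scale=mu),
    "uniform on [0, 2*mu]": stats.uniform(loc=0, scale=2 * mu),
    "gamma, shape 2, mean mu": stats.gamma(a=2, scale=mu / 2),
}

for name, dist in candidates.items():
    print(f"{name:24s} mean = {float(dist.mean()):.1f}  entropy = {float(dist.entropy()):.3f} nats")
# The exponential's entropy, 1 + ln(mu), is about 2.386 nats, which exceeds
# the uniform's (about 2.079) and the gamma's (about 2.270).
```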
Finally, if we fix the mean and the variance (and allow negative values), then the maximal entropy distribution is the normal distribution. The logic here is similar to the previous case. To create more uncertainty, we create extreme values. Here we can balance positive and negative values and not change the mean. However, doing so increases the variance, so we must add more values near the mean, resulting in a bell curve.

We can interpret these maximal entropy distributions within the logic-structure-function framework. If we thought that in a given social, biological, or physical context a micro-level process was maximizing entropy, then we should expect one of these distributions. Alternatively, we might assume a micro-level process and be able to show that entropy increases. If so, one of these distributions would emerge.

Maximal Entropy Distributions

Uniform distribution: Maximizes entropy given a range, [a, b].

Exponential distribution: Maximizes entropy given a mean, μ.

Normal distribution: Maximizes entropy given a mean, μ, and a variance, σ².

We can also interpret these results as exploratory. We may encounter data that is exponentially or normally distributed. Though we are not obliged to ask if some underlying behavior is increasing entropy subject to a constraint, we might gain a novel insight by doing so. Previously, we explained the normal distribution of heights, weights, and lengths of species by an appeal to the central limit theorem. Here we present a different, model-based explanation. If mutation maximizes entropy (to best explore niches), and if average size and total dispersion are fixed, then the distribution of sizes will be normal. The point is not that the maximal entropy approach offers a better explanation, but that maximizing entropy given constraints results in a normal distribution. So, when we see a normal distribution, it could be the result of entropy maximization.

Positive and Normative Implications of Entropy

We have seen how entropy measures uncertainty, information, and surprise, how it differs from variance, which measures dispersion, and how it can help us classify and compare classes of outcomes. Later, in Chapters 13 and 14, when we study random walks and path dependence, we use entropy to identify randomness and to measure the extent of path dependence. We can put the entropy measure to use in any number of real-world applications. We can measure whether an intervention in financial markets increases or decreases uncertainty. We can test whether or not outcomes in elections, sporting events, or games of chance are random. In each of these applications, entropy functions as a positive measure. It tells us what the world is, not what it should be.

Entropy in a system is not intrinsically bad or good. How much entropy we desire depends on the situation. In constructing a tax code, we might want an equilibrium pattern of behaviors. We would not want randomness. In designing a city, we may seek complexity. Equilibrium or even cycles would be dull. We would prefer a city to be teeming with life, to offer opportunities for fortuitous meetings and interactions. More entropy would be better, but only to a point. We would not want randomness. Randomness would make planning difficult and possibly overwhelm our cognitive abilities. Ideally, the world produces some complexity and we live in interesting times.
The architect Christopher Alexander shows how geometric properties such as strong centers, thick boundaries, and non-separateness can produce complex, living buildings, neighborhoods, and cities.6 Alexander argues for complexity in cities and in living space. Central bankers may be less fond of complexity. They may prefer predictable equilibrium outcomes and stable growth paths.

A central takeaway from this chapter is that we often care whether a system goes to equilibrium, produces a pattern or randomness, or results in complex, novel sequences of patterns. By using models, we can perhaps see which will arise and, in some cases, design systems that produce the class of outcome we desire, whether that be complexity or equilibrium.