Digifesto


The Data Processing Inequality and bounded rationality

I have long harbored the hunch that information theory, in the classic Shannon sense, and social theory are deeply linked. It has proven very difficult to find an audience for this point of view or an opportunity to work on it seriously. Shannon’s information theory is widely respected in engineering disciplines; many social theorists who are unfamiliar with it are loath to admit that something from engineering should carry essential insights for their own field. Meanwhile, engineers are rarely interested in modeling social systems.

I’ve recently discovered an opportunity to work on this problem through my dissertation work, which is about privacy engineering. Privacy is a subtle social concept but also one that has been rigorously formalized. I’m working on formal privacy theory now and have been reminded of a theorem from information theory: the Data Processing Inequality. What strikes me about this theorem is that it captures a point that comes up again and again in social and political problems, though it’s a point that’s almost never addressed head on.

The Data Processing Inequality (DPI) states that for three random variables X, Y, and Z arranged in a Markov chain X \rightarrow Y \rightarrow Z, we have I(X,Z) \leq I(X,Y), where I stands for mutual information. Mutual information is a measure of how much two random variables carry information about each other. If I(X,Y) = 0, the variables are independent. I(X,Y) \geq 0 always; that’s just a mathematical fact about how it’s defined.
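To make the inequality concrete, here is a minimal numerical check on a toy Markov chain. The channel matrices below are arbitrary illustrative choices of mine, nothing canonical:

```python
# Minimal numerical check of the Data Processing Inequality on a toy
# Markov chain X -> Y -> Z. The channel matrices are arbitrary.
import numpy as np

def mutual_information(joint):
    """I(A,B) in bits, given a joint distribution as a 2-D array."""
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (pa @ pb)[mask])))

p_x = np.array([0.5, 0.5])                     # distribution of X
p_y_given_x = np.array([[0.9, 0.1],            # noisy channel X -> Y
                        [0.2, 0.8]])
p_z_given_y = np.array([[0.7, 0.3],            # further processing Y -> Z
                        [0.4, 0.6]])

joint_xy = p_x[:, None] * p_y_given_x          # P(X, Y)
joint_xz = joint_xy @ p_z_given_y              # P(X, Z); Z depends on X only through Y

print("I(X,Y) =", mutual_information(joint_xy))
print("I(X,Z) =", mutual_information(joint_xz))   # never exceeds I(X,Y)
```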

The implications of this for psychology, social theory, and artificial intelligence are, I think, rather profound. It provides a way of thinking about bounded rationality that is simple and generalizable, something I’ve been struggling to figure out for a long time.

Suppose that there’s a big world out there, W, and within it an organism, or a person, or a sociotechnical organization, Y. The world is big and complex, which implies that it has a lot of informational entropy, H(W). Through whatever sensory apparatus is available to Y, it acquires some kind of internal sensory state. Because this organism is much smaller than the world, its entropy is much lower: there are many fewer possible states that the organism can be in, relative to the number of states of the world. H(W) \gg H(Y). This in turn bounds the mutual information between the organism and the world: I(W,Y) \leq H(Y).

Now let’s suppose the actions that the organism takes, Z, depend only on its internal state. It is an agent, reacting to its environment. Whatever these actions are, they can only be as well calibrated to the world as the agent’s capacity to absorb the world’s information allows. I.e., I(W,Z) \leq H(Y) \ll H(W). The implication is that the more limited the mental capacity of the organism, the more its actions will be approximately independent of the state of the world that precedes them.
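As a sanity check on this bound, here is a small simulation of my own construction: a world with 64 equally likely states is observed through a 4-state sensory variable Y, and the agent’s action Z depends only on Y. However the policy is chosen, I(W,Z) stays below H(Y):

```python
# A low-entropy agent acting on a high-entropy world: I(W,Z) <= H(Y) << H(W).
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def mutual_information(joint):
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (pa @ pb)[mask])))

n_world, n_sense, n_act = 64, 4, 8
p_w = np.full(n_world, 1 / n_world)                     # uniform world: H(W) = 6 bits

sensor = np.arange(n_world) % n_sense                    # Y is a coarse reading of W
rng = np.random.default_rng(0)
policy = rng.dirichlet(np.ones(n_act), size=n_sense)     # P(Z | Y), arbitrary

# P(W, Z): the action distribution depends on W only through the sensor reading.
joint_wz = np.zeros((n_world, n_act))
for w in range(n_world):
    joint_wz[w] = p_w[w] * policy[sensor[w]]

p_y = np.bincount(sensor, weights=p_w, minlength=n_sense)

print("H(W)   =", entropy(p_w))                  # 6 bits
print("H(Y)   =", entropy(p_y))                  # 2 bits
print("I(W,Z) =", mutual_information(joint_wz))  # never exceeds H(Y)
```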

There are a lot of interesting implications of this for social theory. Here are a few cases that come to mind.

I've written quite a bit here (blog links) and here (arXiv) about Bostrom’s superintelligence argument and why I’m generally not concerned with the prospect of an artificial intelligence taking over the world. My argument is that there are limits to how much an algorithm can improve itself, and these limits put a stop to exponential intelligence explosions. I’ve been criticized on the grounds that I don’t specify what the limits are, and that if the limits are high enough then maybe relative superintelligence is possible. The Data Processing Inequality gives us another tool for estimating the bounds of an intelligence based on the range of physical states it can possibly be in. How calibrated can a hegemonic agent be to the complexity of the world? It depends on the capacity of that agent to absorb information about the world; that can be measured in information entropy.

A related case is a rendering of Scott’s Seeing Like a State argument. Why is it that “high modernist” governments failed to successfully control society through scientific intervention? One reason is that the complexity of the system they were trying to manage vastly exceeded the complexity of the centralized control mechanisms. Centralized control was very blunt, causing many social problems. Arguably, behavioral targeting and big data centers today equip controlling organizations with more informational capacity (more entropy), but they still get it wrong sometimes, causing privacy violations, because they can’t model the entirety of the messy world we’re in.

The Data Processing Inequality is also helpful for explaining why the world is so messy. There are a lot of different agents in the world, and each one has only so much bandwidth for taking in information. This means that most agents are acting almost independently from each other. The guiding principle of society isn’t signal, it’s noise. That explains why there are so many disorganized, heavy-tailed distributions in social phenomena.

Importantly, if we let the world at any time slice be informed by the actions of many agents acting nearly independently of each other in the slice before, then that increases the entropy of the world. This increases the challenge for any particular agent to develop an effective controlling strategy. For this reason, we would expect the world to get more out of control the more intelligent agents are on average. The popularity of the personal computer perhaps introduced a lot more entropy into the world, distributed in an agent-by-agent way. Moreover, powerful controlling data centers may increase the world’s entropy, rather than reducing it. So even if, for example, Amazon were to try to take over the world, the existence of Baidu would be a major obstacle to its plans.
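A back-of-the-envelope illustration of this point (my construction, not a claim from any source): if the next world state is assembled from the actions of n agents acting independently, its entropy grows additively with n, quickly dwarfing the informational capacity of any single agent trying to track it.

```python
# Joint entropy of independent variables is additive: n agents, each with
# ~1.49 bits of entropy in their choice of action, jointly contribute
# n times that much entropy to the next world state.
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

action_dist = [0.5, 0.3, 0.2]          # each agent picks one of three actions
h_per_agent = entropy(action_dist)     # about 1.49 bits

for n in [1, 10, 100, 1000]:
    print(n, "agents contribute", round(n * h_per_agent, 1), "bits of world entropy")
```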

There are a lot of assumptions built into these informal arguments and I’m not wedded to any of them. But my point here is that information theory provides useful tools for thinking about agents in a complex world. There’s potential for using it for modeling sociotechnical systems and their limitations.

Economics of expertise and information services

We have now considered two models of how information affects welfare outcomes.

In the first model, inspired by an argument from Richard Posner, there are many producers (employees, in the specific example, but it could just as well be cars, etc.) and a single consumer. When the consumer knows nothing about the quality of the producers, the consumer gets an average-quality producer and the producers split the expected utility of the consumer’s purchase equally. When the consumer is informed, she benefits and so does the highest-quality producer, to the detriment of the other producers.
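A toy simulation along these lines (the uniform quality distribution and all the numbers are my own choices) makes the gap concrete:

```python
# One consumer, several producers with qualities drawn uniformly at random.
# An uninformed consumer effectively picks a producer at random; an
# informed consumer picks the best one.
import numpy as np

rng = np.random.default_rng(1)
n_producers, n_trials = 5, 100_000

qualities = rng.uniform(0, 1, size=(n_trials, n_producers))

# By symmetry, always taking the first producer is the same as a random pick.
uninformed = qualities[:, 0].mean()          # ~0.50, the average quality
informed = qualities.max(axis=1).mean()      # ~0.83, the best of five

print("uninformed consumer's expected quality:", round(uninformed, 3))
print("informed consumer's expected quality:  ", round(informed, 3))
```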

In the second model, inspired by Shapiro and Varian’s discussion of price differentiation in the sale of information goods, there is a single producer and many consumers. When the producer knows nothing about the “quality” of the consumers–their willingness to pay–the producer charges all consumers a single profit-maximizing price. This price leaves the product out of reach of many customers, while many others get a consumer surplus because the product is cheap relative to their demand. When the producer is more informed, they make more profit by selling at personalized prices. This lets the previously unreached customers in on the product at a compellingly low price. It also allows the producer to charge higher prices to willing customers; the producer captures what was once consumer surplus for themselves.
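Here is a rough sketch of that comparison, assuming an information good with zero marginal cost and uniformly distributed willingness to pay (both assumptions are mine, not Shapiro and Varian’s):

```python
# One producer of a zero-marginal-cost information good, many consumers
# with willingness to pay drawn uniformly from [0, 10]. Compare a single
# profit-maximizing price with perfectly personalized prices.
import numpy as np

rng = np.random.default_rng(2)
wtp = rng.uniform(0, 10, size=10_000)          # willingness to pay

# Uninformed producer: grid search for the best single price.
prices = np.linspace(0, 10, 1001)
profits = [p * np.sum(wtp >= p) for p in prices]
best_price = prices[int(np.argmax(profits))]
served = wtp >= best_price

print("single price:", round(best_price, 2))
print("  profit:", round(best_price * served.sum()))
print("  consumers served:", int(served.sum()), "of", wtp.size)
print("  consumer surplus:", round(np.sum(wtp[served] - best_price)))

# Fully informed producer: charge each consumer exactly their willingness
# to pay. Everyone is served, but the entire surplus goes to the producer.
print("personalized prices: profit =", round(wtp.sum()),
      "; consumer surplus = 0; consumers served =", wtp.size)
```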

In both these cases, we have assumed that there is only one kind of good in play. It can vary numerically in quality, which is measured in the same units as cost and utility.

In order to bridge from the theory of information goods to a theory of information services, we need to take into account a key feature of information services. Consumers buy information when they don’t know what it is they want, exactly. Producers of information services tailor what they provide to the specific needs of the consumers. This is true for information services like search engines but also other forms of expertise like physicians’ services, financial advising, and education. It’s notable that these last three domains are subject to data protection laws in the United States (HIPAA, GLBA, and FERPA, respectively), and on-line information services are an area where privacy and data protection are a public concern. By studying the economics of information services and expertise, we may discover what these domains have in common.

Let’s consider just a single consumer and a single producer. The consumer has a utility function \vec{x} \sim X (that is, sampled from the random variable X), specifying the value it gets for the consumption of each of the m = \vert J \vert products. We’ll denote with x_j the utility awarded to the consumer for the consumption of product j \in J.

The catch is that the consumer does not know X. What they do know is y \sim Y, which is correlated with X in some way that is unknown to them. The consumer tells the producer y, and the producer’s job is to recommend to them the j \in J that will most benefit them. We’ll assume that the producer is interested in maximizing consumer welfare in good faith because, for example, they are trying to promote their professional reputation, which is roughly in proportion to customer satisfaction. (Let’s assume they pass on the costs of providing the product to the consumer.)

As in the other cases, let’s consider first the case where the acting party has no useful information about the particular customer. In this case, the producer has to choose their recommendation \hat j based on their knowledge of the underlying probability distribution X, i.e.:

\hat j = arg \max_{j \in J} E[X_j]

where X_j is the probability distribution over x_j implied by X.

In the other extreme case, the producer has perfect information of the consumer’s utility function. They can pick the truly optimal product:

\hat j = arg \max_{j \in J} x_j

How much better off the consumer is in the second case, as opposed to the first, depends on the specifics of the distribution X. Suppose the X_j are all independent and identically distributed. Then an ignorant producer would be indifferent to the choice of \hat j, leaving the expected outcome for the consumer at E[X_j], whereas the higher the number of products m, the closer \max_{j \in J} x_j will get to the maximum value of X_j.
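A quick simulation bears this out (the uniform distribution for the X_j is an arbitrary choice of mine):

```python
# With i.i.d. X_j, an ignorant producer's pick is worth E[X_j] to the
# consumer, while a fully informed producer delivers max_j x_j, which
# climbs toward the top of the distribution as m grows.
import numpy as np

rng = np.random.default_rng(3)
n_trials = 100_000

for m in [2, 5, 20, 100]:
    x = rng.uniform(0, 1, size=(n_trials, m))    # utilities for m products
    ignorant = x[:, 0].mean()                     # any fixed pick: ~E[X_j] = 0.5
    informed = x.max(axis=1).mean()               # the truly optimal pick
    print(f"m = {m:3d}: ignorant {ignorant:.3f}, informed {informed:.3f}")
```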

In the intermediate cases, where the producer knows y, which carries partial information about \vec{x}, they can choose:

\hat j = arg \max_{j \in J} E[X_j \vert y] =

arg \max_{j \in J} \sum_{x_j} x_j P(X_j = x_j \vert y) =

arg \max_{j \in J} \sum_{x_j} x_j P(y \vert X_j = x_j) P(X_j = x_j)

where the last step applies Bayes’ rule and drops the constant factor 1/P(y), since it doesn’t affect the maximization.

The precise values of the terms here depend on the distributions X and Y. What we can know in general is that the more informative y is about x_j, the more the likelihood term P(y \vert X_j = x_j) dominates the prior P(X_j = x_j), and the better off the consumer is.
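Here is a minimal sketch of this intermediate case, using a construction of my own: the consumer has one of a few latent types, each type fixes the utility vector \vec{x}, and y is a noisy signal of the type. The producer applies Bayes’ rule and recommends the product with the highest posterior expected utility; the less noisy the signal, the better the consumer does:

```python
# The consumer has one of three latent types; each type fixes the utility
# vector over three products; y is a noisy report of the type. The producer
# computes the posterior over types from the likelihood and the prior, then
# recommends arg max_j E[X_j | y].
import numpy as np

rng = np.random.default_rng(4)

utilities = np.array([[0.9, 0.1, 0.2],     # rows: types, columns: products
                      [0.2, 0.8, 0.3],
                      [0.1, 0.3, 0.7]])
prior = np.array([0.5, 0.3, 0.2])          # P(type)

def expected_welfare(noise, n_trials=20_000):
    # Likelihood P(y | type): y reports the true type with probability 1 - noise.
    k = len(prior)
    likelihood = np.full((k, k), noise / (k - 1))
    np.fill_diagonal(likelihood, 1 - noise)

    welfare = 0.0
    for _ in range(n_trials):
        t = rng.choice(k, p=prior)                       # true consumer type
        y = rng.choice(k, p=likelihood[t])               # noisy signal
        posterior = likelihood[:, y] * prior             # Bayes' rule; P(y) cancels
        posterior /= posterior.sum()
        j_hat = int(np.argmax(posterior @ utilities))    # arg max_j E[X_j | y]
        welfare += utilities[t, j_hat]
    return welfare / n_trials

for noise in [2/3, 0.3, 0.0]:    # uninformative signal -> perfect signal
    print(f"noise {noise:.2f}: expected consumer utility {expected_welfare(noise):.3f}")
```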

Note that in this model, it is the likelihood function P(y \vert X_j = x_j) that is the special information the producer has. Knowledge of how evidence (a search query, a description of symptoms, etc.) is caused by underlying desire or need is the expertise the consumers are seeking out. This begins to tie the economics of information to theories of statistical information.