## Tag: Bayesian updating

### Information flow in economics

We have formalized three different cases of information economics:

What we discovered is that each of these cases has, to some extent, a common form. That form is this:

There is a random variable of interest, $x \sim X$ (that is, a value $x$ sampled from a probability distribution $X$), that has a direct effect on the welfare outcome of decisions made by agents in the economy. In our cases this was the aptitude of job applicants, consumers' willingness to pay, and the utility of receiving a range of different expert recommendations, respectively.

In the extreme cases, the agent at the focus of the economic model could act with extreme ignorance of $x$, or extreme knowledge of it. Generally, the agent’s situation improves the more knowledgeable they are about $x$. The outcomes for the subjects of $X$ vary more widely.

We also considered the possibility that the agent has access to partial information about $X$ through the observation of a different variable $y \sim Y$. Upon observation of $y$, they can make their judgments based on an improved subjective expectation of the unknown variable, $P(x \vert y)$. We assumed that the agent was a Bayesian reasoner and so capable of internalizing evidence according to Bayes' rule; hence they are able to compute:

$P(X \vert Y) \propto P(Y \vert X) P(X)$
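As a minimal sketch of this update for discrete variables, the following Python computes $P(x \vert y)$ from a prior and a likelihood. The function name `bayes_update`, the aptitude labels, and all of the numbers are hypothetical, chosen only to illustrate the hiring case.

```python
def bayes_update(prior, likelihood, y):
    """Return P(x | y) for discrete x.

    prior:      dict mapping x -> P(x)
    likelihood: dict mapping x -> dict mapping y -> P(y | x)
    """
    # Unnormalized posterior: P(y | x) * P(x) for each x.
    unnormalized = {x: likelihood[x][y] * p for x, p in prior.items()}
    evidence = sum(unnormalized.values())  # P(y)
    if evidence == 0:
        raise ValueError("observation has zero probability under the model")
    return {x: v / evidence for x, v in unnormalized.items()}


# Hypothetical numbers: applicant aptitude x and a noisy test result y.
prior = {"high": 0.3, "low": 0.7}
likelihood = {
    "high": {"pass": 0.9, "fail": 0.1},
    "low": {"pass": 0.4, "fail": 0.6},
}

posterior = bayes_update(prior, likelihood, "pass")
print(posterior)
```

With these illustrative numbers, observing a pass raises the probability of high aptitude from 0.3 to roughly 0.49.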

However, this depends on two very important assumptions.

The first is that the agent knows the distribution $X$. This is the prior in their subjective calculation of the Bayesian update. In our models, we have been perhaps sloppy in assuming that this prior probability corresponds to the true probability distribution from which the value $x$ is drawn. We are somewhat safe in this assumption because for the purposes of determining strategy, only subjective probabilities can be taken into account and we can relax the distribution to encode something close to zero knowledge of the outcome if necessary. In more complex models, the difference between agents with different knowledge of $X$ may be more strategically significant, but we aren’t there yet.

The second important assumption is that the agent knows the likelihood function $P(Y \vert X)$. This is quite a strong assumption, as it implies that the agent truly knows how $Y$ covaries with $X$, allowing them to “decode” the message $y$ into useful information about $x$.

It may be best to think of access and usage of the likelihood function as a rare capability. Indeed, in our model of expertise, the assumption was that the service provider (think doctor) knew more about the relationship between $X$ (appropriate treatment) and $Y$ (observable symptoms) than the consumer (patient) did. In the case of companies that use data science, the idea is that some combination of data and science gives the company an edge over its competitors in knowing the true value of some uncertain property.

What we are discovering is that it’s not just the availability of $y$ that matters, but also the ability to interpret $y$ with respect to the probability of $x$. Data does not speak for itself.
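A small sketch can make this concrete. Two observers see the same $y$ but hold different likelihood functions. The names `expert_belief` and `naive_belief` and all of the numbers are assumptions made up for illustration, loosely following the doctor and patient example.

```python
def posterior(prior, likelihood, y):
    """Bayesian update P(x | y) for discrete x."""
    joint = {x: prior[x] * likelihood[x][y] for x in prior}
    total = sum(joint.values())
    return {x: p / total for x, p in joint.items()}


# Shared prior over whether treatment is appropriate (hypothetical numbers).
prior = {"treat": 0.2, "no_treat": 0.8}

# The expert's likelihood: symptoms are strongly tied to the condition.
expert_likelihood = {
    "treat": {"symptom": 0.9, "none": 0.1},
    "no_treat": {"symptom": 0.2, "none": 0.8},
}

# An observer who cannot interpret the symptom treats it as unrelated to x.
naive_likelihood = {
    "treat": {"symptom": 0.5, "none": 0.5},
    "no_treat": {"symptom": 0.5, "none": 0.5},
}

expert_belief = posterior(prior, expert_likelihood, "symptom")
naive_belief = posterior(prior, naive_likelihood, "symptom")
print(expert_belief["treat"], naive_belief["treat"])
```

In this sketch the observer with the flat likelihood ends up exactly where they started, at the prior, even though they saw the same observation as the expert.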

This incidentally ties in with a point that we have perhaps glossed over too quickly in the present discussion: what is information, really? This may seem like a distraction in a discussion about economics, but it is a question that has come up in my own idiosyncratic “disciplinary” formation. One of the best intuitive definitions of information is provided by philosopher Fred Dretske (1981; 1983). I once made a presentation of Fred Dretske’s view on information and its relationship to epistemological skepticism and Shannon information theory; you can find this presentation here. But for present purposes I want to call attention to his definition of what it means for a message to carry information, which is:

> [A] message carries the information that X is a dingbat, say, if and only if one could learn (come to know) that X is a dingbat from the message.
>
> When I say that one could learn that X was a dingbat from the message, I mean, simply, that the message has whatever reliable connection with dingbats is required to enable a suitably equipped, but otherwise ignorant receiver, to learn from it that X is a dingbat.

This formulation is worth mentioning because it supplies a kind of philosophical validation for our Bayesian formulation of information flow in the economy. We are modeling situations where $Y$ is a signal that is reliably connected with $X$, such that instantiations of $Y$ carry information about the value of $X$. We might express this in terms of conditional entropy:

$H(X|Y) < H(X)$

While this is sufficient for Y to carry information about X, it is not sufficient for any observer of Y to consequently know X. An important part of Dretske's definition is that the receiver must be suitably equipped to make the connection.
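The entropy condition can be checked numerically for a small discrete model. This is only a sketch: the helper names `entropy` and `conditional_entropy` and the probabilities are illustrative assumptions, not part of the models discussed earlier.

```python
import math


def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcome -> probability."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def conditional_entropy(prior, likelihood):
    """H(X | Y) = sum over y of P(y) * H(X | Y = y)."""
    ys = list(next(iter(likelihood.values())).keys())
    total = 0.0
    for y in ys:
        joint = {x: prior[x] * likelihood[x][y] for x in prior}
        p_y = sum(joint.values())
        if p_y == 0:
            continue  # an impossible observation contributes nothing
        posterior = {x: p / p_y for x, p in joint.items()}
        total += p_y * entropy(posterior)
    return total


# Hypothetical two-valued X and a signal Y that is reliably connected to it.
prior = {"a": 0.5, "b": 0.5}
informative = {"a": {0: 0.9, 1: 0.1}, "b": {0: 0.2, 1: 0.8}}
# A signal that is statistically independent of X.
uninformative = {"a": {0: 0.5, 1: 0.5}, "b": {0: 0.5, 1: 0.5}}

h_x = entropy(prior)
h_x_given_informative = conditional_entropy(prior, informative)
h_x_given_uninformative = conditional_entropy(prior, uninformative)
print(h_x, h_x_given_informative, h_x_given_uninformative)
```

The informative signal lowers the entropy of $X$, while the independent one leaves it unchanged. Note that this calculation uses the true joint distribution; it says what $Y$ carries, not what any particular receiver is able to extract.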

In our models, the “suitably equipped” condition is represented as the ability to compute the Bayesian update using a realistic likelihood function $P(Y \vert X)$. This is a difficult demand. A lot of computational statistics has to do with the difficulty of tractably estimating the likelihood function, let alone computing it perfectly.

References

Dretske, F. I. (1983). The epistemology of belief. Synthese, 55(1), 3-19.

Dretske, F. (1981). Knowledge and the Flow of Information.

### The recalcitrance of prediction

We have identified how Bostrom’s core argument for superintelligence explosion depends on a crucial assumption. An intelligence explosion will happen only if the kinds of cognitive capacities involved in instrumental reason are not recalcitrant to recursive self-improvement. If recalcitrance rises comparably with the system’s ability to improve itself, then the takeoff will not be fast. This significantly decreases the probability of decisively strategic singleton outcomes.

In this section I will consider the recalcitrance of intelligent prediction, which is one of the capacities that is involved in instrumental reason (another being planning). Prediction is a very well-studied problem in artificial intelligence and statistics and so is easy to characterize and evaluate formally.

Recalcitrance is difficult to formalize. Recall that in Bostrom’s formulation:

$\frac{dI}{dt} = \frac{O(I)}{R(I)}$

One difficulty in analyzing this formula is that the units are not specified precisely. What is a “unit” of intelligence? What kind of “effort” is the unit of optimization power? And how could one measure recalcitrance?
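Leaving the question of units aside, the qualitative behavior of the formula can be sketched with a simple Euler integration. Everything specific here is an assumption made for illustration: the function names, the choice $O(I) = I$ (optimization power proportional to the system's own intelligence), and the two candidate forms of $R(I)$.

```python
def simulate(optimization_power, recalcitrance, i0=1.0, dt=0.01, steps=1000):
    """Euler-integrate dI/dt = O(I) / R(I) and return the final value of I."""
    intelligence = i0
    for _ in range(steps):
        rate = optimization_power(intelligence) / recalcitrance(intelligence)
        intelligence += dt * rate
    return intelligence


def proportional_power(i):
    # Assumption: the system applies its own intelligence to improving itself.
    return i


def constant_recalcitrance(i):
    return 1.0


def rising_recalcitrance(i):
    # Assumption: recalcitrance grows in step with intelligence.
    return i


fast = simulate(proportional_power, constant_recalcitrance)  # dI/dt = I
slow = simulate(proportional_power, rising_recalcitrance)    # dI/dt = 1
print(fast, slow)
```

Under these assumptions, constant recalcitrance gives exponential growth, while recalcitrance that rises in proportion to intelligence gives only linear growth over the same interval.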

A benefit of looking at a particular intelligent task is that it allows us to think more concretely about what these terms mean. If we can specify which tasks are important to consider, then we can take the level of performance on that well-specified class of problems as a measure of intelligence.

Prediction is one such problem. In a nutshell, prediction comes down to estimating a probability distribution over hypotheses. Using the Bayesian formulation of statistical inference, we can represent the problem as:

$P(H|D) = \frac{P(D|H) P(H)}{P(D)}$

Here, $P(H|D)$ is the posterior probability of a hypothesis $H$ given observed data $D$. If one is following statistically optimal procedure, one can compute this value by taking the prior probability of the hypothesis $P(H)$, multiplying it by the likelihood of the data given the hypothesis $P(D|H)$, and then normalizing this result by dividing by the probability of the data over all models, $P(D) = \sum_{i}P(D|H_i)P(H_i)$.
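As a minimal sketch of this procedure, consider a finite set of hypotheses about a coin's bias. The hypotheses, the flip sequence, and the function names below are hypothetical, chosen only to show the three steps: prior, likelihood, and normalization by $P(D)$.

```python
def coin_likelihood(flips, bias):
    """P(D | H) for independent flips, where H says P(heads) = bias."""
    result = 1.0
    for flip in flips:
        result *= bias if flip == "H" else 1.0 - bias
    return result


def posterior_over_hypotheses(prior, likelihood_fn, data):
    """Return P(H_i | D) for every hypothesis H_i in the prior."""
    # Numerator of Bayes' rule: P(D | H_i) * P(H_i).
    unnormalized = {h: likelihood_fn(data, h) * p for h, p in prior.items()}
    # Denominator: P(D) = sum_i P(D | H_i) * P(H_i).
    p_data = sum(unnormalized.values())
    return {h: v / p_data for h, v in unnormalized.items()}


# Three hypothetical hypotheses about the coin's bias, equally likely a priori.
prior = {0.2: 1 / 3, 0.5: 1 / 3, 0.8: 1 / 3}
data = "HHTHHHTH"  # six heads, two tails

post = posterior_over_hypotheses(prior, coin_likelihood, data)
best = max(post, key=post.get)
print(post, best)
```

Here the hypothesis with bias 0.8 receives the most posterior mass, since it assigns the highest likelihood to a sequence with six heads in eight flips.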

Statisticians will justifiably argue whether this is the best formulation of prediction. And depending on the specifics of the task, the target value may well be some function of the posterior (such as the hypothesis with maximum posterior probability) and the overall distribution may be secondary. These are valid objections that I would like to put to one side in order to get across the intuition of an argument.

What I want to point out is that if we look at the factors that affect performance on prediction problems, there are very few that could be subject to algorithmic self-improvement. If we think that part of what it means for an intelligent system to get more intelligent is to improve its ability to predict (which Bostrom appears to believe), but improving predictive ability is not something that a system can do via self-modification, then that implies that the recalcitrance of prediction, far from being constant or lower, actually approaches infinity with respect to an autonomous system’s capacity for algorithmic self-improvement.

So, given the formula above, in what ways can an intelligent system improve its capacity to predict? We can enumerate them:

• Computational accuracy. An intelligent system could be better or worse at computing the posterior probabilities. Since most of the algorithms that do this kind of computation do so with numerical approximation, there is the possibility of an intelligent system finding ways to improve the accuracy of this calculation.
• Computational speed. There are faster and slower ways to compute the inference formula. An intelligent system could come up with a way to make itself compute the answer faster.
• Better data. The success of inference is clearly dependent on what kind of data the system has access to. Note that “better data” is not necessarily the same as “more data”. If the data that the system learns from is from a biased sample of the phenomenon in question, then a successful Bayesian update could make its predictions worse, not better. Better data is data that is informative with respect to the true process that generated the data.
• Better prior. The success of inference depends crucially on the prior probability assigned to hypotheses or models. A prior is better when it assigns higher probability to the true process that generates observable data, or models that are ‘close’ to that true process. An important point is that priors can be bad in more than one way. The bias/variance tradeoff is a well-studied way of discussing this. Choosing a prior in machine learning involves a tradeoff between:
1. Bias. The assignment of probability to models that skew away from the true distribution. An example of a biased prior would be one that gives positive probability to only linear models, when the true phenomenon is quadratic. Biased priors lead to underfitting in inference.
2. Variance. The assignment of probability to models that are more complex than are needed to reflect the true distribution. An example of a high-variance prior would be one that assigns high probability to cubic functions when the data was generated by a quadratic function. The problem with high-variance priors is that they will overfit data by inferring from noise, which could be the result of measurement error or something else less significant than the true generative process.

In short, the best prior is the correct prior, and any deviation from that increases error.
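The effect of the prior can be sketched with the same kind of discrete model. This example covers only the bias failure, not variance: one prior gives positive probability to the true process and the other excludes it. The constant `TRUE_BIAS`, the counts, and the priors are all hypothetical.

```python
def posterior(prior, heads, tails):
    """P(bias | data) over a finite set of candidate coin biases."""
    unnormalized = {
        b: p * (b ** heads) * ((1.0 - b) ** tails) for b, p in prior.items()
    }
    total = sum(unnormalized.values())
    return {b: v / total for b, v in unnormalized.items()}


def predicted_heads(post):
    """Posterior predictive probability that the next flip is heads."""
    return sum(b * p for b, p in post.items())


TRUE_BIAS = 0.8           # assumed true generative process
heads, tails = 80, 20     # illustrative data consistent with that process

# A prior that includes the true bias among its hypotheses.
good_prior = {0.2: 0.25, 0.5: 0.25, 0.8: 0.5}
# A prior that assigns zero probability to the true bias.
restricted_prior = {0.2: 0.5, 0.5: 0.5}

good_prediction = predicted_heads(posterior(good_prior, heads, tails))
restricted_prediction = predicted_heads(posterior(restricted_prior, heads, tails))

good_error = abs(good_prediction - TRUE_BIAS)
restricted_error = abs(restricted_prediction - TRUE_BIAS)
print(good_error, restricted_error)
```

Both agents perform the same correct update on the same data. The one whose prior excludes the truth settles on the nearest hypothesis it allows, and no further data of this kind can move it closer.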

Now that we have enumerated the ways in which an intelligent system may improve its power of prediction, which is one of the things that’s necessary for instrumental reason, we can ask: how recalcitrant are these factors to recursive self-improvement? How much can an intelligent system, by virtue of its own intelligence, improve on any of these factors?

Let’s start with computational accuracy and speed. An intelligent system could, for example, use some previously collected data and try variations of its statistical inference algorithm, benchmark their performance, and then choose to use the most accurate and fastest ones at a future time. Perhaps the faster and more accurate the system is at prediction generally, the faster and more accurately it would be able to engage in this process of self-improvement.

Critically, however, there is a maximum amount of performance that one can get from improvements to computational accuracy if you hold the other factors constant. You can’t be more accurate than perfectly accurate. Therefore, at some point recalcitrance of computational accuracy rises to infinity. Moreover, we would expect that effort made at improving computational accuracy would exhibit diminishing returns. In other words, recalcitrance of computational accuracy climbs (probably close to exponentially) with performance.
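The ceiling on computational accuracy can be illustrated with a case where the exact answer is known. The sketch below approximates a posterior mean on a grid of increasing resolution and compares it with the closed-form value. The setup (a coin with a uniform prior, the counts, and the grid sizes) is an assumption chosen for illustration, not a model of any particular inference algorithm.

```python
def grid_posterior_mean(heads, tails, n_points):
    """Approximate E[theta | data] under a uniform prior with a midpoint grid."""
    step = 1.0 / n_points
    numerator = 0.0
    denominator = 0.0
    for k in range(n_points):
        theta = (k + 0.5) * step
        # Unnormalized posterior density at this grid point.
        weight = theta ** heads * (1.0 - theta) ** tails
        numerator += theta * weight
        denominator += weight
    return numerator / denominator


heads, tails = 7, 3
# Exact posterior mean: the mean of a Beta(heads + 1, tails + 1) distribution.
exact = (heads + 1) / (heads + tails + 2)

grid_sizes = (4, 8, 16, 32, 64)
errors = {n: abs(grid_posterior_mean(heads, tails, n) - exact) for n in grid_sizes}
for n in grid_sizes:
    print(n, errors[n])
```

More computational effort shrinks the approximation error toward zero, but zero is the floor: with the data and prior held fixed, a finer grid cannot make the prediction any better than the exact posterior already is.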

What is the recalcitrance of computational speed at inference? Here, performance is limited primarily by the hardware on which the intelligent system is implemented. In Bostrom’s account of superintelligence explosion, he is ambiguous about whether and when hardware development counts as part of a system’s intelligence. What we can say with confidence, however, is that for any particular piece of hardware there will be a maximum computational speed attainable with it, and that recursive self-improvement to computational speed can at best approach and attain this maximum. At that maximum, further improvement is impossible and recalcitrance is again infinite.