statistics | Digifesto

May 20, 2018

General intelligence, social privilege, and causal inference from factor analysis

I came upon this excellent essay by Cosma Shalizi about how factor analysis has been spuriously used to support the scientific theory of General Intelligence (i.e., IQ). Shalizi, if you don’t know, is one of the best statisticians around. He writes really well and isn’t afraid to point out major blunders in things. He’s one of my favorite academics, and I don’t think I’m alone in this assessment.

First, a motive: Shalizi writes this essay because he thinks the scientific theory of General Intelligence, or a g factor that is some real property of the mind, is wrong. This theory is famous because (a) a lot of people DO believe in IQ as a real feature of the mind, and (b) a significant percentage of these people believe that IQ is hereditary and correlated with race, and (c) the ideas in (b) are used to justify pernicious and unjust social policy. Shalizi, being a principled statistician, appears to take scientific objection to (a) independently of his objection to (c), and argues persuasively that we can reject (a). How?

Shalizi’s point is that the general intelligence factor g is a latent variable that was supposedly discovered using a factor analysis of several different intelligence tests that were supposed to be independent of each other. You can take the data from these data sets and do a dimensionality reduction (that’s what factor analysis is) and get something that looks like a single factor, just as you can take a set of cars and do a dimensionality reduction and get something that looks like a single factor, “size”. The problem is that “intelligence”, just like “size”, can also be a combination of many other factors that are only indirectly associated with each other (height, length, mass, mass of specific components independent of each other, etc.). Once you have many different independent factors combining into one single reduced “dimension” of analysis, you no longer have a coherent causal story of how your general latent variable caused the phenomenon. You have, effectively, correlation without demonstrated causation and, moreover, the correlation is a construct of your data analysis method, and so isn’t really even telling you what correlations normally tell you.
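A toy simulation makes the point concrete. The sketch below is only illustrative: the numbers of people, abilities, and tests, and the function names, are assumptions made for the example, and a plain eigen-decomposition of the correlation matrix stands in for a full factor analysis. Every "test" score is a sum over a random subset of completely independent abilities, so no single common cause is built into the data.

```python
import numpy as np

def simulate_test_scores(n_people=2000, n_abilities=500, n_tests=8,
                         abilities_per_test=150, seed=0):
    """Each test score is the sum of a random subset of independent abilities."""
    rng = np.random.default_rng(seed)
    # Independent abilities: there is no general factor in the generating process.
    abilities = rng.standard_normal((n_people, n_abilities))
    scores = np.empty((n_people, n_tests))
    for t in range(n_tests):
        used = rng.choice(n_abilities, size=abilities_per_test, replace=False)
        scores[:, t] = abilities[:, used].sum(axis=1)
    return scores

def first_factor_share(scores):
    """Share of total variance on the leading eigenvector of the correlation matrix."""
    corr = np.corrcoef(scores, rowvar=False)
    eigenvalues = np.linalg.eigvalsh(corr)  # returned in ascending order
    return eigenvalues[-1] / eigenvalues.sum(), corr

scores = simulate_test_scores()
share, corr = first_factor_share(scores)
off_diagonal = corr[~np.eye(corr.shape[0], dtype=bool)]
print(f"smallest pairwise correlation between tests: {off_diagonal.min():.2f}")
print(f"variance share of the first factor: {share:.2f}")
```

Because the tests overlap in the abilities they draw on, all the pairwise correlations come out positive and the first factor takes far more than the one-eighth share it would get if the eight tests were uncorrelated. A dominant factor appears even though, by construction, nothing like a single underlying variable exists.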

To put it another way: the fact that some people seem to be generally smarter than other people can be due to thousands of independent factors that happen to combine when people apply themselves to different kinds of tasks. If some people were NOT seeming generally smarter than others, that would allow you to reject the hypothesis that there was general intelligence. But the mere presence of the aggregate phenomenon does not prove the existence of a real latent variable. In fact, Shalizi goes on to say, when you do the right kinds of tests to see if there really is a latent factor of ‘general intelligence’, you find that there isn’t any. And so it’s just the persistent and possibly motivated interpretation of the observational data that allows the stubborn myth of general intelligence to continue.

Are you following so far? If you are, it’s likely because you were already skeptical of IQ and its racial correlates to begin with. Now I’m going to switch it up though…

It is fairly common for educated people in the United States (for example) to talk about the “privilege” of social groups. White privilege, male privilege–don’t tell me you haven’t at least heard of this stuff before; it is literally everywhere on the center-left news. Privilege here is considered to be a general factor that inheres in certain social groups. It is reinforced by all manner of social conditioning, especially through implicit bias in individual decision-making. This bias is so powerful that it extends not just to cases of direct discrimination but also to cases where discrimination happens in a mediated way, for example through technical design. The evidence for these kinds of social privileging effects is obvious: we see inequality everywhere, and we can see who is more powerful and benefited by the status quo and who isn’t.

You see where this is going now. I have the momentum. I can’t stop. Here it goes: Maybe this whole story about social privilege is as spuriously supported as the story about general intelligence? What if both narratives were over-interpretations of data that serve a political purpose, but which are not in fact based on sound causal inference techniques?

How could this be? Well, we might gather a lot of data about people: wealth, status, neighborhood, lifespan, etc. And then we could run a dimensionality reduction/factor analysis and get a significant factor that we could name “privilege” or “power”. Potentially that’s a single, real, latent variable. But also potentially it’s hundreds of independent factors spuriously combined into one. It would probably, if I had to bet on it, wind up looking a lot like the factor for “general intelligence”, which plays into the whole controversy about whether and how privilege and intelligence get confused. You must have heard the debates about, say, representation in the technical (or other high-status, high-paying) work force? One side says the smart people get hired; the other side says it’s the privileged (white male) people that get hired. Some jerk suggests that maybe the white males are smarter, and he gets fired. It’s a mess.

I’m offering you a pill right now. It’s not the red pill. It’s not the blue pill. It’s some other colored pill. Green?

There is no such thing as either general intelligence or group-based social privilege. Each of these is the result of sloppy data compression over thousands of factors with a loose and subtle correlational structure. The reason why the patterns of social behavior that we see are so robust against interventions is that each intervention can work against only one or two of these thousands of factors at a time. Discovering the real causal structure here is hard partly because the effect sizes are very small. Anybody with a simple explanation, especially a politically convenient explanation, is lying to you but also probably lying to themselves. We live in a complex world that resists our understanding and our actions to change it, though it can be better understood and changed through sound statistics. Most people aren’t bothering to do this, and that’s why the world is so dumb right now.


December 18, 2017

The Data Processing Inequality and bounded rationality

I have long harbored the hunch that information theory, in the classic Shannon sense, and social theory are deeply linked. It has proven to be very difficult to find an audience for this point of view or an opportunity to work on it seriously. Shannon’s information theory is widely respected in engineering disciplines; many social theorists who are unfamiliar with it are loath to admit that something from engineering should carry essential insights for their own field. Meanwhile, engineers are rarely interested in modeling social systems.

I’ve recently discovered an opportunity to work on this problem through my dissertation work, which is about privacy engineering. Privacy is a subtle social concept but also one that has been rigorously formalized. I’m working on formal privacy theory now and have been reminded of a theorem from information theory: the Data Processing Inequality. What strikes me about this theorem is that it captures a point that comes up again and again in social and political problems, though it’s a point that’s almost never addressed head on.

The Data Processing Inequality (DPI) states that for three random variables, X, Y, and Z, arranged in a Markov chain $X \rightarrow Y \rightarrow Z$, we have $I(X,Z) \leq I(X,Y)$, where $I$ stands for mutual information. Mutual information is a measure of how much two random variables carry information about each other. If $I(X,Y) = 0$, that means the variables are independent. $I(X,Y) \geq 0$ always–that’s just a mathematical fact about how it’s defined.
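The inequality is easy to check numerically on a small discrete example. The sketch below is illustrative only: the alphabet sizes, the randomly drawn conditional distributions, and the helper names `mutual_information` and `random_stochastic_matrix` are assumptions made for the example.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information, in bits, of a 2-D joint probability table."""
    joint = np.asarray(joint, dtype=float)
    p_rows = joint.sum(axis=1, keepdims=True)   # marginal of the first variable
    p_cols = joint.sum(axis=0, keepdims=True)   # marginal of the second variable
    mask = joint > 0                            # skip zero cells (0 * log 0 = 0)
    independent = p_rows @ p_cols               # product of the marginals
    return float((joint[mask] * np.log2(joint[mask] / independent[mask])).sum())

def random_stochastic_matrix(rng, n_rows, n_cols):
    """Each row is a conditional distribution over n_cols outcomes."""
    m = rng.random((n_rows, n_cols))
    return m / m.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
p_x = rng.dirichlet(np.ones(6))                     # distribution of X
p_y_given_x = random_stochastic_matrix(rng, 6, 4)   # channel X -> Y
p_z_given_y = random_stochastic_matrix(rng, 4, 5)   # channel Y -> Z

joint_xy = p_x[:, None] * p_y_given_x               # P(X, Y)
joint_xz = joint_xy @ p_z_given_y                   # P(X, Z): Z sees X only through Y

i_xy = mutual_information(joint_xy)
i_xz = mutual_information(joint_xz)
print(f"I(X,Y) = {i_xy:.4f} bits, I(X,Z) = {i_xz:.4f} bits")
```

Whatever random channels are drawn, processing $Y$ into $Z$ can only lose information about $X$, never create it.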

The implications of this for psychology, social theory, and artificial intelligence are, I think, rather profound. It provides a way of thinking about bounded rationality in a simple and generalizable way–something I’ve been struggling to figure out for a long time.

Suppose that there’s a big world out there, $W$, and there’s an organism, or a person, or a sociotechnical organization within it, $Y$. The world is big and complex, which implies that it has a lot of informational entropy, $H(W)$. Through whatever sensory apparatus is available to $Y$, it acquires some kind of internal sensory state. Because this organism is much smaller than the world, its entropy is much lower. There are many fewer possible states that the organism can be in, relative to the number of states of the world: $H(W) \gg H(Y)$. This in turn bounds the mutual information between the organism and the world: $I(W,Y) \leq H(Y)$.

Now let’s suppose the actions that the organism takes, $Z$, depend only on its internal state, so that $W \rightarrow Y \rightarrow Z$ forms a Markov chain. It is an agent, reacting to its environment. Well, whatever these actions are, they can only be as calibrated to the world as the agent had capacity to absorb the world’s information. I.e., $I(W,Z) \leq H(Y) \ll H(W)$. The implication is that the more limited the mental capacity of the organism, the more its actions will be approximately independent of the state of the world that precedes it.
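The bound on the agent’s actions can be spelled out step by step. This is a sketch under two assumptions that the informal argument leaves implicit: that $Y$ takes values in a discrete state space, so its conditional entropy is non-negative, and that “actions depend only on internal state” means world, internal state, and action form a Markov chain.

```latex
\begin{align*}
I(W,Y) &= H(Y) - H(Y \mid W) \leq H(Y)
  && \text{since } H(Y \mid W) \geq 0 \\
I(W,Z) &\leq I(W,Y)
  && \text{DPI applied to } W \rightarrow Y \rightarrow Z \\
\Rightarrow \quad I(W,Z) &\leq I(W,Y) \leq H(Y) \ll H(W)
\end{align*}
```

The first line says an agent cannot know more about the world than it has internal states to hold. The second says its actions cannot reflect more about the world than it knows.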

There are a lot of interesting implications of this for social theory. Here are a few cases that come to mind.

I've written quite a bit here (blog links) and here (arXiv) about Bostrom’s superintelligence argument and why I’m generally not concerned with the prospect of an artificial intelligence taking over the world. My argument is that there are limits to how much an algorithm can improve itself, and these limits put a stop to exponential intelligence explosions. I’ve been criticized on the grounds that I don’t specify what the limits are, and that if the limits are high enough then maybe relative superintelligence is possible. The Data Processing Inequality gives us another tool for estimating the bounds of an intelligence based on the range of physical states it can possibly be in. How calibrated can a hegemonic agent be to the complexity of the world? It depends on the capacity of that agent to absorb information about the world; that can be measured in information entropy.

A related case is a rendering of Scott’s Seeing Like a State arguments. Why is it that “high modernist” governments failed to successfully control society through scientific intervention? One reason is that the complexity of the system they were trying to manage vastly outsized the complexity of the centralized control mechanisms. Centralized control was very blunt, causing many social problems. Arguably, behavioral targeting and big data centers today equip controlling organizations with more informational capacity (more entropy), but they still get it wrong sometimes, causing privacy violations, because they can’t model the entirety of the messy world we’re in.

The Data Processing Inequality is also helpful for explaining why the world is so messy. There are a lot of different agents in the world, and each one only has so much bandwidth for taking in information. This means that most agents are acting almost independently from each other. The guiding principle of society isn’t signal, it’s noise. That explains why there are so many disorganized heavy tail distributions in social phenomena.

Importantly, if we let the world at any time slice be informed by the actions of many agents acting nearly independently from each other in the slice before, then that increases the entropy of the world. This increases the challenge for any particular agent to develop an effective controlling strategy. For this reason, we would expect the world to get more out of control the more intelligent agents are on average. The popularity of the personal computer perhaps introduced a lot more entropy into the world, distributed in an agent-by-agent way. Moreover, powerful controlling data centers may increase the world’s entropy, rather than reducing it. So even if, for example, Amazon were to try to take over the world, the existence of Baidu would be a major obstacle to its plans.

There are a lot of assumptions built into these informal arguments and I’m not wedded to any of them. But my point here is that information theory provides useful tools for thinking about agents in a complex world. There’s potential for using it for modeling sociotechnical systems and their limitations.

July 11, 2017

Why disorganized heavy tail distributions?

I wrote too soon.

Miller and Page (2009) do indeed address “fat tail” distributions explicitly in the same chapter on Emergence discussed in my last post.

However, they do not touch on the possibility that fat tail distributions might be log normal distributions generated by the Central Limit Theorem, as is well-documented by Mitzenmacher (2004).

Instead, they explicitly make a different case. They argue that there are two kinds of complexity:

disorganized complexity, complexity where extreme values balance each other out to create average aggregate behavior according to the Law of Large Numbers and Central Limit Theorem.
organized complexity, where positive and negative feedback can result in extreme outcomes, best characterized by power law or “heavy tail” distributions. Preferential attachment is an example of a feedback based mechanism for generating power law distributions (in the specific case of network degrees).

Indeed, this rough breakdown of possible scientific explanations (the relatively orderly null-hypothesis world of normal distributions, and the chaotic, more accurately rendered world of heavy tail distributions) was the one I had before I started studying complex systems and statistics more seriously in grad school.

Only later did I come to the conclusion that this is a pervasive error, because of the ease with which log normal distributions (which may be “disorganized”) can be confused with power law distributions (which tend to be explained by “organized” processes). I am a bit disappointed that Miller and Page repeat this error, but then again their book was written in 2009. I wonder whether the methodological realization (which I assume I’m not alone in, as I hear it confirmed informally in conversations with smart people sometimes) is relatively recent.

Because this is something so rarely discussed in focus, I think it may be worth pondering exactly why disorganized heavy tail distributions are not favored in the literature. There are several reasons I can think of, which I’ll offer informally here as possibilities or hypotheses.

One reason that I’ve argued for before here is that organized processes are more satisfying as explanations than disorganized processes. Most people are not very good at thinking about probabilities (Tetlock and Gardner (2016) have a great, accessible discussion of why this is the case). So to the extent that the Law of Large Numbers or Central Limit Theorem have true explanatory power, it may not be the kind of explanation most people are willing to entertain. This apparently includes scientists. Rather, a simple explanation in terms of feedback may be the kind of thing that feels like a robust scientific finding, even if there’s something spurious about it when viewed rigorously. (This is related, I think, to arguments about the end of narrative in social science.)

Another reason why disorganized heavy tail distributions may be underutilized as scientific explanations is that it is counter-intuitive that a disorganized process can produce such extreme inequality in outcomes.

This has to do with the key transformation that is the difference between a normal and a log normal distribution. A normal distribution is a bell-shaped distribution one gets when one adds a large number of independent random variables.

The log normal distribution is a heavy tail distribution one gets by multiplying a large number of positively valued independent random variables. While it does have a bell or hump, the top of the bell is not at the arithmetic mean, because the sides of the bell are skewed in size. But this is not necessarily because of the dominance of any particular factor (as would be expected if, for example, a single factor were involved in a positive feedback loop). Rather, it is the mathematical fact of many factors multiplied creating extraordinarily high values which creates the heavy right-hand side of the bell.
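The contrast between adding and multiplying can be seen in a few lines of simulation. This is a rough sketch: the number of factors, the uniform range they are drawn from, and the `top_share` helper are arbitrary choices made for illustration. The same independent, positive factors are combined once by summing and once by multiplying.

```python
import numpy as np

rng = np.random.default_rng(42)
n_outcomes, n_factors = 100_000, 50

# Independent, positive factors; none of them is privileged over the others.
factors = rng.uniform(0.5, 1.5, size=(n_outcomes, n_factors))

sums = factors.sum(axis=1)        # additive combination: approximately normal
products = factors.prod(axis=1)   # multiplicative combination: approximately log normal

def top_share(values, fraction=0.01):
    """Share of the total held by the largest `fraction` of outcomes."""
    k = int(len(values) * fraction)
    return np.sort(values)[-k:].sum() / values.sum()

for name, values in [("sum", sums), ("product", products)]:
    print(f"{name:8s} mean={values.mean():.3g}  median={np.median(values):.3g}  "
          f"top 1% share={top_share(values):.1%}")
```

For the sums, the mean and median nearly coincide and the top 1% of outcomes hold little more than 1% of the total. For the products, the mean sits far above the median and the top 1% hold a large share of the total, even though no factor dominates and there is no feedback loop anywhere in the process.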

One way to put it is that rather than having a “deep” positive feedback loop where a single factor amplifies itself many times over, disorganized heavy tails have “shallow” positive feedback where each of many factors has a single and simultaneous amplifying effect on the impact of all the others. This amplification effect is, like multiplication itself, commutative, which means that no single factor can be considered to be causally prior to the others.

Once again, this defies specificity in an explanation, which may be for some people an explanatory desideratum.

But these extreme values are somehow ones that people demand specific explanations for. This is related, I believe, to the desire for a causal lever with which people can change outcomes, especially their own personal outcomes.

There’s an important political question implicated by all this, which is: why are wealth and power concentrated in the hands of the very few?

One explanation that must be considered is the possibility that society is accumulated history, and over thousands of years an innumerable number of independent factors have affected the distribution of wealth and power. Though rather disorganized, these factors amplify each other multiplicatively, resulting in the distribution that we see today.

The problem with this explanation is that it seems there is little to be done about this state of affairs. A person can affect a handful of the factors that contribute to their own wealth or the wealth of another, but if there are thousands of them then it’s hard to get a grip. One must view the other as simply lucky or unlucky. How can one politically mobilize around that?

References

Miller, John H., and Scott E. Page. Complex adaptive systems: An introduction to computational models of social life. Princeton University Press, 2009.

Mitzenmacher, Michael. “A brief history of generative models for power law and lognormal distributions.” Internet mathematics 1.2 (2004): 226-251.

Tetlock, Philip E., and Dan Gardner. Superforecasting: The art and science of prediction. Random House, 2016.
