Digifesto

textual causation

A problem that’s coming up for me as a data scientist is the problem of textual causation.

There has been significant interesting research into the problem of extracting causal relationships between things in the world from text about those things. That’s an interesting problem but not the problem I am talking about.

I am talking about the problem of identifying when a piece of text has been the cause of some event in the world. So, did the State of the Union address affect the stock prices of U.S. companies? Specifically, did the text of the State of the Union address affect the stock price? Did my email cause my company to be more productive? Did specifically what I wrote in the email make a difference?

A trivial example of textual causation (if I have my facts right–maybe I don’t) is the calculation of Twitter trending topics. Millions of users write text. That text is algorithmically scanned and under certain conditions, Twitter determines a topic to be trending and displays it to more users through its user interface, which also uses text. The user interface text causes thousands more users to look at what people are saying about the topic, increasing the causal impact of the original text. And so on.

These are some challenges to understanding the causal impact of text:

  • Text is an extraordinarily high-dimensional space with tremendous irregularity in distribution of features.
  • Textual events are unique not just because the probability of any particular utterance is so low, but also because the context of an utterance is informed by all the text prior to it.
  • For the most part, text is generated by a process of unfathomable complexity and interpreted likewise.
  • A single ‘piece’ of text can appear and reappear in multiple contexts as distinct events.

I am interested in whether it is possible to get a grip on textual causation mathematically and with machine learning tools. Bayesian methods theoretically can help with the prediction of unique events. And the Pearl/Rubin model of causation is well integrated with Bayesian methods. But is it possible to use the Pearl/Rubin model to understand unique events? The methodological uses of Pearl/Rubin I’ve seen are all about establishing type causation between independent occurrences. Textual causation appears to be as a rule a kind of token causation in a deeply integrated contextual web.

Perhaps this is what makes the study of textual causation uninteresting. If it does not generalize, then it is difficult to monetize. It is a matter of historical or cultural interest.

But think about all the effort that goes into communication at, say, the operational level of an organization. How many jobs require “excellent communication skills.” A great deal of emphasis is placed not only on that communication happens, but how people communicate.

One way to approach this is using the tools of linguistics. Linguistics looks at speech and breaks it down into components and structures that can be scientifically analyzed. It can identify when there are differences in these components and structures, calling these differences dialects or languages.

analysis of content vs. analysis of distribution of media

A theme that keeps coming up for me in work and conversation lately is the difference between analysis of the content of media and analysis of the distribution of media.

Analysis of content looks for the tropes, motifs, psychological intentions, unconscious historical influences, etc. of the media. Over Thanksgiving a friend of mine was arguing that the Scorpions were a dog whistle to white listeners because that band made a deliberate move to distance themselves from influence of black music on rock. Contrast this with Def Leppard. He reached this conclusion based by listening carefully to the beats and contextualizing them in historical conversations that were happening at the time.

Analysis of distribution looks at information flow and the systemic channels that shape it. How did the telegraph change patterns of communication? How did television? Radio? The Internet? Google? Facebook? Twitter? Ello? Who is paying for the distribution of this media? How far does the signal reach?

Each of these views is incomplete. Just as data underdetermines hypotheses, media underdetermines its interpretation. In both cases, a more complete understanding of the etiology of the data/media is needed to select between competing hypotheses. We can’t truly understand content unless we understand the channels through which it passes.

Analysis of distribution is more difficult than analysis of content because distribution is less visible. It is much easier to possess and study data/media than it is to possess and study the means of distribution. The means of distribution are a kind of capital. Those that study it from the outside must work hard to get anything better than a superficial view of it. Those on the inside work hard to get a deep view of it that stays up to date.

Part of the difficulty of analysis of distribution is that the system of distribution depends on the totality of information passing through it. Communication involves the dynamic engagement of both speakers and an audience. So a complete analysis of distribution must include an analysis of content for every piece of implicated content.

One thing that makes the content analysis necessary for analysis of distribution more difficult than what passes for content analysis simpliciter is that the former needs to take into account incorrect interpretation. Suppose you were trying to understand the popularity of Fascist propaganda in pre-WWII Germany and were interested in how the state owned the mass media channels. You could initially base your theory simply on how people were getting bombarded by the same information all the time. But you would at some point need to consider how the audience was reacting. Was it stirring feelings of patriotic national identity? Did they experience communal feelings with others sharing similar opinions? As propaganda provided interpretations of Shakespeare saying he was secretly a German and denunciation of other works as “degenerate art”, did the audience believe this content analysis? Did their belief in the propaganda allow them to continue to endorse the systems of distribution in which they took part?

This shows how the question of how media is interpreted is a political battle fought by many. Nobody fighting these battles is an impartial scientist. Since one gets an understanding of the means of distribution through impartial science, and since this understanding of the means of distribution is necessary for correct content analysis, we can dismiss most content analysis as speculative garbage, from a scientific perspective. What this kind of content analysis is instead is art. It can be really beautiful and important art.

On the other hand, since distribution analysis depends on the analysis of every piece of implicated content, distribution analysis is ultimately hopeless without automated methods for content analysis. This is one reason why machine learning techniques for analyzing text, images, and video are such a hot research area. While the techniques for optimizing supply chain logistics (for example) are rather old, the automated processing of media is a more subtle problem precisely because it involves the interpretation and reinterpretation by finite subjects.

By “finite subject” here I mean subjects that are inescapably limited by the boundaries of their own perspective. These limits are what makes their interpretation possible and also what makes their interpretation incomplete.