Digifesto


Notes on fairness and nondiscrimination in machine learning

There has been a lot of work done lately on “fairness in machine learning” and related topics. It cannot be a coincidence that this work has paralleled a rise in political intolerance that is sensitized to issues of gender, race, citizenship, and so on. I more or less stand by my initial reaction to this line of work. But very recently I’ve done a deeper and more responsible dive into this literature and it’s proven to be insightful beyond the narrow problems which it purports to solve. These are some notes on the subject, ordered so as to get to the point.

The subject of whether and to what extent computer systems can enact morally objectionable bias goes back at least as far as Friedman and Nissenbaum’s 1996 article, in which they define “bias” as systematic unfairness. They mean this very generally, not specifically in a political sense (though inclusive of it). Twenty years later, Kleinberg et al. (2016) prove that there are multiple, competing notions of fairness in machine classification which generally cannot all be satisfied at once; they must be traded off against each other. In particular, a classifier that uses all available information to optimize accuracy–one that achieves what these authors call calibration–cannot also have equal false positive and false negative rates across population groups (read: race, sex), a condition Hardt et al. (2016) call “equalized odds”. This is no doubt inspired by a now very famous ProPublica article asserting that a particular kind of commercial recidivism prediction software was “biased against blacks” because it had a higher false positive rate for black defendants than for white defendants. Because bail and parole decisions are made according to predicted recidivism, this led to cases where a non-recidivist was denied bail because they were black, which sounds unfair to a lot of people, including myself.
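To make the trade-off concrete, here is a minimal sketch (all distributions and numbers are hypothetical, not drawn from any real recidivism data): a risk score that is perfectly calibrated within each group still yields different false positive rates when the groups have different base rates.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_group(alpha, beta, n=100_000):
    """Draw calibrated risk scores and outcomes for one group.

    Scores come from a Beta distribution; each outcome is drawn with
    probability equal to the score, so the score is calibrated by
    construction within the group.
    """
    scores = rng.beta(alpha, beta, size=n)
    outcomes = rng.random(n) < scores
    return scores, outcomes

def false_positive_rate(scores, outcomes, threshold=0.5):
    """Share of true negatives flagged as high risk."""
    negatives = ~outcomes
    return np.mean(scores[negatives] >= threshold)

# Two hypothetical groups with different underlying base rates.
scores_a, y_a = simulate_group(3, 2)   # higher base rate
scores_b, y_b = simulate_group(2, 3)   # lower base rate

print("base rates:", y_a.mean(), y_b.mean())
print("FPR group A:", false_positive_rate(scores_a, y_a))
print("FPR group B:", false_positive_rate(scores_b, y_b))
# Both scores are calibrated, yet the false positive rates differ,
# which is the tension Kleinberg et al. (2016) formalize.
```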

While I understand that there is a lot of high quality and well-intentioned research on this subject, I haven’t found anybody who could tell me why the solution to this problem wasn’t to stop using predicted recidivism to set bail, as opposed to futzing around with a recidivism prediction algorithm which seems to have been doing its job (Dieterich et al., 2016). Recidivism rates are actually correlated with race (Hartney and Vuong, 2009). This is probably because of centuries of systematic racism. If you are serious about remediating historical inequality, the least you could do is cut black people some slack on bail.

This gets to what for me is the most baffling aspect of this whole research agenda, one that I didn’t have the words for before reading Barocas and Selbst (2016). A point well made by them is that the interpretation of anti-discrimination law, which motivates a lot of this research, is fraught with tensions that complicate its application to data mining.

“Two competing principles have always undergirded anti-discrimination law: nondiscrimination and antisubordination. Nondiscrimination is the narrower of the two, holding that the responsibility of the law is to eliminate the unfairness individuals experience at the hands of decisionmakers’ choices due to membership in certain protected classes. Antisubordination theory, in contrast, holds that the goal of antidiscrimination law is, or at least should be, to eliminate status-based inequality due to membership in those classes, not as a matter of procedure, but substance.” (Barocas and Selbst, 2016)

More specifically, these two principles motivate different interpretations of the two pillars of anti-discrimination law, disparate treatment and disparate impact. I draw on Barocas and Selbst for my understanding of each:

A judgment of disparate treatment requires either formal disparate treatment of similarly situated people across protected groups, or an intent to discriminate. Since in a large data mining application protected group membership will be proxied by many other factors, it’s not clear that the ‘formal’ requirement makes much sense here. And since machine learning applications only very rarely have racist intent, that option seems hard to establish as well. While there are interpretations of these criteria that are tougher on decision-makers (e.g., ones that count unconscious intent), these seem to be motivated by antisubordination rather than the weaker nondiscrimination principle.

A judgment of disparate impact is perhaps more straightforward, but it can be mitigated in cases of “business necessity”, which (to get to the point) is vague enough to plausibly include optimization in a technical sense. Once again, there is nothing to see here from a nondiscrimination standpoint, though an antisubordinationist would rather that these decision-makers be required to take correcting for historical inequality into account.

I infer from their writing that Barocas and Selbst believe that antisubordination is an important principle for anti-discrimination law. In any case, they maintain that making the case for applying anti-discrimination law to data mining effectively requires a commitment to “substantive remediation”. This is insightful!

Just to put my cards on the table: as much as I may like the idea of substantive remediation in principle, I personally don’t think that every application of anti-discrimination law needs to be animated by it. For many institutions, narrow nondiscrimination seems adequate if not preferable. I’d prefer remediation to occur through other specific policies, such as more public investment in schools in low-income districts. Perhaps for this reason, I’m not crazy about “fairness in machine learning” as a general technical practice. It seems to me to be trying to solve social problems with a technical fix, which, despite being quite technical myself, I don’t always see as a good idea. It seems like in most cases you could have a machine learning mechanism based on normal statistical principles (the learning step) and then separately use a decision procedure that achieves your political ends.
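To be concrete about the separation I have in mind, here is a minimal sketch with made-up data and purely illustrative thresholds: the learning step is ordinary statistical estimation, and the politics live entirely in a separate decision rule.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Hypothetical training data: two features, a binary outcome, and a
# group label that plays no role in the learning step.
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(int)
group = rng.integers(0, 2, size=1000)

# Learning step: ordinary statistical estimation.
model = LogisticRegression().fit(X, y)
risk = model.predict_proba(X)[:, 1]

# Decision step: a separate, explicit policy choice. These thresholds
# are purely illustrative.
thresholds = {0: 0.5, 1: 0.7}  # e.g., more slack for group 1
decision = risk >= np.array([thresholds[int(g)] for g in group])

print("positive decision rate, group 0:", decision[group == 0].mean())
print("positive decision rate, group 1:", decision[group == 1].mean())
```

Whether that decision step is legal or wise is a political and legal question, but at least it is stated explicitly rather than baked into the learning algorithm.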

I wish that this research community (and here I mean the qualitative research community surrounding it more than the technical community, which tends to define its terms carefully) would be more careful about the ways it talks about “bias”, because it often seems to encourage a conflation between statistical or technical senses of bias and political senses. The latter carry so much political baggage that it can be intimidating to try to wade in and untangle the two senses. And it’s important to do this untangling, because while bad statistical bias can lead to political bias, it can, depending on the circumstances, lead to either “good” or “bad” political bias. But it’s important, for the sake of numeracy (mathematical literacy), to understand that even if a statistically bad process has a politically “good” outcome, that is still, statistically speaking, bad.

My sense is that there are interpretations of anti-discrimination law that make it illegal to take certain facts about sensitive properties like race and sex into account when making certain judgments. There are also theorems showing that if you don’t take those sensitive properties into account, you are going to discriminate against the groups they pick out by accident, because the sensitive variables are correlated with most of the other things you would use to judge people. As a general principle, while being ignorant may sometimes make things better when you are extremely lucky, in general it makes things worse! This should be a surprise to nobody.
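Here is a minimal sketch of that accidental discrimination, on synthetic data: the model never sees the sensitive attribute, but because a correlated proxy does the predictive work, the decision rates still differ sharply across groups.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 50_000

# Hypothetical data: a sensitive attribute, a proxy correlated with it
# (think zip code), and an outcome that depends on the proxy.
sensitive = rng.integers(0, 2, size=n)
proxy = sensitive + rng.normal(scale=1.0, size=n)
outcome = (proxy + rng.normal(scale=1.0, size=n) > 0.5).astype(int)

# "Fairness through blindness": train without the sensitive attribute.
model = LogisticRegression().fit(proxy.reshape(-1, 1), outcome)
pred = model.predict(proxy.reshape(-1, 1))

# The disparity survives, because the proxy carries the information.
print("positive rate, group 0:", pred[sensitive == 0].mean())
print("positive rate, group 1:", pred[sensitive == 1].mean())
```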

References

Barocas, Solon, and Andrew D. Selbst. “Big data’s disparate impact.” (2016).

Dieterich, William, Christina Mendoza, and Tim Brennan. “COMPAS risk scales: Demonstrating accuracy equity and predictive parity.” Northpointe Inc (2016).

Friedman, Batya, and Helen Nissenbaum. “Bias in computer systems.” ACM Transactions on Information Systems (TOIS) 14.3 (1996): 330-347.

Hardt, Moritz, Eric Price, and Nati Srebro. “Equality of opportunity in supervised learning.” Advances in Neural Information Processing Systems. 2016.

Hartney, Christopher, and Linh Vuong. “Created equal: Racial and ethnic disparities in the US criminal justice system.” (2009).

Kleinberg, Jon, Sendhil Mullainathan, and Manish Raghavan. “Inherent trade-offs in the fair determination of risk scores.” arXiv preprint arXiv:1609.05807 (2016).

algorithmic law and pragmatist legal theory: Oliver Wendell Holmes Jr. “The Path of the Law”

Several months ago I was taken by the idea that in the future (and maybe depending on how you think about it, already in the present) laws should be written as computer algorithms. While the idea that “code is law” and that technology regulates is by no means original, what I thought perhaps provocative is the positive case for the (re-)implementation of the fundamental laws of the city or state in software code.

The argument went roughly like this:

  • Effective law must control a complex society
  • Effective control requires social and political prediction.
  • Unassisted humans are not good at social and political prediction. For this conclusion I drew heavily on Philip Tetlock’s work in Expert Political Judgment.
  • Therefore laws, in order to keep pace with the complexity of society, should be implemented as technical systems capable of bringing data and machine learning to bear on social control.

Science fiction is full of both dystopias and utopias in which society is literally controlled by a giant, intelligent machine. Avoiding either extreme, I just want to make the modest point that there may be scalability problems with law and regulation based on discourse in natural language. To some extent the failure of the state to provide sophisticated, personalized regulation in society has created myriad opportunities for businesses to fill these roles. Now there’s anxiety about the relationship between these businesses and the state as they compete for social regulation. To the extent that businesses are less legitimate rulers of society than the state, it seems a practical, technical necessity that the state adopt the same efficient technologies for regulation that businesses have. To do otherwise is to become obsolete.

There are lots of reasons to object to this position. I’m interested in hearing yours and hope you will comment on this and future blog posts or otherwise contact me with your considered thoughts on the matter. To me the strongest objection is that the whole point of the law is that it is based on precedent, and so any claim about the future trajectory of the law has to be based on past thinking about the law. Since I am not a lawyer and I know precious little about the law, you shouldn’t listen to my argument because I don’t know what I’m talking about. Q.E.D.

My counterargument to this is that there are lots of academics who opine about things they don’t have particular expertise in. One way to get away with this is by deferring to somebody else who has credibility in the field of interest. This is just one of several reasons why I’ve been reading “The Path of the Law”, a classic essay about pragmatist legal theory written in 1897 by Oliver Wendell Holmes Jr., who would later serve on the Supreme Court.

One of the key points of this essay is that it is a mistake to consider the study of law the study of morality per se. Rather, the study of law is the attempt to predict the decisions that courts will make in the future, based on the decisions courts have made in the past. What courts actually decide is based in part on legal precedent but also on the unconscious inclinations of judges and juries. In ambiguous cases, different legal framings of the same facts will be in competition, and the judgment will give weight to one interpretation or another. Perhaps the judge will attempt to reconcile these differences into a single, logically consistent code.

I’d like to take up the arguments of this essay again in later blog posts, but for now I want to focus on the concept of legal study as prediction. I think this demands focus because while Holmes, like most American pragmatists, had a thorough and nuanced understanding of what prediction is, our mathematical understanding of prediction has come a long way since 1897. Indeed, it is a direct consequence of these formalizations and implementations of predictive systems that we today see so much tacit social regulation performed by algorithms. We know now that effective prediction depends on access to data and the computational power to process it according to well-known algorithms. These algorithms can optimize themselves to such a degree that their specific operations are seemingly beyond the comprehension of the people affected by them. Some lawyers have argued that this complexity should not be allowed to exist.

What I am pointing to is a fundamental tension between the requirement that practitioners of the law be able to predict legal outcomes, and the fact that the logic of the most powerful predictive engines today is written in software code, not words. This is because of physical properties of computation and prediction that are not likely to ever change. And since a powerful predictive engine can just as easily use its power to be strategically unpredictable, this presents an existential challenge to the law. It may simply be impossible for lawyers acting as human lawyers have for hundreds of years to effectively predict, and therefore regulate, powerful computational systems.

One could argue that this means that such powerful computational systems should simply be outlawed. Indeed this is the thrust of certain lawyers’ arguments. But if we believe that these systems are not going to go away, perhaps because they won’t allow us to regulate them out of existence, then our only viable alternative to suffering under their lawless control is to develop a competing system of computational legalism with the legitimacy of the state.

programming and philosophy of science

Philosophy of science is a branch of philosophy largely devoted to the demarcation problem: what is science?

I’ve written elsewhere about why and how, in the social sciences, demarcation is highly politicized and often under attack. This is becoming pertinent now especially as computational methods become dominant across many fields and challenge the bases of disciplinary distinction. Today, a lot of energy (at UC Berkeley at least) goes into maintaining the disciplinary social sciences even when this makes those fields less scientific than they could be, for the sake of preserving atavistic disciplinary traits.

Other energy (also at UC Berkeley, and elsewhere) goes into using computer programs to explore data about the social world in an undisciplinary way. This isn’t to say that specific theoretical lenses don’t inform these studies. Rather, the lenses are used provisionally and not in an exclusive way. This lack of disciplinary attachment is an important aspect of data science as applied to the social world.

One reason why disciplinary lenses are not very useful for the practicing data scientist is that, much like natural scientists, data scientists are more often than not engaged in technical inquiry whose purpose is prediction and control. This is very different from, for example, engaging an academic community in a conversation in a language they understand or that pays appropriate homage to a particular scholarly canon–the sort of thing one needs to do to be successful in an academic context. For much academic work, especially in the social sciences, the process of research publication, citation, and promotion is inherently political.

More often than not, these politics are not essential to scientific inquiry itself; rather they have to do with the allocation of what Bourdieu calls temporal capital: grant funding, access, appointments, etc. within the academic field. Scientific capital, the symbolic capital awarded to scientists based on their contributions to trans-historical knowledge, is awarded more on the basis of the success of an idea than by, for example, brown-nosing one’s superiors. However, since temporal capital in the academy is organized by disciplines as a function of university bureaucratic organization, academic researchers are required to contort themselves to disciplinary requirements in the presentation of their work.

Contrast this with the work of analysing social data using computers. The tools used by computational social scientists tend to be products of the exact sciences (mathematics, statistics, computer science) with no further disciplinary baggage. The intellectual work of scientifically devising and testing theories against the data happens in a language most academic communities would not recognize as a language at all, and certainly not as their language. While this work depends on the work of thousands of others who have built vast libraries of functional code, these ubiquitous contributors are not included in any social science discipline’s scholarly canon. They are uncited, taken for granted.

However, when those libraries are made openly available (and they often are), they participate in a larger open source ecosystem of tools whose merits are judged by their practical value. Returning to our theme of the demarcation problem, the question is: is this science?

I would answer: emphatically yes. Programming is science because, as Peter Naur has argued, programming is theory building (hat tip the inimitable Spiros Eliopoulos for the reference). The more deeply we look into the demarcation problem, the more clearly software engineering practice comes into focus as an extension of a scientific method of hypothesis generation and testing. Software is an articulation of ideas, and the combined works of software engineers are a cumulative science that has extended far beyond the bounds of the university.

data science is not positivist, it’s power

Naively, we might assume that contemporary ‘data science’ is a form of positivist or post-positivist science. The scientist gathers data and subsumes it under logical formulae–models with fitted parameters. Indeed this is the case when data science is applied to natural phenomena, such as stars or the human genome.

The question of what kind of science ‘data science’ is becomes much more complex when we start to look at its application to social phenomena. This includes its application to the management of industrial and commercial technology–the so called “Internet of Things“. (Technology in general, and especially technology as situated socially, being a social phenomenon.)

There are (at least) two reasons why data science in these social domains is not strictly positivist.

The first is that, according to McKinsey’s Michael Chui, data science in the Internet of Things context is mainly about either real-time control or anomaly detection. Neither of these depends on the kind of nomothetic orientation that positivism requires. The former requires only an objective function over inputs to guide the steering of the dynamic system. The latter requires only the detection of deviation from historically observed patterns.
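A minimal sketch of the two modes, with entirely hypothetical numbers, may make the contrast clearer: neither one asserts or discovers a general law about the system.

```python
import numpy as np

# Anomaly detection as deviation from historically observed patterns:
# flag readings more than k standard deviations from the historical
# mean (data and threshold are hypothetical).
def is_anomaly(reading, history, k=3.0):
    mu, sigma = np.mean(history), np.std(history)
    return abs(reading - mu) > k * sigma

# Real-time control as optimization of an objective over inputs:
# a proportional controller steering toward a setpoint.
def control_step(state, setpoint, gain=0.1):
    error = setpoint - state          # the objective: minimize |error|
    return state + gain * error       # nudge the system toward the setpoint

history = np.random.default_rng(3).normal(20.0, 1.0, size=1000)
print(is_anomaly(27.5, history))      # True: far outside the usual pattern

state = 15.0
for _ in range(50):
    state = control_step(state, setpoint=20.0)
print(round(state, 2))                # converges toward the setpoint
```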

‘Data science’ applied in this context isn’t actually about the discovery of knowledge at all. It is not, strictly speaking, a science. Rather, it is a process through which the operations of existing technologies are related and improved by further technological interventions. Robust positivist engineering knowledge is applied to these cases. But however much the machines may ‘learn’, what they learn is not propositional.

Perhaps the best we can say is that ‘data science’ in this context is the science of techniques for making these kinds of interventions. As learning these techniques depends on mathematical rigor and empirical prototyping, we can say perhaps of the limited sense of ‘pure’ (not applied) data science that it is a positivist science.

But the second reason why data science is not positivist comes about as a result of its application. The problem is that when systems controlled by complex computational processes interact, the result is a more complex system. In adversarial cases, the interacting complex systems become the subject matter of cybersecurity research, to which data science is one application. But as soon as one starts to study phenomena that are aware of the observer and can act in ways that respond to its presence, one is out of positivist territory.

A better way to think about data science might be to think of it in terms of perception. In the visual system, data that comes in through the eye goes through many steps of preprocessing before it becomes the subject of attention. Visual representations feed into the control mechanisms of movement.

If we see data science not as a positivist attempt to discover natural laws, but rather as an extension of agency by expanding powers of perception and training skillful control, then we can get a picture of data science that’s consistent with theories of situated and embodied cognition.

These theories of situated and embodied cognition are perhaps the best contenders for what can displace the dominant paradigm as imagined by critics of cognitive science, economics, etc. Rather than rejecting the explanatory power of naturalistic theories of information processing, these theories extend naive theories to embrace the complexity of how an agent’s cognition is situated in a body in time, space, and society.

If we start to think of ‘data science’ not as a kind of natural science but as the techniques and tools for extending the information processing that is involved in one’s individual or collective agency, then we can start to think about data science as what it really is: power.

Fascinated by Vijay Narayanan’s talk at #DataEDGE

As I write this I’m watching a talk by Vijay Narayanan, Director of Algorithms and Data Science Solutions at Microsoft, at the DataEDGE conference at UC Berkeley.

The talk is about “The Data Science Economy.” It began with a history of the evolution of the human centralized nervous system. He then went on to show the centralizing trend of the data economy: data collection will become more mobile, while data processing will be done in the cloud. This data will be sifted by software and used to power a marketplace of services, which ultimately deliver intelligence to their users.

It was wonderful to see somebody so in the know reaffirming what has been a suspicion I’ve had since starting graduate school but have found little support for in the academic setting. The suspicion is that what’s needed to accurately model the data science economy is a synthesis of cognitive science and economics that can show the comparative market value and competitiveness of different services.

This is not out of the mainline of information technology, management science, computer science, and other associated disciplines that have been at the nexus of business and academia for 70 years. It’s an intellectual tradition that’s rooted in the 1940’s cybernetics vision of Norbert Wiener and was going strong in the social sciences as late as Beniger‘s The Control Revolution, which, like Narayanan, draws an explicit connection between information processing in the brain and information processing in the microprocessor–notably while acknowledging the intermediary step of bureaucracy as a large-scale information processing system.

There’s significant cross-pollination between engineering, economics, computer science, and cognitive psychology. I’ve read papers from, say, the Education field in the late 80’s and early 90’s that refer to this collectively as “the dominant paradigm”. At UC Berkeley today, it’s fascinating to see departmental politics play out over ‘data science’ that echo some of these concerns: a powerful alliance of ideas is getting mobilized by industry and governments while other disciplines are struggling to find relevance.

It’s possible that these specialized disciplinary discourses are important for the cultivation of thought that is important for its insight despite being fundamentally impractical. I’m coming to a different view: that maybe the ‘dominant paradigm’ is dominant because it is scientifically true, and that other disciplinary orientations are suffering because they are based on unsound theory. If disciplines that are ‘dominated’ by another paradigm are floundering because they are, to put it simply, wrong, then that is a very elegant explanation for what’s going on.

The ramification of this is that what’s needed is not a number of alternatives to ‘the dominant paradigm’. What’s needed is for scholars to double down on the dominant paradigm and learn how to express in its logic the complexities and nuances that the other disciplines have been designed to capture. What we can hope for, in terms of intellectual continuity, is the preservation of what’s best in older ideas in a creative synthesis with the foundational principles of computer science and mathematical biology.

causal inference in networks is hard

I am trying to make statistically valid inferences about the mechanisms underlying observational networked data and it is really hard.

Here’s what I’m up against:

  • Even though my data set is a complete ecologically valid data set representing a lot of real human communication over time, it (tautologically) leaves out everything that it leaves out. I can’t even count all the latent variables.
  • The best methods for detecting causal mechanisms, based on the potential outcomes framework of the Rubin causal model, depend on the assumption that different members of the sample don’t interfere with one another (the stable unit treatment value assumption). But I’m working with networked data. Everything interferes with everything else, at least indirectly. That’s why it’s a network.
  • Did I mention that I’m working with communications data? What’s interesting about human communication is that it’s not really generated at random at all. It’s very deliberately created by people acting more or less intelligently all the time. If the phenomenon I’m studying is not more complex than the models I’m using to study it, then there is something seriously wrong with the people I’m studying.

I think I can deal with the first point here by gracefully ignoring it. It may be true that any apparent causal effect in my data is spurious and due to a common latent cause upstream. It may be true that the variance in the data is largely due to exogenous factors. Fine. That’s noise. I’m looking for a reliable endogenous signal. If there isn’t one, that would suggest that my entire data set is epiphenomenal. But I know it’s not. So there’s got to be something there.

For the second point, there are apparently sophisticated methods for extending the potential outcomes framework to handle peer effects. These are gnarly, and though I figure I could work with them, I don’t think they are going to be what I need, because I’m not really looking for a causal relationship in the sense of a statistical relationship between treatment and outcome. I’m not after, in the first instance, what might be called type causation. I’m rather trying to demonstrate cases of token causation, where causation is literally the transfer of information from one object to another. And then I’m trying to show regularity in this underlying kind of causation in a layer of abstraction over it.

The best angle I can come up with on this so far is to use emergent properties of the network like degree assortativity to sort through potential mathematically defined graph generation algorithms. These algorithms can act as alternative hypotheses, and the observed emergent properties can theoretically be used to compute the likelihood of the observed data given the generation methods. Then all I need is a prior over graph generation methods! It’s perfectly Bayesian! I wonder if it is at all feasible to execute on. I will try.
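Here is a rough sketch of what that comparison could look like, using networkx. The candidate models, the single summary statistic, and the distance-based scoring are all illustrative simplifications; a full version would put a prior over generation methods and compute likelihoods rather than this approximate-Bayesian-computation-style distance.

```python
import networkx as nx
import numpy as np

def assortativity(g):
    return nx.degree_assortativity_coefficient(g)

# Candidate graph generation algorithms, treated as alternative
# hypotheses (these particular models are just illustrative choices).
candidates = {
    "erdos_renyi": lambda n, m: nx.gnm_random_graph(n, m),
    "barabasi_albert": lambda n, m: nx.barabasi_albert_graph(n, max(1, m // n)),
}

def score_candidates(observed, n_sims=20):
    """ABC-style comparison: how close does each model's simulated
    assortativity come to the observed graph's assortativity?"""
    obs_stat = assortativity(observed)
    n, m = observed.number_of_nodes(), observed.number_of_edges()
    scores = {}
    for name, gen in candidates.items():
        sims = [assortativity(gen(n, m)) for _ in range(n_sims)]
        scores[name] = abs(np.mean(sims) - obs_stat)
    return obs_stat, scores

# A stand-in for the real communication network.
observed = nx.barabasi_albert_graph(500, 3)
print(score_candidates(observed))
```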

It’s not 100% clear how you can take an algorithmically defined process and turn that into a hypothesis about causal mechanisms. Theoretically, as long as a causal network has computable conditional dependencies it can be represented by an algorithm. I believe that any algorithm (in the Church/Turing sense) can be represented as a causal network. Can this be done elegantly, so that the corresponding causal network represents something like what we’d expect from the scientific theory on the matter? This is unclear because, again, Pearl’s causal networks are great at representing type causation but not as expressive for token causation among a large population of uniquely positioned, generatively produced stuff. Pearl is not good at modeling life, I think.

The strategic activity of the actors is a modeling challenge but I think this is actually where there is substantive potential in this kind of research. If effective strategic actors are working in a way that is observably different from naive actors in some way that’s measurable in aggregate behavior, that’s a solid empirical result! I have some hypotheses around this that I think are worth checking. For example, probably the success of an open source community depends in part on whether members of the community act in ways that successfully bring new members in. Strategies that cultivate new members are going to look different from strategies that exclude newcomers or try to maintain a superior status. Based on some preliminary results, it looks like this difference between successful open source projects and most other social networks is observable in the data.
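As a sketch of the kind of aggregate observable I have in mind (the measure and the message format here are hypothetical, not taken from my actual data set), one crude indicator of newcomer cultivation is the share of replies that go to first-time posters:

```python
from collections import defaultdict

def newcomer_reply_share(messages):
    """Share of replies directed at first-time posters.

    `messages` is a time-ordered list of (sender, replied_to) pairs,
    with replied_to=None for thread-starting posts. A reply counts as
    a reply to a newcomer if the person being replied to had posted
    exactly once so far. A crude, hypothetical proxy for newcomer
    cultivation, not an established metric.
    """
    post_counts = defaultdict(int)
    replies_to_newcomers = 0
    total_replies = 0
    for sender, replied_to in messages:
        if replied_to is not None:
            total_replies += 1
            if post_counts[replied_to] == 1:
                replies_to_newcomers += 1
        post_counts[sender] += 1
    return replies_to_newcomers / total_replies if total_replies else 0.0

# Toy log: three replies go to people on their first post; the final
# reply to A does not, because A has posted before.
log = [("A", None), ("B", "A"), ("C", None), ("B", "C"),
       ("D", None), ("A", "D"), ("C", "A")]
print(newcomer_reply_share(log))  # 0.75
```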

Know-how is not interpretable so algorithms are not interpretable

I happened upon Hildreth and Kimble’s “The duality of knowledge” (2002) earlier this morning while writing this and have found it thought-provoking through to lunch.

What’s interesting is that it is (a) 12 years old, (b) a rather straightforward analysis of information technology, expert systems, ‘knowledge management’, etc. in light of solid post-Enlightenment thinking about the nature of knowledge, and (c) an anticipation of the problems of ‘interpretability’ that were, a couple of months ago at least, an active topic of academic discussion. Or so I hear.

This is the paper’s abstract:

Knowledge Management (KM) is a field that has attracted much attention both in academic and practitioner circles. Most KM projects appear to be primarily concerned with knowledge that can be quantified and can be captured, codified and stored – an approach more deserving of the label Information Management.

Recently there has been recognition that some knowledge cannot be quantified and cannot be captured, codified or stored. However, the predominant approach to the management of this knowledge remains to try to convert it to a form that can be handled using the ‘traditional’ approach.

In this paper, we argue that this approach is flawed and some knowledge simply cannot be captured. A method is needed which recognises that knowledge resides in people: not in machines or documents. We will argue that KM is essentially about people and the earlier technology driven approaches, which failed to consider this, were bound to be limited in their success. One possible way forward is offered by Communities of Practice, which provide an environment for people to develop knowledge through interaction with others in an environment where knowledge is created nurtured and sustained.

The authors point out that Knowledge Management (KM) is an extension of the earlier program of Artificial Intelligence and depends on a model of knowledge which maintains that knowledge can be explicitly represented and hence stored and transferred; they propose an alternative way of thinking about things based on the Communities of Practice framework.

A lot of their analysis is about the failures of “expert systems”, which is a term that has fallen out of use but means basically the same thing as the contemporary uncomputational scholarly use of ‘algorithm’. An expert system was a computer program designed to make decisions about things. Broadly speaking, a search engine is a kind of expert system. What’s changed are the particular techniques and algorithms that such systems employ, and their relationship with computing and sensing hardware.

Here’s what Hildreth and Kimble have to say about expert systems in 2002:

Viewing knowledge as a duality can help to explain the failure of some KM initiatives. When the harder aspects are abstracted in isolation the representation is incomplete: the softer aspects of knowledge must also be taken into account. Hargadon (1998) gives the example of a server holding past projects, but developers do not look there for solutions. As they put it, ‘the important knowledge is all in people’s heads’, that is the solutions on the server only represent the harder aspects of the knowledge. For a complete picture, the softer aspects are also necessary. Similarly, the expert systems of the 1980s can be seen as failing because they concentrated solely on the harder aspects of knowledge. Ignoring the softer aspects meant the picture was incomplete and the system could not be moved from the environment in which it was developed.

However, even knowledge that is ‘in people’s heads’ is not sufficient – the interactive aspect of Cook and Seely Brown’s (1999) ‘knowing’ must also be taken into account. This is one of the key aspects to the management of the softer side to knowledge.

In 2002, this kind of argument was seen as a valuable critique of artificial intelligence and the practices based on it as a paradigm. But already by 2002 this paradigm was falling away. Statistical computing, reinforcement learning, decision tree bagging, etc. were already in use at this time. These methods are “softer” in that they don’t require the “hard” concrete representations of the earlier artificial intelligence program, which I believe by that time was already referred to as “Good Old Fashioned AI” or GOFAI by a number of practitioners.

(I should note–that’s a term I learned while studying AI as an undergraduate in 2005.)

So throughout the 90’s and the 00’s, if not earlier, ‘AI’ transformed into ‘machine learning’ and became the implementation of ‘soft’ forms of knowledge. These systems are built to learn to perform a task optimally based flexibly on feedback from past performance. They are in fact the cybernetic systems imagined by Norbert Wiener.

Perplexing, then, is the contemporary problem that the models created by these machine learning algorithms are opaque to their creators. These models were created using techniques that were designed precisely to solve the problems that systems based on explicit, communicable knowledge were meant to solve.

If you accept the thesis that contemporary ‘algorithms’-driven systems are well-designed implementations of ‘soft’ knowledge systems, then you get some interesting conclusions.

First, forget about interpreting the learned models of these systems and testing them for things like social discrimination, which is apparently in vogue. The right place to focus attention is on the function being optimized. All these feedback-based systems–whether they are based on evolutionary algorithms, or convergence on local maxima, or reinforcement learning, or whatever–are designed to optimize some goal function. That goal function is the closest thing you will get to an explicit representation of the purpose of the algorithm. It may change over time, but it should be coded there explicitly.
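A toy sketch of the point (not any real system’s objective; the revenue data and fairness penalty are made up): the weights that come out of the optimization may be unreadable, but the goal function itself is a few explicit lines, and that is where questions about purpose belong.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical data: features, revenue per positive decision, and a
# group label used only in the penalty term below.
X = rng.normal(size=(2000, 3))
revenue = X[:, 0] + rng.normal(scale=0.5, size=2000)
group = (X[:, 1] > 0)

def objective(w, lam=1.0):
    """The explicit goal function: expected revenue from the decisions,
    minus a penalty for treating the two groups differently on average.
    Everything contestable about the system's purpose is written here,
    not buried in the learned weights."""
    decisions = 1 / (1 + np.exp(-(X @ w)))          # soft decisions
    expected_revenue = np.mean(decisions * revenue)
    disparity = abs(decisions[group].mean() - decisions[~group].mean())
    return expected_revenue - lam * disparity

# Crude random-search "training": the learned w may be hard to read,
# but the objective it optimizes is right there above.
best_w, best_val = None, -np.inf
for _ in range(5000):
    w = rng.normal(size=3)
    val = objective(w)
    if val > best_val:
        best_w, best_val = w, val
print(best_w, round(best_val, 3))
```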

Interestingly, this is exactly the sense of ‘purpose’ that Wiener proposed could be applied to physical systems in the landmark 1943 essay he published with Rosenblueth and Bigelow, “Behavior, Purpose and Teleology.” Sly devil.

EDIT: An excellent analysis of how fairness can be represented as an explicit goal function can be found in Dwork et al. 2011.

Second, because what the algorithm is designed to optimize is generally going to be something like ‘maximize ad revenue’ and not anything explicitly pernicious like ‘screw over the disadvantaged people’, this line of inquiry will raise some interesting questions about, for example, the relationship between capitalism and social justice. By “raise some interesting questions”, I mean, “reveal some uncomfortable truths everyone is already aware of”. Once it becomes clear that the whole discussion of “algorithms” and their inscrutability is just a way of talking about societal problems and entrenched political interests without talking about them directly, it will probably be tabled due to its political infeasibility.

That is (and I guess this is the third point) unless somebody can figure out how to explicitly define the social justice goals of the activists/advocates into a goal function that could be implemented by one of these soft-touch expert systems. That would be rad. Whether anybody would be interested in using or investing in such a system is an important open question. Not a wide open question–the answer is probably “Not really”–but just open enough to let some air onto the embers of my idealism.

Horkheimer and Wiener

[I began writing this weeks ago and never finished it. I’m posting it here in its unfinished form just because.]

I think I may be condemning myself to irrelevance by reading so many books. But as I make an effort to read up on the foundational literature of today’s major intellectual traditions, I can’t help but be impressed by the richness of their insight. Something has been lost.

I’m currently reading Norbert Wiener’s The Human Use of Human Beings (1950) and Max Horkheimer’s Eclipse of Reason (1947). The former I am reading for the Berkeley School of Information Classics reading group. Norbert Wiener was one of the foundational mathematicians of 20th century information technology, a colleague of Claude Shannon. Out of his own sense of social responsibility, he articulated his predictions for the consequences of the technology he developed in Human Use. This work was the foundation of cybernetics, an influential school of thought in the 20th century. Terrell Bynum, in his Stanford Encyclopedia of Philosophy article on “Computer and Information Ethics“, attributes to Wiener’s cybernetics the foundation of all future computer ethics. (I think that the threads go back earlier, at least through to Heidegger’s Question Concerning Technology. (EDIT: Actually, QCT was published, it seems, in 1954, after Wiener’s book.)) It is hard to find a straight answer to the question of what happened to cybernetics. By some reports, the artificial intelligence community cut its NSF funding in the 60’s.

Horkheimer is one of the major thinkers of the very influential Frankfurt School, the postwar social theorists at the core of intellectual critical theory. Of the Frankfurt School, perhaps the most famous in the United States is Adorno. Adorno is also the most caustic and depressed, and unfortunately much of popular critical theory now takes on his character. Horkheimer is more level-headed. Eclipse of Reason is an argument about the ways that philosophical empiricism and pragmatism became complicit in fascism. Here is an interesting quotation.

It is very interesting to read them side by side. Published only a few years apart, Wiener and Horkheimer are giants of two very different intellectual traditions. There’s little reason to expect they ever communicated (a more thorough historian would know more). But each makes sweeping claims about society, language, and technology and contextualizes them in broader intellectual awareness of religion, history and science.

Horkheimer writes about how the collapse of the Enlightenment project of objective reason has opened the way for a society ruled by subjective reason, which he characterizes as the reason of formal mathematics and scientific thinking that is neutral to its content. It is instrumental thinking in its purest, most rigorous form. His descriptions of it sound like gestures to what we today call “data science”–a set of mechanical techniques that we can use to analyze and classify anything, perfecting our understanding of technical probabilities towards whatever ends one likes.

I find this a more powerful critique of data science than recent paranoia about “algorithms”. It is frustrating to read something over sixty years old that covers the same ground as we are going over again today but with more composure. Mathematized reasoning about the world is an early 20th century phenomenon and automated computation a mid-20th century phenomenon. The disparities in power that result from the deployment of these tools were thoroughly discussed at the time.

But today, at least in my own intellectual climate, it’s common to hear a mention of “logic” with the rebuttal “whose logic?“. Multiculturalism and standpoint epistemology, profoundly important for sensitizing researchers to bias, are taken to an extreme that glorifies technical ignorance. If the foundation of knowledge is in one’s lived experience, as these ideologies purport, and one does not understand the technical logic used so effectively by dominant identity groups, then one can dismiss technical logic as merely the cultural logic of an opposing identity group. I experience the technically competent person as the Other and cannot perceive their actions as skill but only as power, and in particular power over me. Because my lived experience is my surest guide, what I experience must be so!

It is simply tragic that the education system has promoted this kind of thinking so much that it pervades even mainstream journalism. This is tragic for reasons I’ve expressed in “objectivity is powerful“. One solution is to provide more accessible accounts of the lived experience of technicality through qualitative reporting, which I have attempted in “technical work“.

But the real problem is that the kind of formal logic that is at the foundation of modern scientific thought, including its most recent manifestation, ‘data science’, is at its heart perfectly abstract and so cannot be captured by accounts of observed practices or lived experience. It is reason, or thought. Is it disembodied? Not exactly. But at least according to constructivist accounts of mathematical knowledge, which occupy a fortunate dialectical position in this debate, mathematical insight is built from embodied phenomenological primitives that, through their psychological construction, become abstract. This process makes it possible for people to learn abstract principles such as the mathematical theory of information on which so much of the contemporary telecommunications and artificial intelligence apparatus depends. These are the abstract principles with which the mathematician Norbert Wiener was so intimately familiar.

textual causation

A problem that’s coming up for me as a data scientist is the problem of textual causation.

There has been significant interesting research into the problem of extracting causal relationships between things in the world from text about those things. That’s an interesting problem but not the problem I am talking about.

I am talking about the problem of identifying when a piece of text has been the cause of some event in the world. So, did the State of the Union address affect the stock prices of U.S. companies? Specifically, did the text of the State of the Union address affect the stock price? Did my email cause my company to be more productive? Did specifically what I wrote in the email make a difference?

A trivial example of textual causation (if I have my facts right–maybe I don’t) is the calculation of Twitter trending topics. Millions of users write text. That text is algorithmically scanned and under certain conditions, Twitter determines a topic to be trending and displays it to more users through its user interface, which also uses text. The user interface text causes thousands more users to look at what people are saying about the topic, increasing the causal impact of the original text. And so on.
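As a toy illustration of that feedback loop (this is not Twitter’s actual trending algorithm; the threshold and boost factor are made up):

```python
import numpy as np

rng = np.random.default_rng(5)

def simulate_trending(base_rate, threshold, boost, steps=20):
    """Toy feedback loop: when mentions of a topic cross a threshold,
    the topic is displayed, which causes additional mentions on the
    next step. Purely illustrative, not any platform's real algorithm."""
    mentions_per_step = []
    displayed = False
    for _ in range(steps):
        rate = base_rate * (boost if displayed else 1.0)
        mentions = int(rng.poisson(rate))
        mentions_per_step.append(mentions)
        displayed = mentions >= threshold
    return mentions_per_step

print(simulate_trending(base_rate=50, threshold=70, boost=3.0))
# Once the threshold is crossed, the displayed text amplifies the
# causal impact of the original mentions.
```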

These are some challenges to understanding the causal impact of text:

  • Text is an extraordinarily high-dimensional space with tremendous irregularity in distribution of features.
  • Textual events are unique not just because the probability of any particular utterance is so low, but also because the context of an utterance is informed by all the text prior to it.
  • For the most part, text is generated by a process of unfathomable complexity and interpreted likewise.
  • A single ‘piece’ of text can appear and reappear in multiple contexts as distinct events.

I am interested in whether it is possible to get a grip on textual causation mathematically and with machine learning tools. Bayesian methods theoretically can help with the prediction of unique events. And the Pearl/Rubin model of causation is well integrated with Bayesian methods. But is it possible to use the Pearl/Rubin model to understand unique events? The methodological uses of Pearl/Rubin I’ve seen are all about establishing type causation between independent occurrences. Textual causation appears to be as a rule a kind of token causation in a deeply integrated contextual web.

Perhaps this is what makes the study of textual causation uninteresting. If it does not generalize, then it is difficult to monetize. It is a matter of historical or cultural interest.

But think about all the effort that goes into communication at, say, the operational level of an organization. How many jobs require “excellent communication skills”? A great deal of emphasis is placed not only on whether communication happens, but on how people communicate.

One way to approach this is using the tools of linguistics. Linguistics looks at speech and breaks it down into components and structures that can be scientifically analyzed. It can identify when there are differences in these components and structures, calling these differences dialects or languages.