The Crevasse: a meditation on accountability of firms in the face of opacity as the complexity of scale

To recap:

(A1) Beneath corporate secrecy and user technical illiteracy, a fundamental source of opacity in “algorithms” and “machine learning” is the complexity of scale, especially scale of data inputs. (Burrell, 2016)

(A2) The opacity of the operation of companies using consumer data makes those consumers unable to engage with them as informed market actors. The consequence has been a “free fall” of market failure (Strandburg, 2013).

(A3) Ironically, this “free” fall has been “free” (zero price) for consumers; they appear to get something for nothing without knowing what has been given up or changed as a consequence (Hoofnagle and Whittington, 2013).


(B1) The above line of argument conflates “algorithms”, “machine learning”, “data”, and “tech companies”, as is common in the broad discourse. That this conflation is possible speaks to the ignorance of the scholarly position on these topics, and ignorance that is implied by corporate secrecy, technical illiteracy, and complexity of scale simultaneously. We can, if we choose, distinguish between these factors analytically. But because, from the standpoint of the discourse, the internals are unknown, the general indication of a ‘black box’ organization is intuitively compelling.

(B1a) Giving in to the lazy conflation is an error because it prevents informed and effective praxis. If we do not distinguish between a corporate entity and its multiple internal human departments and technical subsystems, then we may confuse ourselves into thinking that a fair and interpretable algorithm can give us a fair and interpretable tech company. Nothing about the former guarantees the latter because tech companies operate in a larger operational field.

(B2) The opacity as the complexity of scale, a property of the functioning of machine learning algorithms, is also a property of the functioning of sociotechnical organizations more broadly. Universities, for example, are often opaque to themselves, because of their own internal complexity and scale. This is because the mathematics governing opacity as a function of complexity and scale are the same in both technical and sociotechnical systems (Benthall, 2016).

(B3) If we discuss the complexity of firms, as opposed the the complexity of algorithms, we should conclude that firms that are complex due to scale of operations and data inputs (including number of customers) will be opaque and therefore have strategic advantage in the market against less complex market actors (consumers) with stiffer bounds on rationality.

(B4) In other words, big, complex, data rich firms will be smarter than individual consumers and outmaneuver them in the market. That’s not just “tech companies”. It’s part of the MO of every firm to do this. Corporate entities are “artificial general intelligences” and they compete in a complex ecosystem in which consumers are a small and vulnerable part.


(C1) Another source of opacity in data is that the meaning of data come from the causal context that generates it. (Benthall, 2018)

(C2) Learning causal structure from observational data is hard, both in terms of being data-intensive and being computationally complex (NP). (c.f. Friedman et al., 1998)

(C3) Internal complexity, for a firm, is not sufficient to be “all-knowing” about the data that is coming it; the firm has epistemic challenges of secrecy, illiteracy, and scale with respect to external complexity.

(C4) This is why many applications of machine learning are overrated and so many “AI” products kind of suck.

(C5) There is, in fact, an epistemic crevasse between all autonomous entities, each containing its own complexity and constituting a larger ecological field that is the external/being/environment for any other autonomy.

To do:

The most promising direction based on this analysis is a deeper read into transaction cost economics as a ‘theory of the firm’. This is where the formalization of the idea that what the Internet changed most are search costs (a kind of transaction cost) should be.

It would be nice if those insights could be expressed in the mathematics of “AI”.

There’s still a deep idea in here that I haven’t yet found the articulation for, something to do with autopoeisis.


Benthall, Sebastian. (2016) The Human is the Data Science. Workshop on Developing a Research Agenda for Human-Centered Data Science. Computer Supported Cooperative Work 2016. (link)

Sebastian Benthall. Context, Causality, and Information Flow: Implications for Privacy Engineering, Security, and Data Economics. Ph.D. dissertation. Advisors: John Chuang and Deirdre Mulligan. University of California, Berkeley. 2018.

Burrell, Jenna. “How the machine ‘thinks’: Understanding opacity in machine learning algorithms.” Big Data & Society 3.1 (2016): 2053951715622512.

Friedman, Nir, Kevin Murphy, and Stuart Russell. “Learning the structure of dynamic probabilistic networks.” Proceedings of the Fourteenth conference on Uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc., 1998.

Hoofnagle, Chris Jay, and Jan Whittington. “Free: accounting for the costs of the internet’s most popular price.” UCLA L. Rev. 61 (2013): 606.

Strandburg, Katherine J. “Free fall: The online market’s consumer preference disconnect.” U. Chi. Legal F. (2013): 95.


open source sustainability and autonomy, revisited

Some recent chats with Chris Holdgraf and colleagues at NYU interested in “critical digital infrastracture” have gotten me thinking again about the sustainability and autonomy of open source projects again.

I’ll admit to having had naive views about this topic in the past. Certainly, doing empirical data science work on open source software projects has given me a firmer perspective on things. Here are what I feel are the hardest earned insights on the matter:

  • There is tremendous heterogeneity in open source software projects. Almost all quantitative features of these projects fall in log-normal distributions. This suggests that the keys to open source software success are myriad and exogenous (how the technology fits in the larger ecosystem, how outside funding and recognition is accomplished, …) rather than endogenous factors (community policies, etc.) While many open source projects start as hobby and unpaid academic projects, those that go on to be successful find one or more funding sources. This funding is an exogenous factor.
  • The most significant exogenous factors to an open source software project’s success are the industrial organization of private tech companies. Developing an open technology is part of the strategic repertoire of these companies: for example, to undermine the position of a monopolist, developing an open source alternative decreases barriers to market entry and allows for a more competitive field in that sector. Another example: Google funded Mozilla for so long arguably to deflect antitrust action over Google Chrome.
  • There is some truth to Chris Kelty’s idea of open source communities as recursive publics, cultures that have autonomy that can assert political independence at the boundaries of other political forces. This autonomy comes from: the way developers of OSS get specific and valuable human capital in the process of working with the software and their communities; the way institutions begin to depend on OSS as part of their technical stack, creating an installed base; and how many different institutions may support the same project, creating competition for the scarce human capital of the developers. Essentially, at the point where the software and the skills needed to deploy it effectively and the community of people with those skills is self-organized, the OSS community has gained some economic and political autonomy. Often this autonomy will manifest itself in some kind of formal organization, whether a foundation, a non-profit, or a company like Redhat or Canonical or Enthought. If the community is large and diverse enough it may have multiple organizations supporting it. This is in principle good for the autonomy of the project but may also reflect political tensions that can lead to a schism or fork.
  • In general, since OSS development is internally most often very fluid, with the primary regulatory mechanism being the fork, the shape of OSS communities is more determined by exogenous factors than endogenous ones. When exogenous demand for the technology rises, the OSS community can find itself with a ‘surplus’, which can be channeled into autonomous operations.