Digifesto

Category: academia

frustrations with machine ethics

It’s perhaps because of the contemporary two cultures problem of tech and the humanities that machine ethics is in such a frustrating state.

Today I read danah boyd’s piece in The Message about technology as an arbiter of fairness. It’s yet another baffling conflation of data science with neoliberalism. This time, the assertion was that the ideology of the tech industry is neoliberalism, hence its idea of ‘fairness’ is individualist and set against the social fabric. It’s not clear what backs up these kinds of assertions. They are more or less refuted by the fact that industrial data science is obsessed with our network of ties for marketing reasons. If anybody understands the failure of the myth of the atomistic individual, it’s “tech folks,” a category boyd uses to capture, I guess, everyone from marketing people at Google to venture capitalists to startup engineers to IBM researchers. You know, the homogeneous category that is “tech folks.”

This kind of criticism makes the mistake of thinking that a historic past is the right way to understand a rapidly changing present that is often more technically sophisticated than the critics understand. But critical academics have fallen into the trap of critiquing neoliberalism over and over again. One problem is that tech folks don’t spend a ton of time articulating their ideology in ways that are convenient for pop culture critique. Often their business models require rather sophisticated understandings of the market, etc. that don’t fit readily into that kind of mold.

What’s needed is substantive progress in computational ethics. OK, so algorithms are ethically and politically important. What politics would you like to see enacted, and how do you go about implementing that? How do you do it in a way that attracts new users and is competitively funded, so that it can keep up with the changing technology we use to access the web? These are the real questions. There is so little effort spent trying to answer them. Instead there’s just an endless series of op-eds bemoaning the way things continue to be bad, because that is easier than having agency about making things better.

more on algorithms, judgment, polarization

I’m still pondering the most recent Tufekci piece about algorithms and human judgment on Twitter. It prompted some grumbling among data scientists. Sweeping statements about ‘algorithms’ do that, since to a computer scientist ‘algorithm’ is about as general a term as ‘math’.

In later conversation, Tufekci clarified that when she was calling out the potential problems of algorithmic filtering of the Twitter newsfeed, she was speaking to the problems of a newsfeed curated algorithmically for the sake of maximizing ‘engagement’. Or ads. Or, it is apparent on a re-reading of the piece, new members. She thinks an anti-homophily algorithm would maybe be a good idea, but that this is so unlikely given the commercial logic of Twitter as to be a marginal point. And, meanwhile, she defends ‘human prioritization’ over algorithmic curation, despite the fact that homophily (not to mention preferential attachment) is arguably a negative consequence of a social system driven by human judgment.

I think inquiry into this question is important, but bound to be confusing to those who aren’t familiar in a deep way with network science, machine learning, and related fields. It’s also, I believe, helpful to have a background in cognitive science, because that’s a field which maintains that human judgment and computational systems are doing fundamentally commensurable kinds of work. When we think in a sophisticated way about crowdsourced labor, we use this sort of thinking. We acknowledge, for example, that human brains are better at the computational task of image recognition, so we employ Turkers to look at and label images. But those human judgments are then inputs to statistical processes that verify and check those judgments against each other. Later, the determinations that result from this combination of human judgment and algorithmic processing could be used in a search engine–which returns answers to questions based on human input. Search engines, then, are also a way of combining human and purely algorithmic judgment.
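To make that commensurability concrete, here is a minimal sketch of checking crowdsourced labels against each other by majority vote (hypothetical data and names, not any production pipeline):

    from collections import Counter

    # Several workers label each image independently; the "statistical
    # process" here is a majority vote plus an agreement score that can
    # flag images whose labels need further human judgment.
    labels = {
        "img_001.jpg": ["cat", "cat", "dog"],
        "img_002.jpg": ["dog", "dog", "dog"],
    }

    def aggregate(votes):
        winner, count = Counter(votes).most_common(1)[0]
        return winner, count / len(votes)

    for image, votes in labels.items():
        label, agreement = aggregate(votes)
        print(image, label, agreement)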

What it comes down to is that virtually all of our interactions with the internet are built around algorithmic affordances. And these systems can be understood systematically if we reject the quantitative/qualitative divide at the ontological level. Reductive physicalism entails this rejection, but–and this is not to be understated–pisses off or alienates people who do qualitative or humanities research.

This is old news. C.P. Snow’s The Two Cultures. The Science Wars. We’ve been through this before. Ironically, the polarization is algorithmically visible in the contemporary discussion about algorithms.*

The Two Cultures on Twitter?

It’s I guess not surprising that STS and cultural studies academics are still around and in opposition to the hard scientists. What’s maybe new is how much computer science now affects the public, and how the popular press appears to have allied itself with the STS and cultural studies view. I guess this must be because cultural anthropologists and media studies people are more likely to become journalists and writers, whereas harder science is pretty abstruse.

There’s an interesting conflation now, from the soft side of the culture wars, of science with power/privilege/capitalism, and it plays out again and again. I bump into it in the university context. I read about it all the time. Tufekci’s pessimism that the only algorithmic filtering Twitter would adopt would be one that essentially obeys the logic “of Wall Street” is, well, sad. It’s sad that an unfortunate pairing that is analytically contingent is historically determined to be so.

But there is also something deeply wrong about this view. Of course there are humanitarian scientists. Of course there is a nuanced center to the science wars “debate”. It’s just that the tedious framing of the science wars has been so pervasive and compelling, like a commercial jingle, that it’s hard to feel like there’s an audience for anything more subtle. How would you even talk about it?

* I need to confess: I think there was some sloppiness in that Medium piece. If I had had more time, I would have done something to check which conversations were actually about the Tufekci article, and which were just about whatever. I feel I may have misrepresented this in the post. For the sake of accessibility or to make the point, I guess. Also, I’m retrospectively skittish about exactly how distinct a cluster the data scientists were, and whether its insularity might have been an artifact of the data collection method. I’ve been building out poll.emic in fits mainly as a hobby. I built it originally because I wanted to at last understand Weird Twitter’s internal structure. The results were interesting but I never got around to writing them up. Now I’m afraid that the culture has changed so much that I wouldn’t recognize it any more. But I digress. Is it even notable that social scientists from different disciplines would have very different social circles around them? Is the generalization too much? And are there enough nodes in this graph to make it a significant thing to say about anything, really? There could be thousands of academic tiffs I haven’t heard about that are just as important but which defy my expectations and assumptions. Or is the fact that Medium appears to have endorsed a particular small set of public intellectuals significant? How many Medium readers are there? Not as many as there are Twitter users, by several orders of magnitude, I expect. Who matters? Do academics matter? Why am I even studying these people as opposed to people who do more real things? What about all the presumably sane and happy people who are not pathologically on the Internet? Etc.

a response to “Big Data and the ‘Physics’ of Social Harmony” by @doctaj; also Notes towards ‘Criticality as ideology’

I’ve been thinking over Robin James’ “Big Data & the ‘Physics’ of Social Harmony“, an essay in three sections. The first discusses Singapore’s use of data science to detect terrorists and public health threats for the sake of “social harmony,” as reported by Harris in Foreign Policy. The second ties together Plato, Pentland’s “social physics”, and neoliberalism. The last discusses the limits to individual liberty proposed by J.S. Mill. The author admits it’s “all over the place.” I get the sense that it is a draft towards a greater argument. It is very thought-provoking and informative.

I take issue with a number of points in the essay. Underlying my disagreement is what I think is a political difference about the framing of “data science” and its impact on society. Since I am a data science practitioner who takes my work seriously, I would like this framing to be nuanced, recognizing both the harm and help that data science can do. I would like the debate about data science to be more concrete and pragmatic so that practitioners can use this discussion as a guide to do the right thing. I believe this will require discussion of data science in society to be informed by a technical understanding of what data science is up to. However, I think it’s also very important that these discussions rigorously take up the normative questions surrounding data science’s use. It’s with this agenda that I’m interested in James’ piece.

James is a professor of Philosophy and Women’s/Gender Studies and the essay bears the hallmarks of these disciplines. Situated in a Western and primarily anglophone intellectual tradition, it draws on Plato and Mill for its understanding of social harmony and liberalism. At the same time, it has the political orientation common to Gender Studies, alluding to the gendered division of economic labor, at times adopting Marxist terminology, and harboring a suspicion of authoritarian power. Plato is read as being the intellectual root of a “particular neoliberal kind of social harmony” that is “the ideal that informs data science.” James contrasts this ideal with the ideal of individual liberty, as espoused and then limited by Mill.

Where I take issue with James is that I think this line of argument is biased by its disciplinary formation. (Since this is more or less a truism for all academics, I suppose this is less a rebuttal than a critique.) Where I believe this is most visible is in her casting of Singapore’s ideal of social harmony as an upgrade of Plato, via the ideology of neoliberalism. She does not consider in the essay that Singapore’s ideal of social harmony might be rooted in Eastern philosophy, not Western philosophy. Though I have no special access or insight into the political philosophy of Singapore, this seems to me to be an important omission given that Singapore is ethnically 74.2% Chinese, with a Buddhist plurality.

Social harmony is a central concept in Eastern, especially Chinese, philosophy with deep roots in Confucianism and Daoism. A great introduction for those with a background in Western philosophy who are interested in the philosophical contributions of Confucius is Fingarette’s Confucius: The Secular as Sacred. Fingarette discusses how Confucian thought is a reaction to the social upheaval and war of Ancient China’s Warring States Period, roughly 475 – 221 BC. Out of these troubling social conditions, Confucian thought attempts to establish conditions for peace. These include ritualized forms of social interaction at whose center is a benevolent Emperor.

There are many parallels with Plato’s political philosophy, but Fingarette makes a point of highlighting where Confucianism is different. In particular, the role of social ritual and ceremony as the basis of society is at odds with Western individualism. Political power is not a matter of contest of wills but the proper enactment of communal rites. It is like a dance. Frequently, the word “harmony” is used in the translation of Confucian texts to refer to the ideal of this functional, peaceful ceremonial society and, especially, its relationship with nature.

A thorough analysis of the use of data science for social control in light of Eastern philosophy would be an important and interesting work. I certainly haven’t done it. My point is simply that when we consider the use of data science for social control as a global phenomenon, it is dubious to see it narrowly in light of Western intellectual history and ideology. That includes rooting it in Plato, contrasting it with Mill, and characterizing it primarily as an expression of white neoliberalism. Expansive use of these Western tropes is a projection, a fallacy of “I think this way, therefore the world must.” This, I submit, is an occupational hazard for anyone who sees their work primarily as an analysis or critique of ideology.

In a lecture in 1965 printed in Knowledge and Human Interests, Habermas states:

The concept of knowledge-constitutive human interests already conjoins the two elements whose relation still has to be explained: knowledge and interest. From everyday experience we know that ideas serve often enough to furnish our actions with justifying motives in place of the real ones. What is called rationalization at this level is called ideology at the level of collective action. In both cases the manifest content of statements is falsified by consciousness’ unreflected tie to interests, despite its illusion of autonomy. The discipline of trained thought thus correctly aims at excluding such interests. In all the sciences routines have been developed that guard against the subjectivity of opinion, and a new discipline, the sociology of knowledge, has emerged to counter the uncontrolled influence of interests on a deeper level, which derive less from the individual than from the objective situation of social groups.

Habermas goes on to reflect on the interests driving scientific inquiry–“scientific” in the broadest sense of having to do with knowledge. He delineates:

  • Technical inquiry motivated by the drive for manipulation and control, or power
  • Historical-hermeneutic inquiry motivated by the drive to guide collective action
  • Critical, reflexive inquiry into how the objective situation of social groups controls ideology, motivated by the drive to be free or liberated

This was written in 1965. Habermas was positioning himself as a critical thinker; however, unlike some of the earlier Frankfurt School thinkers he drew on, he maintained that technical power was an objective human interest. (see Bohman and Rehg) In the United States especially, criticality as a mode of inquiry took aim at the ideologies that served white, bourgeois, and male power. Contemporary academic critique has since solidified as an academic discipline and wields political power. In particular, it is frequently enlisted as an expression of the interests of marginalized groups. In so doing, academic criticality has (in my view regrettably) become mere ideology. No longer interested in being scientifically disinterested, it has become a tool of rationalization. Its project is the articulation of changing historical conditions in certain institutionally recognized tropes. One of these tropes is the critique of capitalism, modernism, neoliberalism, etc. and their white male bourgeois heritage. Another is the feminist emphasis on domesticity as a dismissed form of economic production. This trope features in James’ analysis of Singapore’s ideal of social harmony:

Harris emphasizes that Singaporeans generally think that finely-tuned social harmony is the one thing that keeps the tiny city-state from tumbling into chaos. [1] In a context where resources are extremely scarce–there’s very little land, and little to no domestic water, food, or energy sources, harmony is crucial. It’s what makes society sufficiently productive so that it can generate enough commercial and tax revenue to buy and import the things it can’t cultivate domestically (and by domestically, I really mean domestically, as in, by ‘housework’ or the un/low-waged labor traditionally done by women and slaves/servants.) Harmony is what makes commercial processes efficient enough to make up for what’s lost when you don’t have a ‘domestic’ supply chain. (emphasis mine)

To me, this parenthetical is quite odd. There are other uses of the word “domestic” that do not specifically carry the connotation of women and slave/servants. For example, the economic idea of gross domestic product just means “an aggregate measure of production equal to the sum of the gross values added of all resident institutional units engaged in production (plus any taxes, and minus any subsidies, on products not included in the value of their outputs).” Included in that production is work done by men and high-wage laborers. To suggest that natural resources are primarily exploited by “domestic” labor in the ‘housework’ sense is bizarre given, say, agribusiness, industrial mining, etc.

There is perhaps an interesting etymological relationship here; does our use of ‘domestic’ in ‘domestic product’ have its roots in household production? I wouldn’t know. Does that same etymological root apply in Singapore? Was agriculture traditionally the province of household servants in China and Southeast Asia (as opposed to independent farmers and their sons)? Regardless, domestic agricultural production is not housework now. So it’s mysterious that this detail should play a role in explaining Singapore’s emphasis on social harmony today.

So I think it’s safe to say that this parenthetical remark by James is due to her disciplinary orientation and academic focus. Perhaps it is a contortion to satisfy the audience of Cyborgology, which has a critical left-leaning politics. Harris’s original article does not appear to support this interpretation. Rather, it only uses the word ‘harmony’ twice, and maintains a cultural sensitivity that James’ piece lacks, noting that Singapore’s use of data science may be motivated by a cultural fear of loss or risk.

The colloquial word kiasu, which stems from a vernacular Chinese word that means “fear of losing,” is a shorthand by which natives concisely convey the sense of vulnerability that seems coded into their social DNA (as well as their anxiety about missing out — on the best schools, the best jobs, the best new consumer products). Singaporeans’ boundless ambition is matched only by their extreme aversion to risk.

If we think that Harris is closer to the source here, then we do not need the projections of Western philosophy and neoliberal theory to explain what is really meant by Singapore’s use of data science. Rather, we can look to Singapore’s culture and perhaps its ideological origins in East Asian thinking. Confucius, not Plato.

* * *

If there is a disciplinary bias in American philosophy departments, it is that they exist to reproduce anglophone philosophy. This is a point that James has recently made herself…in fact while I have been in the process of writing this response.

Though I don’t share James’ political project, generally speaking I agree that effort spent on the reproduction of disciplinary terminology is not helpful to the philosophical and scientific projects. Terminology should be deployed for pragmatic reasons in service to objective interests like power, understanding, and freedom. On the other hand, language requires consistency to be effective, and education requires language. My own personal conclusion is that the scientific project can only be sustained now through disciplinary collapse.

When James suggests that old terms like metaphysics and epistemology prevent the de-centering of the “white supremacist/patriarchal/capitalist heart of philosophy”, she perhaps alludes to her recent coinage of “epistemontology” as a combination of epistemology and ontology, as a way of designating what neoliberalism is. She notes that she is trying to understand neoliberalism as an ideology, not as a historical period, and finds useful the definition that “neoliberals think everything in the universe works like a deregulated, competitive, financialized capitalist market.”

However helpful a philosophical understanding of neoliberalism as market epistemontology might be, I wonder whether James sees the tension between her statements about rejecting traditional terminology that reproduces the philosophical discipline and her interest in preserving the idea of “neoliberalism” in a way that can be taught in an introduction to philosophy class, a point she makes in a blog comment later. It is, perhaps, in the act of teaching that a discipline is reproduced.

The use of neoliberalism as a target of leftist academic critique has been challenged relatively recently. Craig Hickman, in a blog post about Luis Suarez-Villa, writes:

In fact Williams and Srinicek see this already in their first statement in the interview where they remind us that “what is interesting is that the neoliberal hegemony remains relatively impervious to critique from the standpoint of the latter, whilst it appears fundamentally unable to counter a politics which would be able to combat it on the terrain of modernity, technology, creativity, and innovation.” That’s because the ball has moved and the neoliberalist target has shifted in the past few years. The Left is stuck in waging a war it cannot win. What I mean by that is that it is at war with a target (neoliberalism) that no longer exists except in the facades of spectacle and illusion promoted in the vast Industrial-Media-Complex. What is going on in the world is now shifting toward the East and in new visions of technocapitalism of which such initiatives as Smart Cities by both CISCO (see here) and IBM and a conglomerate of other subsidiary firms and networking partners to build new 21st Century infrastructures and architectures to promote creativity, innovation, ultra-modernity, and technocapitalism.

Let’s face it capitalism is once again reinventing itself in a new guise and all the Foundations, Think-Tanks, academic, and media blitz hype artists are slowly pushing toward a different order than the older market economy of neoliberalism. So it’s time the Left begin addressing the new target and its ideological shift rather than attacking the boogeyman of capitalism’s past. Oh, true, the façade of neoliberalism will remain in the EU and U.S.A. and much of the rest of the world for a long while yet, so there is a need to continue our watchdog efforts on that score. But what I’m getting at is that we need to move forward and overtake this new agenda that is slowly creeping into the mix before it suddenly displaces any forms of resistance. So far I’m not sure if this new technocapitalistic ideology has even registered on the major leftist critiques beyond a few individuals like Luis Suarez-Villa. Mark Bergfield has a good critique of Suarez-Villa’s first book on Marx & Philosophy site: here.

In other words, the continuation of capitalist domination is due to its evolution relative to the stagnation of intellectual critiques of it. Or to put it another way, privilege is the capacity to evolve and not merely reproduce. Indeed, the language game of academic criticality is won by those who develop and disseminate new tropes through which to represent the interests of the marginalized. These privileged academics accomplish what Lyotard describes as “legitimation through paralogy.”

* * * * *

If James were working merely within academic criticality, I would be less interested in the work. But her aspirations appear to be higher, in a new political philosophy that can provide normative guidance in a world where data science is a technical reality. She writes:

Mill has already made–in 1859 no less–the argument that rationalizes the sacrifice of individual liberty for social harmony: as long as such harmony is enforced as a matter of opinion rather than a matter of law, then nobody’s violating anybody’s individual rights or liberties. This is, however, a crap argument, one designed to limit the possibly revolutionary effects of actually granting individual liberty as more than a merely formal, procedural thing (emancipating people really, not just politically, to use Marx’s distinction). For example, a careful, critical reading of On Liberty shows that Mill’s argument only works if large groups of people–mainly Asians–don’t get individual liberty in the first place. [2] So, critiquing Mill’s argument may help us show why updated data-science versions of it are crap, too. (And, I don’t think the solution is to shore up individual liberty–cause remember, individual liberty is exclusionary to begin with–but to think of something that’s both better than the old ideas, and more suited to new material/technical realities.)

It’s because of these more universalist ambitions that I think it’s fair to point out the limits of her argument. If a government’s idea of “social harmony” is not in fact white capitalist but premodern Chinese, if “neoliberalism” is no longer the dominant ideology but rather an idea of an ideology reproduced by a stagnating academic discipline, then these ideas will not help us understand what is going on in the contemporary world in which ‘data science’ is allegedly of such importance.

What would be better than this?

There is an empirical reality to the practices of data science. Perhaps it should be studied on its own terms, without disciplinary baggage.

picking a data backend for representing email in #python

I’m at a difficult crossroads with BigBang where I need to pick an appropriate data storage backend for my preprocessed mailing list data.

There are a lot of different aspects to this problem.

The first and most important consideration is speed. If you know anything about computer science, you know that it exists to quickly execute complex tasks that would take too long to do by hand. It’s odd writing that sentence since computational complexity considerations are so fundamental to algorithm design that this can go unspoken in most technical contexts. But since coming to grad school I’ve found myself writing for a more diverse audience, so…

The problem I’m facing is that in doing exploratory data analysis, I do not know all the questions I am going to ask yet. But any particular question will be impractical to ask unless I tune the underlying infrastructure to answer it. This chicken-and-egg problem means that the process of inquiry is necessarily constrained by the engineering options that are available.

This is not new in scientific practice. Notoriously, the field of economics in the 20th century was shaped by what was analytically tractable as formal, mathematical results. The nuance of contemporary modeling of complex systems is due largely to the fact that we now have computers to do this work for us. That means we can still have the intersubjectively verified rigor that comes with mathematization without trying to fit square pegs into round holes. (Side note: something mathematicians acknowledge that others tend to miss is that mathematics is based on dialectic proof and intersubjective agreement. This makes it much closer epistemologically to something like history as a discipline than it is to technical fields dedicated to prediction and control, like chemistry or structural engineering. Computer science is in many ways an extension of mathematics. Obviously, these formalizations are then applied to great effect. Their power comes from their deep intersubjective validity–in other words, their truth. Disciplines that have dispensed with intersubjective validity as a grounds for truth claims in favor of a more nebulous sense of diverse truths in a manifold of interpretation have difficulty understanding this, and so are likely to see the institutional gains of computer scientists as a result of political manipulation, as opposed to something more basic: mastery of nature, or more provocatively, use of force. This disciplinary dysfunction is one reason why these groups see their influence erode.)

For example, I have determined that in order to implement a certain query on the data efficiently, it would be best if another query were constant time. One way to do this is to use a database with an index.
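For instance, with a hypothetical schema (not BigBang’s actual one), a SQLite index makes lookups by sender cheap, at the cost of extra setup:

    import sqlite3

    conn = sqlite3.connect("emails.db")
    conn.execute("""CREATE TABLE IF NOT EXISTS messages
                    (msg_id TEXT, sender TEXT, date TEXT, body TEXT)""")
    # The index lets the query below avoid a full scan of the table.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_sender ON messages (sender)")

    def messages_from(sender):
        return conn.execute(
            "SELECT msg_id, date FROM messages WHERE sender = ?",
            (sender,)).fetchall()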

However, setting up a database is something that requires extra work on the part of the programmer and so makes it harder to reproduce results. So far I have been keeping my processed email data “in memory” after it is pulled from files on the file system. This means that I have access to the data within the programming environment I’m most comfortable with, without depending on an external or parallel process. Fewer moving parts means that it is simpler to do my work.
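A minimal sketch of that in-memory approach, assuming mbox archives on disk (illustrative, not BigBang’s actual code):

    import mailbox

    # Load a whole archive into plain Python structures: no external
    # database process, nothing extra for someone reproducing results.
    def load_archive(path):
        return [{"sender": msg["From"],
                 "subject": msg["Subject"],
                 "date": msg["Date"]}
                for msg in mailbox.mbox(path)]

    archive = load_archive("list-archive.mbox")

    # An ordinary dict doubles as an index: constant-time lookup by
    # sender, as long as everything fits in memory.
    by_sender = {}
    for message in archive:
        by_sender.setdefault(message["sender"], []).append(message)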

So there is a tradeoff between the computational time of the software as it executes and the time and attention it takes me (and others who want to reproduce my results) to set up the environment in which the software runs. Since I am running this as an open source project and hope others will build on my work, I have every reason to be lazy, in a certain sense. Every inconvenience I suffer is one that will be suffered by everyone who follows me. There is a Kantian categorical imperative to keep things as simple as possible for people, to take any complex procedure and replace it with a script, so that others can do original creative thinking, solve the next problem. This is the imperative that those of us embedded in this culture have internalized. (G. Coleman notes that there are many cultures of hacking; I don’t know how prevalent these norms are, to be honest; I’m speaking from my experience.) It is what makes this process of developing our software infrastructure a social one, with a modernist sense of progress. We are part of something that is being built out.

There are also social and political considerations. I am building this project intentionally in a way that is embedded within the Scientific Python ecosystem, as they are also my object of study. Certain projects are trendy right now, and for good reason. At the Python Worker’s Party at Berkeley last Friday, I saw a great presentation of Blaze. Blaze is a project that allows programmers experienced with older idioms of scientific Python programming to transfer their skills to systems that can handle more data, like Spark. This is exciting for the Python community. In such a fast-moving field with multiple interoperating ecosystems, there is always the anxiety that one’s skills are no longer the best skills to have. Has your expertise been made obsolete? So there is a huge demand for tools that adapt one way of thinking to a new system. As more data has become available, people have engineered new sophisticated processing backends. Often these are not done in Python, which has a reputation for being very usable and accessible but slow in operation. Getting the usable programming interface to interoperate with the carefully engineered data backends is hard work, work that Matt Rocklin is doing while being paid by Continuum Analytics. That is sweet.

I’m eager to try out Blaze. But as I think through the questions I am trying to ask about open source projects, I’m realizing that they don’t fit easily into the kind of data processing that Blaze currently supports. Perhaps I’m just being dense. If I knew better what I was asking, I could maybe figure out how to make it fit. But probably, what I’m looking at is data that is not “big”, that does not need the kind of power that these new tools provide. Currently my data fits on my laptop. It even fits in memory! Shouldn’t I build something that works well for what I need it for, and not worry about scaling at this point?

But I’m also trying to think long-term. What happens if and when it does scale up? What if I want to analyze ALL the mailing list data? Is that “big” data?

“Premature optimization is the root of all evil.” – Donald Knuth

the research lately

I’ve been working hard.

I wrote a lot, consolidating a lot of thinking about networked public design, digital activism, and Habermas. A lot of the thinking was inspired by Xiao Qiang’s course over a year ago, then a conversation with Nathan Mathias and Brian Keegan on Twitter, then building @TheTweetserve for Theorizing the Web. Interesting how these things accrete.

Through this, I think I’ve gotten a deeper understanding of Habermas’ system/lifeworld distinction than I’ve ever had before. Where I’m still weak on this is on his understanding of the role of law. There’s an angle in there about technology as regulation (a la Lessig) that ties things back to the recursive public. But of course Habermas was envisioning the normal kind of law–the potentially democratic law. Since the I School engages more with policy than it does with technicality, it would be good to have sharper thinking about this besides vague notions of the injustice or not of “the system”–how much of this rhetoric is owed to Habermas or the people he’s drawing on?

My next big writing project is going to be about Piketty and intellectual property, I hope. This is another argument that I’ve been working out for a long time–as an undergrad working on microeconomics of intellectual property, on the job at OpenGeo reading Lukacs for some reason, in grad school coursework. I tried to write something about this shortly after coming back to school but it went nowhere, partly because I was using anachronistic concepts and partly because the term “hacker” got weird political treatment due to some anti-startup yellow journalism.

The name of the imagined essay is “Free Capital.” It will try to trace the economic implications of free software and other open access technical designs, especially their impact on the relationship between capital and labor. It’s sort of an extension of this. I feel like there is more substance there to dig out, especially around liquidity and vendor and employer lock-in. I’m imagining engaging some of the VC strategy press–I’ve been following the thinking of Kanyi Maqbela for a long time and always learning from it.

What I need to home in on, in terms of economic modeling, is under what conditions it’s in labor’s interest to work to produce open source IP or ‘free capital’, under what conditions it is in capital’s interest to invest in free capital, and what the macroeconomic implications of this are. It’s clear that capital will invest in free capital in order to unseat a monopoly–take Android, for instance, or Firefox–but this is (a) unstable and (b) difficult to take into account in measures of economic growth, since the gains in this case are to be had in the efficiency of the industrial organization rather than in the value of the innovation itself. Meanwhile, Matt Asay has been saying for years that the returns on open source investment are not high enough to attract serious investment, and industry experience appears to bear that out.

Meanwhile, Piketty argues that the main force for convergence in income is technology and skills diffusion. But these are exogenous to his model. Meanwhile, here in the Bay Area the gold rush rages on, and at least word on the grapevine is that VCs are having a harder and harder time finding high-return investments, and are sinking their money into lamer and lamer teams of recent Stanford undergrads.

My weakness in these arguments is that I don’t have data and don’t even know what predictions I’m making. It’s dangerously theoretical.

Meanwhile, my actual dissertation work progresses…slowly. I managed to get a lot done to get my preliminary results with BigBang ready for SciPy 2014. Since then I’ve switched it over to favor an Anaconda build and use IPython Notebooks internally–all good architectural changes, but it’s yak shaving. Now I’m hitting performance issues and need to make some serious decisions about databases and data structures.

And then there’s the social work around it. The instincts are good ones–that I should be working on accessibility, polishing my communication, trying to encourage collaborators’ interest. I know how to start an open source project, and it requires that. But then–what about the research? What about the whole point of the thing? When I talked with Dave Kush today, he pointed me towards research on computational discourse analysis, which is where I think this needs to go. The material felt way over my head, a reminder that I’ve been barking up so many trees that are not where I think the real problem to work on is. Mainly because I’ve been caught up in the politics of things. It’s bewildering how enriching but distracting the academic context is–how many barriers there are to sitting and doing your best work. Petty disciplinary disputes, for example.

Theorizing the Web and SciPy conferences compared

I’ve just been through two days of tutorials at SciPy 2014–that stands for Scientific Python (the programming language). The last conference I went to was Theorizing the Web 2014. I wonder if I’m the first person to ever go to both conferences. Since I see my purpose in grad school as being a bridge node, I think it’s worthwhile to write something comparing the two.

Theorizing the Web was held in a “gorgeous warehouse space” in Williamsburg, the neighborhood of Brooklyn, New York that was full of hipsters ten years ago and now is full of baby carriages but still has gorgeous warehouse spaces and loft apartments. The warehouse spaces are actually gallery spaces that only look like warehouses from the outside. On the inside of the one where TtW was held, whole rooms with rounded interior corners were painted white, perhaps for a photo shoot. To call it a “warehouse” is to invoke the blue collar and industrial origins that Brooklyn gentrifiers appeal to in order to distinguish themselves from the elites in Manhattan. During my visit to New York for the conference, I crashed on a friend’s air mattress in the Brooklyn neighborhood I had been gentrifying just a few years earlier. The speakers included empirical scientific researchers, but these were not the focus of the event. Rather, the emphasis was on theorizing in a way that is accessible to the public. The most anticipated speaker was a porn actress. Others were artists or writers of one sort or another. One was a sex worker who then wrote a book. Others were professors of sociology and communications. Another was a Buzzfeed editor.

SciPy is taking place in the AT&T Education and Conference Center in Austin, Texas, near the UT Austin campus. I’m writing from the adjoining hotel. The conference rooms we are using are in the basement; they seat many in comfortable mesh rolling chairs on tiers so everybody can see the dual projector screens. The attendees are primarily scientists who do computationally intensive work. One is a former marine biologist who now mainly does bioinformatics. Another team does robotics. Another does image processing on electron microscope images of chromosomes. They are not trying to be accessible to the public. What they are trying to teach is hard enough to get across to others with similar expertise. It is a small community trying to enlarge itself by teaching others its skills.

At Theorizing the Web, the rare technologist spoke up to talk about the dangers of drones. In the same panel, it was pointed out how the people designing medical supply drones for use in foreign conflict zones were considering coloring them white, not black, to make them less intimidating. The implication was that drone designers are racist.

It’s true that the vast majority of attendees of the conference are white and male. To some extent, this is generational. Both tutorials I attended today–one of them on software for modeling multi-body dynamics, useful for designing things like walking robots–were interracial and taught by guys around my age. The audience has some older folks. These are not necessarily academics, but may be industry types or engineers whose firms are paying them to attend to train on cutting-edge technology.

The afterparty the first night of Theorizing the Web was in a dive bar in Williamsburg. Brooklyn’s Williamsburg has dive bars the same way Virginia’s Williamsburg has a colonial village–they are a cherished part of its cultural heritage. But the venue was alienating for some. One woman from abroad confided to me that she was intimidated by how cool the bar felt. It was my duty as an American and a former New Yorker to explain that Williamsburg stopped being cool a long time ago.

I’m an introvert and am initially uneasy in basically any social setting. Tonight’s SciPy afterparty was in the downtown office of Enthought, in the Bank of America building. Enthought’s digs are on the 21st floor, with spacious personal offices and lots of whiteboards which display serious use. As an open source product/consulting/training company, it appears to be doing quite well. I imagine really cool people would find it rather banal.

I don’t think it’s overstating things to say that Theorizing the Web serves mainly those skeptical of the scientific project. Knowledge is conceived of as a threat to the known. One panelist at TtW described the problem of “explainer” sites–web sites whose purpose is to explain things that are going on to people who don’t understand them–when they try to translate cultural phenomena that they don’t understand. It was argued that even in cases where these cultural events are public, to capture that content and provide an interpretation or narration around it can be exploitative. Later, Kate Crawford, a very distinguished scholar on civic media, spoke to a rapt audience about the “conjoint anxieties” of Big Data. The anxieties of the watched are matched by the anxieties of the watchmen–like the NSA and, more implicitly, Facebook–who must always seek out more data in order to know things. The implication is that their political or economic agenda is due to a psychological complex–damning if true. In a brilliant rhetorical move that I didn’t quite follow, she tied this in to normcore, which I’m pretty sure is an Internet meme about a fake “fashion” trend in New York. Young people in New York go gaga for irony like this. For some reason earlier this year hipsters ironically wearing unstylish clothing became notable again.

I once met somebody from L.A. who told me their opinion of Brooklyn was that all nerds gathered in one place and thought they could decide what cool was just by saying so. At the time I had only recently moved to Berkeley and was still adjusting. Now I realize how parochial that zeitgeist is, however much I may still identify with it some.

Back in Austin, I have interesting conversations with folks at the SciPy party. One conversation is with two social scientists (demographic observation: one man, one woman) from New York that work on statistical analysis of violent crime in service to the city. They talk about the difficulty of remaining detached from their research subjects, who are eager to assist with the research somehow, though this would violate the statistical rigor of their study. Since they are doing policy research, objectivity is important. They are painfully aware of the limitations of their methods and the implications this has on those their work serves.

Later, I’m sitting alone when I’m joined by an electrical engineer turned programmer. He’s from Tennessee. We talk shop for a bit but the conversation quickly turns philosophical–about the experience of doing certain kinds of science, the role of rationality in human ethics, whether religion is an evolved human impulse and whether that matters. We are joined by a bioinformatics researcher from Paris. She tells us later that she has an applied math/machine learning background.

The problem in her field, she explains, is that for rare diseases it is very hard to find genetic causes because there isn’t enough data to do significant inference. Genomic data is very highly dimensional–thousands of genes–and for some diseases there may be less than fifty cases to study. Machine learning researchers are doing their best to figure out ways for researchers to incorporate “prior knowledge”–theoretical understanding from beyond the data available–to improve their conclusions.
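As I understand it, one standard way to encode such prior knowledge is regularization that presumes only a few genes matter. Here is a sketch with scikit-learn on fake data (my illustration, not her actual methods):

    import numpy as np
    from sklearn.linear_model import Lasso

    # Fake data: 50 cases, 5000 genes. With far more dimensions than
    # samples, unpenalized regression is hopeless; the L1 penalty
    # encodes the prior belief that only a handful of genes matter.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((50, 5000))
    y = 2.0 * X[:, 0] + rng.standard_normal(50)  # only gene 0 is causal

    model = Lasso(alpha=0.1).fit(X, y)
    selected = np.nonzero(model.coef_)[0]
    print("genes selected:", selected[:10])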

Over meals the past couple days I’ve been checking Twitter, where a lot of the intellectuals who organize Theorizing the Web or are otherwise prominent in that community are active. One extended conversation is about the relative failure of the open source movement to produce compelling consumer products. My theory is that this has to do with business models and the difficulty of coming up with upfront capital investment. But emotionally my response to that question is that it is misplaced: consumer products are trivial. Who cares?

Today, folks on Twitter are getting excited about using Adorno’s concept of the culture industry to critique Facebook’s emotional contagion experiment and other media manipulation. I find this both encouraging–it’s about time the Theorizing the Web community learned to embrace Frankfurt School thought–and baffling, because I believe they are misreading Adorno. The culture industry is that sector of the economy that produces cultural products, like Hollywood and television production companies. On the Internet, the culture industry is Buzzfeed, the Atlantic, and to a lesser extent (though this is surely masked by its own ideology) The New Inquiry. My honest opinion for a long time has been that the brand of “anticapitalist” criticality indulged in on-line is a politically impotent form of entertainment equivalent to the soap opera. A concept more appropriate for understanding Facebook’s role in controlling access to news and the formation of culture is Habermas’ idea of steering media.

He gets into this in Theory of Communicative Action, vol. 2, which is underrated in America probably due to its heaviness.

Preparing for SciPy 2014

I’ve been instructed to focus my attention on mid-level concepts rather than grand theory as I begin my empirical work.

This is difficult for me, as I tend to oscillate between thinking very big and thinking very narrowly. This is an occupational hazard of a developer. Technical minutiae accumulate into something durable and powerful. To sustain one’s motivation, one has to be able to envision one’s tiny tasks (correcting the spelling of some word in a program) as steps toward a larger project.

I’m working in my comfort zone. I’ve got my software project open on GitHub and I’m preparing to present my preliminary results at SciPy 2014 next week. A colleague and mentor I met with today told me it’s not a conference for people racking up career points. It’s a conference for people to meet each other, get an update on how their community is doing as a whole, and learn new skills from each other.

It’s been a few years since I’ve been to a developer conference. In my past career I went to FOSS4G, the open source geospatial conference, a number of times. In 2008, the conference was in South Africa. I didn’t know anybody, so I blogged about it, and got chastised for being too divisive. I wasn’t being sensitive to the delicate balance between the open source geospatial developer community and their greatest proprietary coopetitor, ESRI. I was being an ideologue at a time when the open source model in that industry was just at its inflection point and becoming mainstream. Obviously I didn’t understand the subtlety of the relationships, business and personal, threaded through the conference.

Later I attended FOSS4G in 2010 to pitch the project my team had recently launched, GeoNode. It was a very exciting time for me personally. I was very personally invested in the project, and I was so proud of my team and myself for pulling through on the beta release. In retrospect, building a system for serving spatial data modeled on a content management system seems like a no-brainer. Today there are plenty of data management startups and services out there, some industrial, some academic. But at the time we were ahead of the curve, thanks largely to the vision of Chris Holmes, who at the time was the wunderkind visionary president of OpenGeo.

Cholmes always envisioned OpenGeo turning into an anti-capitalist organization, a hacker coop with as much transparency as it could handle. If only it could get its business model right. It was incubating in a pre-crash bubble that thinned out over time. I was very into the politics of the organization when I joined it, but over time I became more cynical and embraced the economic logic I was being taught by the mature entrepreneurs who had been attracted to OpenGeo’s promise and standing in the geospatial world. While trying to wrap my head around managing developers, clients, and the budget around GeoNode, I began to see why businesses are the way they are, and how open source plays out in the industrial organization of the tech industry as a whole.

GeoNode, the project, remains a success. There is glory to that, though in retrospect I can claim little of it. I made many big mistakes and the success of the project has always been due to the very intelligent team working on it, as well as its institutional positioning.

I left OpenGeo because I wanted to be a scientist. I had spent four years there, and had found my way onto a project where we were building data plumbing for disaster reduction scientists and the military. OpenGeo had become a victim of its own success and outgrown its non-profit incubator, buckling under the weight of the demand for its services. I had deferred enrollment at Berkeley for a year to see GeoNode through to a place where it couldn’t get canned. My last major act was to raise funding for a v1.1 release that fixed the show-stopping bugs in the v1.0 version.

OpenGeo is now Boundless, a for-profit company. It’s better that way. It’s still doing revolutionary work.

I’ve been under the radar in the open source world for the three years I’ve been in grad school. But as I begin this dissertation work, I feel myself coming back to it. My research questions, in one framing, are about software ecosystem sustainability and management. I’m drawing from my experience participating in and growing open source communities and am trying to operationalize my intuitions from that work. At Berkeley I’ve discovered the scientific Python community, which I feel at home with since I learned about how to do open source from the inimitable Whit Morris, a Pythonista of the Plone cohort, among others.

After immersing myself in academia, I’m excited to get back into the open source development world. Some of the most intelligent and genuine people I’ve ever met work in that space. Like the sciences, it is a community of very smart and creative people with the privilege to pursue opportunity but with goals that go beyond narrow commercial interests. But it’s also in many ways a more richly collaborative and constructive community than the academic world. It’s not a prestige economy, where people are rewarded with scarce attention and even scarcer titles. It’s a constructive economy, where there is always room to contribute usefully, and to be recognized even in a small way for that contribution.

I’m going to introduce my research on the SciPy communities themselves. In the wake of the backlash against Facebook’s “manipulative” data science research, I’m relieved to be studying a community that has from the beginning wanted to be open about its processes. My hope is that my data scientific work will be a contribution to, not an exploitation of, the community I’m studying. It’s an exciting opportunity that I’ve been preparing for for a long time.

metaphorical problems with logical solutions

There are polarizing discourses on the Internet about the following three dichotomies:

  • Public vs. Private (information)
  • (Social) Inclusivity vs. Exclusivity
  • Open vs. Closed (systems, properties, communities)

Each of these pairings enlists certain metaphors and intuitions. Rarely are they precisely defined.

Due to their intuitive pull, it’s easy to draw certain naive associations. I certainly do. But how do they work together logically?

To what extent can we fill in other octants of this cube? Or is that way of modeling it too simplistic as well?

If privacy is about having contextual control over information flowing out of oneself, then that means that somebody must have the option of closing off some access to their information. To close off access is necessarily to exclude.

PRIVATE => ¬OPEN => ¬INCLUSIVE

But it has been argued that open sociotechnical systems exclude as well by being inhospitable to those with greater need for privacy.

OPEN => ¬PRIVATE => ¬INCLUSIVE

These conditionals limit the kinds of communities that can exist.

PRIVATE  OPEN  INCLUSIVE  POSSIBLE?
T        T     T          F
T        T     F          F
T        F     T          F
T        F     F          T
F        T     T          F
F        T     F          T
F        F     T          F
F        F     F          T
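Reading each chain above as two conditionals (PRIVATE => ¬OPEN, ¬OPEN => ¬INCLUSIVE, and likewise OPEN => ¬PRIVATE, ¬PRIVATE => ¬INCLUSIVE), the table is mechanical enough to check by enumeration. A quick sketch:

    from itertools import product

    def possible(private, open_, inclusive):
        if private and open_:          # PRIVATE => ¬OPEN
            return False
        if inclusive and not open_:    # ¬OPEN => ¬INCLUSIVE
            return False
        if inclusive and not private:  # ¬PRIVATE => ¬INCLUSIVE
            return False
        return True

    # Enumerate all eight octants of the cube; this reproduces
    # the POSSIBLE? column above.
    for octant in product([True, False], repeat=3):
        print(octant, possible(*octant))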

Social inclusivity in sociotechnical systems is impossible. There is no such thing as a sociotechnical system that works for everybody.

There are only three kinds of systems: open systems, private systems, or systems that are neither open nor private. We can call the latter leaky systems.

These binary logical relations capture only the limiting properties of these systems. If there has ever been an open system, it is the Internet; but everyone knows that even the Internet isn’t truly open because of access issues.

The difference between a private system and a leaky system is participants’ ability to control how their data escapes the system.

But in this case, systems that we call ‘open’ are often private systems, since participants choose whether or not to put information into the open.

So is the only question whether and when information is disclosed vs. leaked?

real talk

So I am trying to write a dissertation prospectus. It is going…OK.

The dissertation is on Evaluating Data Science Environments.

But I’ve been getting very distracted by the politics of data science. I have been dealing with the politics by joking about them. But I think I’m in danger of being part of the problem, when I would rather be part of the solution.

So, where do I stand on this, really?

Here are some theses:

  • There is a sense of “data science” that is importantly different from “data analytics”, though there is plenty of abuse of the term in an industrial context. That claim is awkward because industry can easily say they “own” the term. It would be useful to lay out specifically which computational methods constitute “data science” methods and which don’t.
  • I think that it’s useful analytically to distinguish different kinds of truth claims because it sheds light on the value of different kinds of inquiry. There is definitely a place for rigorous interpretive inquiry and critical theory in addition to technical, predictive science. I think politicking around these divisions is lame, and I only talk about it to make fun of the situation.
  • New computational science techniques have done and will continue to do amazing work in the physical and biological and increasingly environmental sciences. I am jealous of researchers in those fields because I think that work is awesome. For some reason I am a social scientist.
  • The questions surrounding the application of data science to social systems (which can include environmental systems) are very, very interesting. Qualitative researchers get defensive about their role in “the age of data science” but I think this is unwarranted. I think it’s the quantitative social science researchers who are likely more threatened methodologically. But since I’m not well-trained as a quantitative social scientist really, I can’t be sure of that.
  • The more I learn about research methods (which seems to be all I study these days, instead of actually doing research–I’m procrastinating), the more I’m getting a nuanced sense of how different methods are designed to address different problems. Jockeying about which method is better is useless. If there is a political battle I think is worth fighting any more, it’s the battle about whether or not transdisciplinary research is productive or possible. I hypothesize that it is. But I think this is an empirical question whose answer may be specific: how can different methods be combined effectively? I think this question gets quite deep and answering it requires getting into epistemology and statistics in a serious way.
  • What is disruptive about data science is that some people have dug down into statistics in a serious way, come up with a valid general way of analyzing things, and then automated it. That makes it in theory cheaper to pick up and apply than the quantitative techniques used by other researchers, and usable at larger scale. On the whole this is pretty good, though it is bad when people don’t understand how the tools they are using work. Automating science is a pretty good thing over all.
  • It’s really important for science, as it is automated, to be built on open tools and reproducible data, because (a) otherwise there is no reason for it to have the public’s trust, and (b) openness removes barriers to training new scientists.
  • All scientists are going to need to know how to program. I’m very fortunate to have a technical background. A technical background is not sufficient to do science well. One can use technical skills to assist in both qualitative (visualization) and quantitative work. The ability to use tools is orthogonal to the ability to study phenomena, despite the historic connection between mathematics and computer science.
  • People conflate programming, which is increasingly a social and trade skill, with the ability to grasp high level mathematical concepts.
  • Computers are awesome. The people that make them better deserve the credit they get.
  • Sometimes I think: should I be in a computer science department? I think I would feel better about my work if I were in CS. I like the feeling of tangible progress and problem solving. I think there are a lot of really important problems to solve, and that the solutions will likely come from computer science related work. What I think I get from being in a more interdisciplinary department is a better understanding of what problems are worth solving. I don’t mean that in a way that diminishes the hard work of problem solving, which I think is really where the rubber hits the road. It is easy to complain. I don’t work as hard as computer science students. I also really like being around women. I think they are great and there aren’t enough of them in computer science departments.
  • I’m interested in modeling and improving the cooperation around open scientific software because that’s where I see some real potential value-add. I’ve been an engineer and I’ve managed engineers. Managing engineers is a lot harder than engineering, IMO. That’s because management requires navigating a social system. Social systems are really absurdly complicated compared to even individual organisms.
  • There are three reasons why it might be bad to apply data science to social systems. The first is that it could lead to extraordinarily terrible death robots. My karma is on the line. The second is that the scientific models might be too simplistic and lead to bad decisions that are insensitive to human needs. That is why it is very, very important that the existing wealth of social scientific understanding is not lost but rather translated into a more robust and reproducible form. The third reason is that social science might be in principle impossible due to its self-referential effects. This would make the whole enterprise a colossal waste of time. The first and third reasons frequently depress me. The second motivates me.
  • Infrastructure and mechanism design are powerful means of social change, perhaps the most powerful. Movements are important but civil society is so paralyzed by the steering media now that it is more valuable to analyze movements as sociotechnical organizations alongside corporations etc. than to view them in isolation from the technical substrate. There are a variety of ideological framings of this position, each with different ideological baggage. I’m less concerned with that, ultimately, than the pragmatic application of knowledge. I wish people would stop having issues with “implications for design.”
  • I said I wanted to get away from politics, but this is one other political point I actually really think is worth making, though it is generally very unpopular in academia for obvious reasons: the status differential between faculty and staff is an enormous part of the dysfunction of universities. A lot of disciplinary politics are codifications of distaste for certain kinds of labor. In many disciplines, graduate students perform labor unexpertly in service of their lab’s principal investigators; this labor is a way of paying one’s dues that has little to do with the intellectual work of their research expertise. Or is it? It’s entirely unclear, especially when what makes the difference between a good researcher and a great one are skills that have nothing to do with their intellectual pursuit, and when mastering new tools is so essential for success in one’s field. But the PIs are often not able to teach these tools. What is the work of research? Who does it? Why do we consider science to be the reserve of a specialized medieval institution, and call it something else when it is done by private industry? Do academics really have a right to complain about the rise of the university administrative class?
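
To make the automation thesis above concrete, here is a minimal sketch (my own illustration, not anything from the prospectus): the same few generic lines of scikit-learn fit a regularized model and estimate its out-of-sample error on essentially any tabular dataset, which is what makes the technique cheap to pick up and apply at scale.

```python
# A minimal sketch of the "automated statistics" thesis: the same
# generic pipeline fits a model and checks how well it generalizes,
# regardless of what the columns of X actually mean.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))           # stand-in for any feature matrix
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # stand-in for any binary outcome

model = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(model, X, y, cv=5)  # automated generalization check
print(f"mean cross-validated accuracy: {scores.mean():.2f}")
```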

Sorry, that got polemical again.

Reflections on the Berkeley Institute for Data Science (BIDS) Launch

Last week was the launch of the Berkeley Institute for Data Science.

Whatever might actually happen as a result of the launch, what was said at the launch was epic.

Vice Chancellor for Research Graham Fleming introduced Chancellor Nicholas Dirks for the welcoming remarks. Dirks is UC Berkeley’s 10th Chancellor. He succeeded Robert Birgeneau, who resigned gracefully shortly after coming under heavy criticism for his handling of Occupy Cal, the Berkeley campus’s chapter of the Occupy movement. Birgeneau was distinctly unsympathetic to the protesters, and there was a widely circulated petition declaring a lack of confidence in his leadership. Birgeneau is a physicist. Dirks is an anthropologist who has championed postcolonial approaches. Within the politics of the university, which are a microcosm of politics at large, this signaling is clear. Dirks’ appointment was meant to satisfy the left-wing protesters, most of whom have been trained in softer social sciences themselves. Critical reflection on power dynamics and engagement in activism–which is often associated with leftist politics–are, at least formally, accepted by the university administration as legitimate. Birgeneau would subsequently receive awards for his leadership in drawing more women into the sciences and aiding undocumented students.

Dirks’ welcoming remarks were about the great accomplishments of UC Berkeley as a research institution and the vague but extraordinary potential of BIDS. He is grateful, as we all are, for the funding from the Moore and Sloan foundations. I found his remarks unspecific, and I couldn’t help but wonder what his true thoughts were about data science in the university. Surely he must have an opinion. As an anthropologist, can he consistently believe that data science, especially in the social sciences, is the future?

Vicki Chandler, Chief Program Officer of the Moore Foundation, was more lively. Pulling no punches, she explained that the purpose of BIDS is to shake up scientific culture. Having hung out in Berkeley in the ’60s and attended it as an undergraduate in the ’70s, she believes we are up for it. She spoke again and again of “revolution”. There is ambiguity in this. In my experience, faculty are divided on whether they see the proposed “open science” changes as imminent or hype, as desirable or dangerous. More and more I see faculty acknowledge that we are witnessing the collapse of the ivory tower. It is possible that the BIDS launch is a tipping point. What next? “Let the fun begin!” concluded Chandler.

Saul Perlmutter, Nobel laureate physicist and front man of the BIDS co-PI super group, gave his now practiced and condensed pitch for the new Institute. He hit all the high points, pointing out not only the potential of data science but the importance of changing the institutions themselves. In rethinking the peer-reviewed journal from scratch, he said, we should focus more on code reuse: software can be a valid research output. As popular as open science is among the new generation of scientists, this is a bold statement from somebody with such credibility within the university. He even said that the success of open source software is what gives us hope for the revolutionary new kind of science BIDS is beginning. Two years ago, this was a fringe idea. Perlmutter may have just made it mainstream.

Notably, he also engaged with the touchy academic politics, saying that data science could bring diversity to the sciences (though he was unspecific about the mechanism for this). He expounded on the important role of ethnography in evaluating the Institute and identifying the bottlenecks to unlocking its potential.

The man has won at physics and is undoubtedly a scientist par excellence. Perhaps Perlmutter sees the next part of his legacy as bringing the university system into the 21st century.

David Culler, Chair of the Electrical Engineering and Computer Science department, then introduced a number of academic scientists, each with impressive demonstrations about how data science could be applied to important problems like climate change and disaster reduction. Much of this research depends on using the proliferation of hand-held mobile devices as sensors. University science, I realized while watching this, is at its best when doing basic research about how to save humanity from nature or ourselves.

But for me the most interesting speakers in the first half of the launch were luminaries Peter Norvig and Tim O’Reilly, each giants in their own right and welcome guests to the university.

Culler introduced Norvig, Director of Research at Google, by crediting him as one of the inventors of the MOOC. I know his name mainly as a co-author of “Artificial Intelligence: A Modern Approach,” which I learned and taught from as an undergraduate. Amazingly, Norvig’s main message was about the economics of the digital economy. Marginal production is cheap, the cost of communication is low, and this leads to an accumulation of wealth. Fifty percent of jobs are predicted to be automated away in the coming decades. He is worried about the 99%–freely using Occupy rhetoric. What will become of them? Norvig’s solution, perhaps stated tongue in cheek, is that everyone needs to become a data scientist. More concretely, he has high hopes for hybrid teams of people and machines, and believes all professions will become like this. By defining what academic data science looks like and training the next generation of researchers, BIDS will have a role in steering the balance of power between humanity and the machines–and the elite few who own them.

His remarks hit home. He touched on anxieties that are as old as the Industrial Revolution: is somebody getting immensely rich off of these transformations, but not me? What will my role be in this transformed reality? Will I find work? These are real problems and Norvig was brave to bring them up. The academics in the room were not immune from these anxieties either, as they watch the ivory tower crumble around them. This would come up again later in the day.

I admire him for bringing up the point, and I believe he is sincere; I’d heard him make the same points on a panel with Neal Stephenson and Jaron Lanier a month or so earlier. Still, I can’t help but be critical of Norvig’s remarks. Is he covering his back? Many university professors see MOOCs as threatening to their own careers. It is encouraging that he sees the importance of hybrid human/machine teams. But if the machines are built on Google infrastructure, doesn’t this contribute to the same inequality he laments, shifting power away from teachers to the 1% at Google? Or does he foresee a MOOC-based educational boom?

He did not raise the possibility that human/machine hybridity is already the status quo–that, for example, all information workers tap away at these machines and communicate with each other through a vast technical network. If he had acknowledged that we are all cyborgs already, he would have had to admit that hybrid teams of humans and machines are as much the cause of as solution to economic inequality. Indeed, this relationship between human labor and mechanical capital is precisely the same as the one that created economic inequality in the Industrial Revolution. When the capital is privately owned, the systems of hybrid human/machine productivity favor the owner of the machines.

I have high hopes that BIDS will address Norvig’s political concern through its research. It is certainly on the minds of some of its co-PIs, as later discussion would show. But to address the problem seriously, it will have to look at it rigorously, in a way that doesn’t shy away from criticism of the status quo.

The next speaker, Tim O’Reilly, is a figure who fascinates me. Culler introduced him as a “God of the Open Source Field,” which is poetically accurate. Before coming to academia, I worked on Web 2.0 open source software platforms for open government. My career was defined by a string of terms invented and popularized by O’Reilly, and to a large extent I’m still a devotee of his ideas. But as a practitioner and researcher, I’ve developed a nuanced view of the field that I’ve tried to convey in the course on Open Collaboration and Peer Production I’ve co-instructed with Thomas Maillart this semester.

O’Reilly came under criticism earlier this year from Evgeny Morozov, who attacked him for marketing politically unctuous ideas while claiming to be revolutionary. Morozov focuses on O’Reilly’s promotion of ‘open source’ over and against Richard Stallman’s explicitly ethical, and therefore contentious, term ‘free software’. Morozov accuses O’Reilly of what Tom Scocca has recently defined as rhetorical smarm–dodging specific criticism by denying the appropriateness of criticism in general. O’Reilly has disputed the Morozov piece. Elsewhere he has presented his strategy as that of a ‘marketer of big ideas’, deliberately promoting the more business-friendly ‘open source’ rhetoric. This ideological debate is itself quite interesting. Geek anthropologist Chris Kelty observes that it is participation in this debate, more so than adherence to any particular view within it, that characterizes the larger “movement,” which he names the recursive public.

Given his significance to someone with my open source software background, I was originally surprised to hear that Tim O’Reilly would be speaking at the BIDS launch. O’Reilly had promoted ‘open source’ and ‘Web 2.0’ and ‘open government’, but what did any of that have to do with ‘data science’?

So I was amused when Norvig introduced O’Reilly by saying that he hadn’t known he was a data scientist until O’Reilly wrote an article in Forbes (in November 2011) naming him one of “The World’s 7 Most Powerful Data Scientists.” Looking at the Google Trends data, we can see that November 2011 just about marks the rise of ‘data science’ from obscurity to popularity. Is Tim O’Reilly responsible for the rise of ‘data science’?

Perhaps. O’Reilly explained that he got into data science by thinking about the end game for open source. As open source software becomes commodified (which for him I think means something like ‘subject to competitive market pressure’), what becomes valuable is the data. And so he has been promoting data science in industry and government, and believes that the university can learn important lessons from those fields as well. He held up his Moto X phone and explained how it is ‘always listening’ and so can facilitate services like Google Now. All this would go towards a system with greater collective intelligence, a self-regulating system that would make regulators obsolete.

Looking at the progression of the use of maps, from paper to digital to being embedded in services and products like self-driving cars, O’Reilly agrees with Norvig about the importance of human-machine interaction. In particular, he believes that data scientists will need to know how to ask the right questions about data, and that this is the future of science. “Others will be left behind,” he said, not intending to sound foreboding.

I thought O’Reilly presented the combination of insight and boosterism I expected. To me, his presence at the BIDS launch meant that O’Reilly’s significance as a public intellectual has progressed from business through governance and now to scientific thinking itself. This is wonderful for him, but it means that his writings and influence should be put under the scrutiny we would apply to an academic peer. It is appropriate to call him out for glossing over the privacy issues around a mobile phone that is “always listening,” or the moral implications, for equality and justice, of the obsolescence of regulators. Is his objectivity compromised by the fact that he runs a publishing company that sells complementary goods to the vast supply of publicly available software and data? Does his business agenda incentivize him to obscure the subtle differences between various segments of his market? Are we in the university victims of that obscurity as we grapple with multiple conflated meanings of “openness” in software and science (open to scrutiny and accountability, vs. open for appropriation by business, vs. open to meritocratic contribution)? As we ask these questions, we can be grateful to O’Reilly for getting us this far.

I’ve emphasized the talks given by Norvig and O’Reilly because they exposed what I think are some of the most interesting aspects of BIDS. One way or another, it will be revolutionary. Its funders will be very disappointed if it is not. But exactly how it will be revolutionary is undetermined. The fact that BIDS is based at Berkeley, and not at Google or Microsoft or Stanford, guarantees that the revolution will not be an insipid or smarmy one that brushes aside political conflict or morality. Rather, it promises to be a site of fecund political conflict. “Let the fun begin!” said Chandler.

The opening remarks concluded and we broke for lunch and poster sessions–the Data Science Faire (named after O’Reilly’s Maker Faire).

What followed was a fascinating panel discussion moderated by astrophysicist Josh Bloom, with historian and university administrator Cathryn Carson, computer science professor and AMP Lab director Michael Franklin, and Deb Agrawal, a staff computer scientist at Lawrence Berkeley National Lab.

Bloom introduced the discussion jokingly as “just being among us scientists…and whoever is watching out there on the Internet,” perhaps nodding to the fact that the scientific community is not yet fully conscious that their expectations of privileged communication are being challenged by a world and culture of mobile devices that are “always listening.”

The conversation was about the role of people in data science.

Carson spoke as a domain scientist–a social scientist who studies scientists. Noting that social scientists tend to work in small teams led by graduate students motivated by their particular questions, she said her emphasis was on the people asking questions. Agrawal noted, practically, that the number of people needed to analyze a data set scales not with the size of the data but with its complexity. (I’d argue that theoretically we might want to measure the “size” of data by its compressibility, which would reflect its complexity–a sketch follows below. This ignores a number of operational challenges.) For Franklin, people are a computational resource that can be part of a crowd-sourced process. In that context, the number of people needed does indeed scale with the use of people as data processors and sensors.
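
To make the compressibility aside concrete, here is a toy illustration (my own hedged example; zlib is only a crude stand-in for an ideal, uncomputable measure of complexity): two byte strings of identical raw size, one highly structured and one random, compress very differently, so the “compressed size” tracks complexity rather than byte count.

```python
# Toy illustration: "size" of data measured by compressibility rather
# than raw byte count. The structured data shrinks dramatically; the
# random data barely compresses at all.
import os
import zlib

structured = b"abc" * 100_000   # 300,000 bytes of pure repetition
random_ = os.urandom(300_000)   # 300,000 bytes of incompressible noise

for name, data in [("structured", structured), ("random", random_)]:
    ratio = len(zlib.compress(data)) / len(data)
    print(f"{name}: {len(data)} bytes, compression ratio {ratio:.3f}")
```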

Perhaps to follow through on Norvig’s line of reasoning, Bloom then asked pointedly if machines would ever be able to do the asking of questions better than human beings. In effect: Would data science make data scientists obsolete?

Nobody wanted to be the first to answer this question. Bloom had to repeat it.

Agrawal took a first stab at it. The science does not come from the data; the scientist chooses models and tests them. This is the work of people. Franklin agreed and elaborated–the wrong data too early can ruin the science. Agrawal noted that computers might find spurious signals in the noise.

Personally, I find these unconvincing answers to Bloom’s question. Algorithms can generate, compare, and test alternative models against the evidence. Noise can, with enough data, be filtered away from the signal. Doing so pushes the theoretical limits of computing and information theory, but if Franklin is correct in his earlier point that people are part of the computational process, then there is no reason in principle why these tasks too might not be performed by, or at least assisted by, computers.
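
As a small, hedged illustration of the first claim (my own toy example, not anything discussed on the panel): an algorithm can enumerate candidate models and score each against the evidence with a penalized likelihood, with no human choosing the model, and the complexity penalty is exactly what guards against the spurious signals Agrawal worries about.

```python
# Sketch: automated model comparison. Candidate models are polynomial
# fits of increasing degree; BIC (Bayesian Information Criterion)
# penalizes complexity so spurious structure in the noise loses out.
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 200)
y = 2.0 * x - 0.5 * x**2 + rng.normal(scale=1.0, size=x.size)  # true model: quadratic

def bic(degree):
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    n, k = y.size, degree + 1
    # Gaussian log-likelihood (up to a constant) plus complexity penalty
    return n * np.log(resid.var()) + k * np.log(n)

best = min(range(1, 8), key=bic)
print("degree selected by BIC:", best)  # should typically recover 2
```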

Carson, who had been holding back her answer to listen to the others, had a bolder proposal: rather than try to predict the future of science, why not focus on the task of building that future?

In another universe, at that moment someone might have asked the one question no computer could have answered. “If we are building the new future of science, what should we build? What should it look like? And how do we get there?” But this is the sort of question disciplined scientists are trained not to ask.

Instead, Bloom brought things back to practicality: we need to predict where science will go in order to know how to educate the next generation of scientists. Should we be focusing on teaching them domain knowledge, or on techniques?

We have at the heart of BIDS the very fundamental problem of free will. Bloom suggests that if we can predict the future, then we can train students in anticipation of it. He is an astrophysicist and studies stars; he can be forgiven for the assumption that bodies travel in robust orbits. This environment is a more complex one. How we choose to train students now will undoubtedly affect how science evolves, as the process of science is at once the process of learning and training new scientists. His descriptive question then falls back to the normative one: what science are we trying to build toward?

Carson was less heavy-handed than I would have been in her position. Instead, she asked Bloom how he got interested in data science. Bloom recalled his classical physics training, and the moment he discovered that to answer the kinds of questions he was asking, he would need new methods.

Franklin chimed in on the subject of education. He has heard it said that everyone in the next generation should learn to code. With marked humility for his discipline, he said he did not agree with this. But he said he did believe that everyone in the next generation should learn data literacy, echoing Norvig.

Bloom opened the discussion to questions from the audience.

The first was about the career paths for methodologists who write software instead of papers. How would BIDS serve them? It was a softball question, which the panel hit out of the park. Bloom noted that the Moore and Sloan funders explicitly asked for the development of alternative metrics to measure the impact of methodologists’ contributions. Carson said that even with the development of metrics, as an administrator she knew it would be a long march through the institution to get those metrics recognized. There was much work to be done. “Universities got to change,” she rallied. “If we don’t change, Berkeley’s being great in the past won’t make it great in the future,” referring perhaps to the impressive history of research recounted by Chancellor Dirks. There was applause. Franklin pointed out that the open source community has its own metrics already. In some circles, some of his students are more famous than he is for developing widely used software. Investors are often asking him when his students will graduate. The future, it seems, is bright for methodologists.

At this point I lost my Internet connection and had to stop livetweeting the panel; those tweets are the notes from which I am writing these reflections. Recalling from memory, there was one more question from Kristina Kangas, a PhD student in Integrative Biology. She cited research about how researchers interpreting data wind up reflecting back their own biases. What did this mean for data science?

Bloom gave Carson the last word. It is a social scientific fact, she said, that scientists interpret data in ways that fit their own views. So it’s possible that there is no such thing as “data literacy”. These are open questions that will need to be settled by debate. Indeed, what then is data science after all? Turning to Bloom, she said, “I told you I would be making trouble.”
