Digifesto

Category: open source software

moved BigBang core repository to DATACTIVE organization

I made a small change this evening which I feel really, really good about.

I transferred the BigBang project from my personal GitHub account to the datactive organization.

I’m very grateful for DATACTIVE’s interest in BigBang and am excited to turn over the project infrastructure to their stewardship.

trust issues and the order of law and technology cf @FrankPasquale

I’ve cut to the last chapter of Pasquale’s The Black Box Society, “Towards an Intelligible Society.” I’m interested in where the argument goes. Now that I’ve gotten through it, I see that the penultimate chapter has Pasquale’s specific policy recommendations. But as I’m not just reading for policy and framing but also for tone and underlying theoretical commitments, I think it’s worth recording some first impressions before doubling back.

These are some points Pasquale makes in the concluding chapter that I wholeheartedly agree with:

  • A universal basic income would allow more people to engage in high risk activities such as the arts and entrepreneurship and more generally would be great for most people.
  • There should be publicly funded options for finance, search, and information services. A great way to provide these would be to fund the development of open source algorithms for finance and search. I’ve been into this idea for so long and it’s great to see a prominent scholar like Pasquale come to its defense.
  • Regulatory capture (or, as he elaborates following Charles Lindblom, “regulatory circularity”) is a problem. Revolving door participation in government and business makes government regulation an unreliable protector of the public interest.

There is quite a bit in the conclusion about the specifics of regulating the finance industry. There is an impressive amount of knowledge presented about this and I’ll admit much of it is over my head. I’ll probably have a better sense of it if I get to reading the chapter that is specifically about finance.

There are some things that I found bewildering or off-putting.

For example, there is a section on “Restoring Trust” that talks about how an important problem is that we don’t have enough trust in the reputation and search industries. His solution is to increase the penalties that the FTC and FCC can impose on Google and Facebook for, e.g., their privacy violations. The current penalties are too trivial to be an effective deterrent. But, Pasquale argues,

It is a broken enforcement model, and we have black boxes to thank for much of this. People can’t be outraged by what they can’t understand. And without some public concern about the trivial level of penalties for lawbreaking here, there are no consequences for the politicians ultimately responsible for them.

The logic here is a little mad. Pasquale is saying that people are not outraged enough by search and reputation companies to demand harsher penalties, and this is a problem because people don’t trust these companies enough. The solution is to convince people to trust these companies less–get outraged by them–in order to get them to punish the companies more.

This is a bit troubling, but makes sense based on Pasquale’s theory of regulatory circularity, which turns politics into a tug-of-war between interests:

The dynamic of circularity teaches us that there is no stable static equilibrium to be achieved between regulators and regulated. The government is either pushing industry to realize some public values in its activities (say, by respecting privacy or investing in sustainable growth), or industry is pushing regulators to promote its own interests.

There’s a simplicity to this that I distrust. It suggests for one that there are no public pressures on industry besides the government, such as consumers’ buying power. A lot of Pasquale’s arguments depend on the monopolistic power of certain tech giants. But while network effects are strong, it’s not clear whether this is such a problem that consumers have no market buy in. In many cases tech giants compete with each other even when it looks like they aren’t. For example, a great many people have both Facebook and Gmail accounts. Since there is somewhat redundant functionality in both, consumers can rather seamlessly allocate their time, which is tied to advertising revenue, according to which service they feel better serves them, or which is best reputationally. So social media (which is a bit like a combination of a search and reputation service) is not a monopoly. Similarly, if people have multiple search options available to them because, say, they have Siri on their smart phone and can also search Google directly, then that provides an alternative search market.

Meanwhile, government officials are also often self-interested. If there is a road to hell for industry that is to provide free web services to people to attain massive scale, then abuse economic lock-in to extract value from customers, then lobby for further rent-seeking, there is a similar road to hell in government. It starts with populist demagoguery, leads to stable government appointment, and then leverages that power for rents in status.

So, power is power. Everybody tries to get power. The question is what you do once you get it, right?

Perhaps I’m reading between the lines too much. Of course, my evaluation of the book should depend most on the concrete policy recommendations which I haven’t gotten to yet. But I find it unfortunate that what seems to be a lot of perfectly sound history and policy analysis is wrapped in a politics of professional identity that I find very counterproductive. The last paragraph of the book is:

Black box services are often wondrous to behold, but our black-box society has become dangerously unstable, unfair, and unproductive. Neither New York quants nor California engineers can deliver a sound economy or a secure society. Those are the tasks of a citizenry, which can perform its job only as well as it understands the stakes.

Implicitly, New York quants and California engineers are not citizens, to Pasquale, a law professor based in Maryland. Do all real citizens live around Washington, DC? Are they all lawyers? If the government were to start providing public information services, either by hosting them themselves or by funding open source alternatives, would he want everyone designing these open algorithms (who would be quants or engineers, I presume) to move to DC? Do citizens really need to understand the stakes in order to get this to happen? When have citizens, en masse, understood anything, really?

Based on what I’ve read so far, The Black Box Society is an expression of a lack of trust in the social and economic power associated with quantification and computing that took off in the past few dot-com booms. Since expressions of lack of trust for these industries are nothing new, one might wonder (under the influence of Foucault) how the quantified order and the critique of the quantified order manage to coexist and recreate a system of discipline that includes both and maintains its power as a complex of superficially agonistic forces. I give sincere credit to Pasquale for advocating both serious income redistribution and public investment in open technology as ways of disrupting that order. But when he falls into the trap of engendering partisan distrust, he loses my confidence.

Innovation, automation, and inequality

What is the economic relationship between innovation, automation, and inequality?

This is a recurring topic in the discussion of technology and the economy. It comes up when people are worried about a new innovation (such as data science) that threatens their livelihood. It also comes up in discussions of inequality, such as in Piketty’s Capital in the Twenty-First Century.

For technological pessimists, innovation implies automation, and automation suggests the transfer of surplus from many service providers to a technological monopolist providing a substitute service at greater scale (scale being one of the primary benefits of automation).

For Piketty, it’s the spread of innovation, in the sense of the education of skilled labor, that is the primary force counteracting capitalism’s tendency towards inequality and (he suggests) the implied instability. For all the importance Piketty places on this process, he treats it hardly at all in his book.

Whether or not you buy Piketty’s analysis, the preceding discussion indicates how innovation can cut both for and against inequality. When there is innovation in capital goods, this increases inequality. When there is innovation in a kind of skilled technique that can be broadly taught, that decreases inequality by increasing the relative value of labor to capital (which is generally much more concentrated than labor).

I’m a software engineer in the Bay Area and realize that it’s easy to overestimate the importance of software in the economy at large. This is apparently an easy mistake for other people to make as well. Matthew Rognlie, the economist who has been declared Piketty’s latest and greatest challenger, thinks that software is an important new form of capital and draws certain conclusions based on this.

I agree that software is an important form of capital–exactly how important I cannot yet say. One reason why software is an especially interesting kind of capital is that it exists ambiguously as both a capital good and as a skilled technique. While naively one can consider software as an artifact in isolation from its social environment, in the dynamic information economy a piece of software is only as good as the sociotechnical system in which it is embedded. Hence, its value depends both on its affordances as a capital good and its role as an extension of labor technique. It is perhaps easiest to see the latter aspect of software by considering it a form of extended cognition on the part of the software developer. The human capital required to understand, reproduce, and maintain the software is attained by, for example, studying its source code and documentation.

All software is a form of innovation. All software automates something. There has been a lot written about the potential effects of software on inequality through its function in decision-making (for example: Solon Barocas and Andrew D. Selbst, “Big Data’s Disparate Impact”). Much less has been said about the effects of software on inequality through its effects on industrial organization and the labor market. After having my antennas up for this for many reasons, I’ve come to a conclusion about why: it’s because the intersection between those who are concerned about inequality in society and those who can identify well enough with software engineers and other skilled laborers is quite small. As a result there is not a ready audience for this kind of analysis.

However unreceptive society may be to it, I think it’s still worth making the point that we already have a very common and robust compromise in the technology industry that recognizes software’s dual role as a capital good and labor technique. This compromise is open source software. Open source software can exist both as an unalienated extension of its developer’s cognition and as a capital good playing a role in a production process. Human capital tied to the software is liquid between the software’s users. Surplus due to open software innovations goes first to the software users, then second to the ecosystem of developers who sell services around it. Contrast this with the proprietary case, where surplus goes mainly to a singular entity that owns and sells the software rights as a monopolist. The former case is vastly better if one considers societal equality a positive outcome.

This has straightforward policy implications. As an alternative to Piketty’s proposed tax on capital, any policies that encourage open source software are ones that combat societal inequality. This includes procurement policies, which need not increase government spending. On the contrary, if governments procure primarily open software, that should lead to savings over time as their investment leads to a more competitive market for services. Similarly, R&D funding to open science institutions results in more income equality than equivalent funding provided to private companies.

Hirschman, Nigerian railroads, and poor open source user interfaces

Hirschman says he got the idea for Exit, Voice, and Loyalty when studying the failure of the Nigerian railroad system to improve quality despite the availability of trucking as a substitute for long-range shipping. Conventional wisdom among economists at the time was that the quality of a good would suffer when it was provisioned by a monopoly. But why would a business that faced healthy competition not undergo the management changes needed to improve quality?

Hirschman’s answer is that because the trucking option was so readily available as an alternative, there wasn’t a need for consumers to develop their capacity for voice. The railroads weren’t hearing the complaints about their service, they were just seeing a decline in use as their customers exited. Meanwhile, because it was a monopoly, loss in revenue wasn’t “of utmost gravity” to the railway managers either.

The upshot of this is that it’s only when customers are locked in that voice plays a critical role in the recuperation mechanism.

This is interesting for me because I’m interested in the role of lock-in in software development. In particular, one argument made in favor of open source software is that because it is not technology held by a single firm, users of the software are not locked in. Their switching costs are reduced, making the market more liquid and, in theory, favorable.

You can contrast this with proprietary enterprise software, where vendor lock-in is a principal part of the business model, as this establishes the “installed base”, and customer support armies are necessary for managing disgruntled customer voice. Or, in the case of social media such as Facebook, network effects create a kind of perceived consumer lock-in and consumer voice gets articulated by everybody from Twitter activists to journalists to high-profile academics.

As much as it pains me to admit it, this is one good explanation for why the user interfaces of a lot of open source software projects are so bad, specifically if you combine this mechanism with the idea that user-centered design is important for user interfaces. Open source projects generally make it easy to complain about the software. If they know what they are doing at all, they make it clear how to engage the developers as a user. There is a kind of rumor out there that open source developers are unfriendly towards users, and this is perhaps true when users are used to the kind of customer support that’s available on a product for which there is customer lock-in. It’s precisely this difference between exit culture and voice culture, driven by the fundamental economics of the industry, that creates this perception. Enterprise open source business models (I’m thinking about models like the Pentaho ‘beekeeper’) theoretically provide a corrective to this by being an intermediary between consumer voice and developer exit.

A testable hypothesis is whether and to what extent a software project’s responsiveness to tickets scales with the number of downstream dependent projects. In software development, technical architecture is a reasonable proxy for industrial organization. A widely used project has network effects that increase switching costs for its downstream users. How do exit and voice work in this context?

The node.js fork — something new to think about

For Classics we are reading Albert Hirschman’s Exit, Voice, and Loyalty. Oddly, though normally I hear about ‘voice’ as an action from within an organization, the first few chapters of the book (including the introduction of the Voice concept itself) are preoccupied with elaborations on the neoclassical market mechanism. Not what I expected.

I’m looking for interesting research use cases for BigBang, which is about analyzing the sociotechnical dynamics of collaboration. I’m building it to better understand open source software development communities, primarily. This is because I want to create a harmonious sociotechnical superintelligence to take over the world.

For a while I’ve been interested in Hadoop’s interesting case of being one software project with two companies working together to build it. This is reminiscent (for me) of when we started GeoExt at OpenGeo and Camp2Camp. The economics of shared capital are fascinating and there are interesting questions about how human resources get organized in that sort of situation. In my experience, a tension develops between the needs of firms to differentiate their products and make good on their contracts and the needs of the developer community whose collective value is ultimately tied to the robustness of their technology.

Unfortunately, building out BigBang to integrate with various email, version control, and issue tracking backends is a lot of work and there’s only one of me right now to build the infrastructure, do the research, and train new collaborators (who are starting to do some awesome work, so this is paying off). While integrating with Apache’s infrastructure would have been a smart first move, instead I chose to focus on Mailman archives and git repositories. Google Groups and whatever Apache is using for their email lists do not publish their archives in .mbox format, which is a pain for me. But luckily Google Takeout does export data from folks’ online inboxes in .mbox format. This is great for BigBang because it means we can investigate email data from any project for which we know an insider willing to share their records.
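To make the appeal of .mbox concrete, here is a minimal sketch of pulling header fields out of an archive using only Python’s standard library. This is not BigBang’s actual code; the function name `read_mbox_headers` and the choice of fields are assumptions made for illustration.

```python
import mailbox
from email.utils import parseaddr, parsedate_to_datetime


def read_mbox_headers(path):
    """Return one dict of basic header fields per message in an mbox file.

    Hypothetical helper for illustration; the field selection is an assumption.
    """
    rows = []
    # create=False: fail loudly if the file is missing rather than creating it.
    box = mailbox.mbox(path, create=False)
    try:
        for message in box:
            date_header = message["Date"]
            rows.append({
                "message_id": message["Message-ID"],
                # In-Reply-To is absent (None) for messages that start a thread.
                "in_reply_to": message["In-Reply-To"],
                # Keep only the address part of "Name <address>".
                "sender": parseaddr(message["From"] or "")[1],
                "date": parsedate_to_datetime(date_header) if date_header else None,
                "subject": message["Subject"],
            })
    finally:
        box.close()
    return rows
```

A list of dicts like this can be handed straight to `pandas.DataFrame(rows)`, and the Message-ID and In-Reply-To fields are enough to start reconstructing who replied to whom.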

Does a research ethics issue arise when you start working with email that is openly archived in a difficult format, then exported from somebody’s private email? Technically you get header information that wasn’t open before–perhaps it was ‘private’. But arguably this header information isn’t personal information. I think I’m still in the clear. Plus, IRB will be irrelevant when the robots take over.

All of this is a long way of getting around to talking about a new thing I’m wondering about, the Node.js fork. It’s interesting to think about open source software forks in light of Hirschman’s concepts of Exit and Voice since so much of the activity of open source development is open, virtual communication. While you might at first think a software fork is definitely a kind of Exit, it sounds like IO.js was perhaps a friendly fork of just somebody who wanted to hack around. In theory, code can be shared between forks–in fact this was the principle that GitHub’s forking system was founded on. So there are open questions (to me, who isn’t involved in the Node.js community at all and is just now beginning to wonder about it) along the lines of to what extent a fork is a real event in the history of the project, vs. to what extent it’s mythological, vs. to what extent it’s a reification of something that was already implicit in the project’s sociotechnical structure. There are probably other great questions here as well.

A friend on the inside tells me all the action on this happened (is happening?) on the GitHub issue tracker, which is definitely data we want to get BigBang connected with. Blissfully, there appear to be well supported Python libraries for working with the GitHub API. I expect the first big hurdle we hit here will be rate limiting.
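As a rough sketch of how one might cope with that rate limiting: GitHub’s REST API reports the remaining quota and the reset time (in UTC epoch seconds) in the `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers. The code below assumes a hypothetical `fetch_page(page)` callable that returns a list of items plus the response headers; how that callable talks to GitHub is left open, and none of this is BigBang code.

```python
import time


def seconds_until_reset(headers, now=None):
    """Seconds to pause before the next request, based on rate-limit headers."""
    now = time.time() if now is None else now
    # Header names are case-insensitive on the wire, so normalise the keys.
    lowered = {key.lower(): value for key, value in headers.items()}
    remaining = int(lowered.get("x-ratelimit-remaining", 1))
    if remaining > 0:
        return 0.0
    reset_at = int(lowered.get("x-ratelimit-reset", now))
    return max(0.0, reset_at - now)


def fetch_all(fetch_page, sleep=time.sleep):
    """Collect items page by page, pausing whenever the quota is exhausted.

    fetch_page(page) -> (items, headers) is a hypothetical callable; an empty
    list of items is taken to mean there are no more pages.
    """
    items, page = [], 1
    while True:
        batch, headers = fetch_page(page)
        if not batch:
            return items
        items.extend(batch)
        wait = seconds_until_reset(headers)
        if wait > 0:
            sleep(wait)
        page += 1
```

Passing `sleep` in as a parameter keeps the loop testable without actually waiting out a rate-limit window.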

Though we haven’t been able to make integration work yet, I’m still hoping there’s some way we can work with MetricsGrimoire. They’ve been a super inviting community so far. But our software stacks and architecture are just different enough, and the layers we’ve built so far thin enough, that it’s hard to see how to do the merge. A major difference is that while MetricsGrimoire tools are built to provide application interfaces around a MySQL data backend, since BigBang is foremost about scientific analysis our whole data pipeline is built to get things into Pandas dataframes. Both projects are in Python. This too is a weird microcosm of the larger sociotechnical ecosystem of software production, of which the “open” side is only one (important) part.

Imre Lakatos and programming as dialectic

My dissertation is about the role of software in scholarly communication. Specifically, I’m interested in the way software code is itself a kind of scholarly communication, and how the informal communications around software production represent and constitute communities of scientists. I see science as a cognitive task accomplished by the sociotechnical system of science, including both scientists and their infrastructure. Looking particularly at scientists’ use of communications infrastructure such as email, issue trackers, and version control, I hope to study the mechanisms of the scientific process much like a neuroscientist studies the mechanisms of the mind by studying neural architecture and brainwave activity.

To get a grip on this problem I’ve been building BigBang, a tool for collecting data from open source projects and readying it for scientific analysis.

I have also been reading background literature to give my dissertation work theoretical heft and to procrastinate from coding. This is why I have been reading Imre Lakatos’ Proofs and Refutations (1976).

Proofs and Refutations is a brilliantly written book about the history of mathematical proof. In particular, it is an analysis of informal mathematics through an investigation of the letters written by mathematicians working on proofs about the Euler characteristic of polyhedra in the 18th and 19th centuries.

Whereas in the early 20th century, based on the work of Russell and Whitehead and others, formal logic was axiomatized, prior to this mathematical argumentation had less formal grounding. As a result, mathematicians would argue not just substantively about the theorem they were trying to prove or disprove, but also about what constitutes a proof, a conjecture, or a theorem in the first place. Lakatos demonstrates this by condensing 200+ years of scholarly communication into a fictional, impassioned classroom dialog where characters representing mathematicians throughout history banter about polyhedra and proof techniques.

What’s fascinating is how convincingly Lakatos presents the progress of mathematical understanding as an example of dialectical logic. Though he doesn’t use the word “dialectical” as far as I’m aware, he tells the story of the informal logic of pre-Russellian mathematics through dialog. But this dialog is designed to capture the timeless logic behind what’s been said before. It takes the reader through the thought process of mathematical discovery in abbreviated form.

I’ve had conversations with serious historians and ethnographers of science who would object strongly to the idea of a history of a scientific discipline reflecting a “timeless logic”. Historians are apt to think that nothing is timeless. I’m inclined to think that the objectivity of logic persists over time much the same way that it persists over space and between subjects, even illogical ones, hence its power. These are perhaps theological questions.

What I’d like to argue (but am not sure how) is that the process of informal mathematics presented by Lakatos is strikingly similar to that used by software engineers. The process of selecting a conjecture, then of writing a proof (which for Lakatos is a logical argument whether or not it is sound or valid), then having it critiqued with counterexamples, which may either be global (counter to the original conjecture) or local (counter to a lemma), then modifying the proof, then perhaps starting from scratch based on a new insight… all this reads uncannily like the process of debugging source code.

The argument for this correspondence is strengthened by later work in theory of computation and complexity theory. I learned this theory so long ago I forget who to attribute it to, but much of the foundational work in computer science was the establishment of a correspondence between classes of formal logic and classes of programming languages. So in a sense it’s uncontroversial within computer science to consider programs to be proofs.

As I write I am unsure whether I’m simply restating what’s obvious to computer scientists in an antiquated philosophical language (a danger I feel every time I read a book, lately) or if I’m capturing something that could be an interesting synthesis. But my point is this: that if programming language design and the construction of progressively more powerful software libraries is akin to the expanding of formal mathematical knowledge from axiomatic grounds, then the act of programming itself is much more like the informal mathematics of pre-Russellian mathematics. Specifically, in that it is unaxiomatic and proofs are in play without necessarily being sound. When we use a software system, we are depending necessarily on a system of imperfected proofs that we fix iteratively through discovered counterexamples (bugs).
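A toy rendering of the analogy, using the conjecture at the heart of Proofs and Refutations: encode “V − E + F = 2 for every polyhedron” as a check, then run it against solids until a counterexample turns up, much as a failing test exposes a bug. The function names are mine and purely illustrative; the vertex, edge, and face counts are the standard ones, including Lakatos’s picture frame and the cube with a cubic hollow inside it.

```python
def euler_characteristic(vertices, edges, faces):
    """V - E + F for a solid given its vertex, edge, and face counts."""
    return vertices - edges + faces


def conjecture_holds(counts):
    """The naive conjecture: V - E + F = 2 for every polyhedron."""
    return euler_characteristic(*counts) == 2


# (vertices, edges, faces) for each solid.
SOLIDS = {
    "tetrahedron": (4, 6, 4),
    "cube": (8, 12, 6),
    "octahedron": (6, 12, 8),
    "picture frame": (16, 32, 16),
    "cube with a cubic hollow": (16, 24, 12),
}


def counterexamples(solids):
    """Map each solid that refutes the conjecture to its actual V - E + F."""
    return {
        name: euler_characteristic(*counts)
        for name, counts in solids.items()
        if not conjecture_holds(counts)
    }


print(counterexamples(SOLIDS))
# {'picture frame': 0, 'cube with a cubic hollow': 4}
```

What happens next is the interesting part. One can patch the conjecture (restrict it to solids without tunnels or cavities), redefine “polyhedron” so the monsters no longer count, or rethink the proof. Those are roughly the same moves a programmer chooses between when a bug report arrives.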

Is it fair to say, then, that whereas the logic of software is formal, deductive logic, the logic of programming is dialectical logic?

Bear with me; let’s presume it is. That’s a foundational idea of my dissertation work. Proving or disproving it may or may not be out of scope of the dissertation itself, but it’s where it’s ultimately headed.

The question is whether it is possible to develop a formal understanding of dialectical logic through a scientific analysis of software collaboration (see a mathematical model of collective creativity). If this could be done, then we could build better software or protocols to assist this dialectical process.

technical work

Dipping into Julian Orr’s Talking about Machines, an ethnography of Xerox photocopier technicians, has set off some light bulbs for me.

First, there’s Orr’s story: Orr dropped out of college and got drafted, then worked as a technician in the military before returning to school. He paid the bills doing technical repair work, and found it convenient to do his dissertation on those doing photocopy repair.

Orr’s story reminds me of my grandfather and great-uncle, both of whom were technicians–radio operators–during WWII. Their civilian careers were as carpenters, building houses.

My own dissertation research is motivated by my work background as an open source engineer, and my own desire to maintain and improve my technical chops. I’d like to learn to be a data scientist; I’m also studying data scientists at work.

Further fascinating was Orr’s discussion of the Xerox technicians’ identity as technicians as opposed to customers:

The distinction between technician and customer is a critical division of this population, but for technicians at work, all nontechnicians are in some category of other, including the corporation that employs the technicians, which is seen as alien, distant, and only sometimes an ally.

It’s interesting to read about this distinction between technicians and others in the context of Xerox photocopiers when I’ve been so affected lately by the distinction between tech folk and others and data scientists and others. This distinction between those who do technical work and those who they serve is a deep historical one that transcends the contemporary and over-computed world.

I recall my earlier work experience. I was a decent engineer and engineering project manager. I was a horrible account manager. My customer service skills were abysmal, because I did not empathize with the client. The open source context contributes to this attitude, because it makes a different set of demands on its users than consumer technology does. One gets assistance with consumer grade technology by hiring a technician who treats you as a customer. You get assistance with open source technology by joining the community of practice as a technician. Commercial open source software, according to the Pentaho beekeeper model, is about providing, at cost, that customer support.

I’ve been thinking about customer service and reflecting on my failures at it a lot lately. It keeps coming up. Mary Gray’s piece, When Science, Customer Service, and Human Subjects Research Collide, explicitly makes the connection between commercial data science at Facebook and customer service. The ugly dispute between Gratipay (formerly Gittip) and Shanley Kane was, I realized after the fact, a similar crisis between the expectations of customers/customer service people and the expectations of open source communities. When “free” (gratis) web services display a similar disregard for their users as open source communities do, it’s harder to justify in the same way that FOSS does. But there are similar tensions, perhaps. It’s hard for technicians to empathize with non-technicians about their technical problems, because their lived experience is so different.

It’s alarming how much is being hinged on the professional distinction between technical worker and non-technical worker. The intra-technology industry debates are thick with confusions along these lines. What about marketing people in the tech context? Sales? Are the “tech folks” responsible for distributional justice today? Are they in the throes of an ideology? I was reading a paper the other day suggesting that software engineers should be held ethically accountable for the implicit moral implications of their algorithms. Specifically the engineers; for some reason not the designers or product managers or corporate shareholders, who were not mentioned. An interesting proposal.

Meanwhile, at the D-Lab, where I work, I’m in the process of navigating my relationship between two teams, the Technical Team and the Services Team. I have been on the Technical Team in the past. Our work has been to stay on top of and assist people with data science software and infrastructure. Early on, we abolished regular meetings as a waste of time. Naturally, there was a suspicion expressed to me at one point that we were unaccountable and didn’t do as much work as others on the Services Team, which dealt directly with the people-facing component of the lab–scheduling workshops, managing the undergraduate work-study staff. Sitting in on Services meetings for the first time this semester, I’ve been struck by how much work the other team does. By and large, it’s information work: calendaring, scheduling, entering into spreadsheets, documenting processes in case of turnover, sending emails out, responding to emails. All important work.

This is exactly the work that information technicians want to automate away. If there is a way to reduce the amount of calendaring and entering into spreadsheets, programmers will find a way. The whole purpose of computer science is to automate tasks that would otherwise be tedious.

Eric S. Raymond’s classic (2001) essay How to Become a Hacker characterizes the Hacker Attitude, in five points:

  1. The world is full of fascinating problems waiting to be solved.
  2. No problem should ever have to be solved twice.
  3. Boredom and drudgery are evil.
  4. Freedom is good.
  5. Attitude is no substitute for competence.

There is no better articulation of the “ideology” of “tech folks” than this, in my opinion, yet Raymond is not used much as a source for understanding the idiosyncrasies of the technical industry today. Of course, not all “hackers” are well characterized by Raymond (I’m reminded of Coleman’s injunction to speak of “cultures of hacking”) and not all software engineers are hackers. (I’m sure my sister, a software engineer, is not a hacker. For example, based on my conversations with her, it’s clear that she does not see all the unsolved problems with the world to be intrinsically fascinating. Rather, she finds problems that pertain to some human interest, like children’s education, to be most motivating. I have no doubt that she is a much better software engineer than I am–she has worked full time at it for many years and now works for a top tech company. As somebody closer to the Raymond Hacker ethic, I recognize that my own attitude is no substitute for that competence, and hold my sister’s abilities in very high esteem.)

As usual, I appear to have forgotten where I was going with this.

Reflections on the Berkeley Institute for Data Science (BIDS) Launch

Last week was the launch of the Berkeley Institute for Data Science.

Whatever might actually happen as a result of the launch, what was said at the launch was epic.

Vice Chancellor for Research Graham Fleming introduced Chancellor Nicholas Dirks for the welcoming remarks. Dirks is UC Berkeley’s 10th Chancellor. He succeeded Robert Birgeneau, who resigned gracefully shortly after coming under heavy criticism for his handling of Occupy Cal, the Berkeley campus’ chapter of the Occupy movement. Birgeneau was distinctly unsympathetic to the protesters, and there was a widely circulated petition declaring a lack of confidence in his leadership. Birgeneau is a physicist. Dirks is an anthropologist who has championed postcolonial approaches. Within the politics of the university, which are a microcosm of politics at large, this signalling is clear. Dirks’ appointment was meant to satisfy the left wing protesters, most of whom have been trained in softer social sciences themselves. Critical reflection on power dynamics and engagement in activism–which is often associated with leftist politics–are, at least formally, accepted by the university administration as legitimate. Birgeneau would subsequently receive awards for his leadership in drawing more women into the sciences and aiding undocumented students.

Dirks’ welcoming remarks were about the great accomplishments of UC Berkeley as a research institution and the vague but extraordinary potential of BIDS. He is grateful, as we all are, for the funding from the Moore and Sloan foundations. I found his remarks unspecific, and I couldn’t help but wonder what his true thoughts were about data science in the university. Surely he must have an opinion. As an anthropologist, can he consistently believe that data science, especially in the social sciences, is the future?

Vicki Chandler, Chief Program Officer from the Moore Foundation, was more lively. Pulling no punches, she explained that the purpose of BIDS is to shake up scientific culture. Having hung out in Berkeley in the 60’s and attended it as an undergraduate in the 70’s, she believes we are up for it. She spoke again and again of “revolution”. There is ambiguity in this. In my experience, faculty are divided on whether they see the proposed “open science” changes as imminent or hype, as desirable or dangerous. More and more I see faculty acknowledge that we are witnessing the collapse of the ivory tower. It is possible that the BIDS launch is a tipping point. What next? “Let the fun begin!” concluded Chandler.

Saul Perlmutter, Nobel laureate physicist and front man of the BIDS co-PI super group, gave his now practiced and condensed pitch for the new Institute. He hit all the high points, pointing out not only the potential of data science but the importance of changing the institutions themselves. Rethinking the peer-review journal from scratch, he said, we should focus more on code reuse. Software can be a valid research output. As much as open science is popular among the new generation of scientists, this is a bold statement for somebody with such credibility within the university. He even said that the success of open source software is what gives us hope for the revolutionary new kind of science BIDS is beginning. Two years ago, this was a fringe idea. Perlmutter may have just made it mainstream.

Notably, he also engaged with the touchy academic politics, saying that data science could bring diversity to the sciences (though he was unspecific about the mechanism for this). He expounded on the important role of ethnography in evaluating the Institute to identify the bottlenecks to unlocking its potential.

The man has won at physics and is undoubtedly a scientist par excellence. Perhaps Perlmutter sees the next part of his legacy as the bringing of the university system into the 21st century.

David Culler, Chair of the Electrical Engineering and Computer Science department, then introduced a number of academic scientists, each with impressive demonstrations about how data science could be applied to important problems like climate change and disaster reduction. Much of this research depends on using the proliferation of hand-held mobile devices as sensors. University science, I realized while watching this, is at its best when doing basic research about how to save humanity from nature or ourselves.

But for me the most interesting speakers in the first half of the launch were luminaries Peter Norvig and Tim O’Reilly, each a giant in his own right and a welcome guest to the university.

Culler introduced Norvig, Director of Research at Google, by crediting him as one of the inventors of the MOOC. I know his name mainly as a co-author of “Artificial Intelligence: A Modern Approach,” which I learned and taught from as an undergraduate. Amazingly, Norvig’s main message is about the economics of the digital economy. Marginal production is cheap, cost of communication is cheap, and this leads to an accumulation of wealth. Fifty percent of jobs are predicted to be automated away in the coming decades. He is worried about the 99%–freely using Occupy rhetoric. What will become of them? Norvig’s solution, perhaps stated tongue in cheek, is that everyone needs to become a data scientist. More concretely, he has high hopes for hybrid teams of people and machines, that all professions will become like this. By defining what academic data science looks like and training the next generation of researchers, BIDS will have a role in steering the balance of power between humanity and the machines–and the elite few who own them.

His remarks hit home. He touched on anxieties that are as old as the Industrial Revolution: is somebody getting immensely rich off of these transformations, but not me? What will my role be in this transformed reality? Will I find work? These are real problems and Norvig was brave to bring them up. The academics in the room were not immune from these anxieties either, as they watch the ivory tower crumble around them. This would come up again later in the day.

I admire him for bringing up the point, and I believe he is sincere. I’d heard him make the same points when he was on a panel with Neal Stephenson and Jaron Lanier a month or so earlier. Still, I can’t help but be critical of Norvig’s remarks. Is he covering his back? Many university professors see MOOCs themselves as threatening to their own careers. It is encouraging that he sees the importance of hybrid human/machine teams. But if the machines are built on Google infrastructure, doesn’t this contribute to the same inequality he laments, shifting power away from teachers to the 1% at Google? Or does he foresee a MOOC-based educational boom?

He did not raise the possibility that human/machine hybridity is already the status quo–that, for example, all information workers tap away at these machines and communicate with each other through a vast technical network. If he had acknowledged that we are all cyborgs already, he would have had to admit that hybrid teams of humans and machines are as much the cause of economic inequality as the solution to it. Indeed, this relationship between human labor and mechanical capital is precisely the same as the one that created economic inequality in the Industrial Revolution. When the capital is privately owned, the systems of hybrid human/machine productivity favor the owner of the machines.

I have high hopes that BIDS will address Norvig’s political concern through its research. It is certainly on the mind of some of its co-PIs, as later discussion would show. But to address the problem seriously, it will have to look at the problem in a rigorous way that doesn’t shy away from criticism of the status quo.

The next speaker, Tim O’Reilly, is a figure who fascinates me. Culler introduced him as a “God of the Open Source Field,” which is poetically accurate. Before coming to academia, I worked on Web 2.0 open source software platforms for open government. My career was defined by a string of terms invented and popularized by O’Reilly, and to a large extent I’m still a devotee of his ideas. But as a practitioner and researcher, I’ve developed a nuanced view of the field that I’ve tried to convey in the course on Open Collaboration and Peer Production I’ve co-instructed with Thomas Maillart this semester.

O’Reilly came under criticism earlier this year from Evgeny Morozov, who attacked him for marketing politically unctuous ideas while claiming to be revolutionary. Morozov focuses on O’Reilly’s promotion of ‘open source’ over and against Richard Stallman’s explicitly ethical and therefore contentious term ‘free software’. Morozov accuses O’Reilly of what Tom Scocca has recently defined as rhetorical smarm–dodging specific criticism by denying the appropriateness of criticism in general. O’Reilly has disputed the Morozov piece. Elsewhere he has presented his strategy as that of a ‘marketer of big ideas’, and described his deliberate promotion of the more business-friendly ‘open source’ rhetoric. This ideological debate is itself quite interesting. Geek anthropologist Chris Kelty observes that it is participation in this debate, more so than an adherence to any particular view in it, that characterizes the larger “movement,” which he names the recursive public.

Despite his significance to me, with an open source software background, I was originally surprised when I heard Tim O’Reilly would be speaking at the BIDS launch. O’Reilly had promoted ‘open source’ and ‘Web 2.0’ and ‘open government’, but what did that have to do with ‘data science’?

So I was amused when Norvig introduced O’Reilly by saying that he didn’t know he was a data scientist until the latter wrote an article in Forbes (in November 2011) naming him one of “The World’s 7 Most Powerful Data Scientists.” Looking at the Google Trends data, we can see that November 2011 just about marks the rise of ‘data science’ from obscurity to popularity. Is Tim O’Reilly responsible for the rise of ‘data science’?

Perhaps. O’Reilly explained that he got into data science by thinking about the end game for open source. As open source software becomes commodified (which for him I think means something like ‘subject to competitive market pressure’), what becomes valuable is the data. And so he has been promoting data science in industry and government, and believes that the university can learn important lessons from those fields as well. He held up his Moto X phone and explained how it is ‘always listening’ and so can facilitate services like Google Now. All this would go towards a system with greater collective intelligence, a self-regulating system that would make regulators obsolete.

Looking at the progression of the use of maps, from paper to digital to being embedded in services and products like self-driving cars, O’Reilly agrees with Norvig about the importance of human-machine interaction. In particular, he believes that data scientists will need to know how to ask the right questions about data, and that this is the future of science. “Others will be left behind,” he said, not intending to sound foreboding.

I thought O’Reilly presented the combination of insight and boosterism I expected. To me, his presence at the BIDS launch meant that O’Reilly’s significance as a public intellectual has progressed from business through governance and now to scientific thinking itself. This is wonderful for him but means that his writings and influence should be put under the scrutiny we would have for an academic peer. It is appropriate to call him out for glossing over the privacy issues around a mobile phone that is “always listening,” or the moral implications of the obsolescence of regulators for equality and justice. Is his objectivity compromised by the fact that he runs a publishing company that sells complementary goods to the vast supply of publicly available software and data? Does his business agenda incentivize him to obscure the subtle differences between various segments of his market? Are we in the university victims of that obscurity as we grapple with multiple conflated meanings of “openness” in software and science (open to scrutiny and accountability, vs. open for appropriation by business, vs. open to meritocratic contribution)? As we ask these questions, we can be grateful to O’Reilly for getting us this far.

I’ve emphasized the talks given by Norvig and O’Reilly because they exposed what I think are some of the most interesting aspects of BIDS. One way or another, it will be revolutionary. Its funders will be very disappointed if it is not. But exactly how it is revolutionary is undetermined. The fact that BIDS is based in Berkeley, and not in Google or Microsoft or Stanford, guarantees that the revolution will not be an insipid or smarmy one which brushes aside political conflict or morality. Rather, it promises to be the site of fecund political conflict. “Let the fun begin!” said Chandler.

The opening remarks concluded and we broke for lunch and poster sessions–the Data Science Faire (named after O’Reilly’s Maker Faire…

What followed was a fascinating panel discussion led by astrophysicist Josh Bloom, historian and university administrator Cathryn Carson, computer science professor and AMP Lab director Michael Franklin, and Deb Agrawal, a staff computer scientist for Lawrence Berkeley National Lab.

Bloom introduced the discussion jokingly as “just being among us scientists…and whoever is watching out there on the Internet,” perhaps nodding to the fact that the scientific community is not yet fully conscious that their expectations of privileged communication are being challenged by a world and culture of mobile devices that are “always listening.”

The conversation was about the role of people in data science.

Carson spoke as a domain scientist–a social scientist who studies scientists. Noting that social scientists tend to work in small teams led by graduate students motivated by their particular questions, she said her emphasis was on the people asking questions. Agrawal noted that the number of people needed to analyze a data set does not scale with the size of the data, but with its complexity–a practical point. (I’d argue that theoretically we might want to consider “size” of data in terms of its compressibility–which would reflect its complexity. This ignores a number of operational challenges.) For Franklin, people are a computational resource that can be part of a crowd-sourced process. In that context, the number of people needed does indeed scale with the use of people as data processors and sensors.

Perhaps to follow through on Norvig’s line of reasoning, Bloom then asked pointedly if machines would ever be able to do the asking of questions better than human beings. In effect: Would data science make data scientists obsolete?

Nobody wanted to be the first to answer this question. Bloom had to repeat it.

Agrawal took a first stab at it. The science does not come from the data; the scientist chooses models and tests them. This is the work of people. Franklin agreed and elaborated–the wrong data too early can ruin the science. Agrawal noted that computers might find spurious signals in the noise.

Personally, I find these unconvincing answers to Bloom’s question. Algorithms can generate, compare, and test alternative models against the evidence. Noise can, with enough data, be filtered away from the signal. To do so pushes the theoretical limits of computing and information theory, but if Franklin is correct in his earlier point that people are part of the computational process, then there is no reason in principle why these tasks too might not be performed, or at least assisted, by computers.

Carson, who had been holding back her answer to listen to the others, had a bolder proposal: rather than try to predict the future of science, why not focus on the task of building that future?

In another universe, at that moment someone might have asked the one question no computer could have answered. “If we are building the new future of science, what should we build? What should it look like? And how do we get there?” But this is the sort of question disciplined scientists are trained not to ask.

Instead, Bloom brought things back to practicality: we need to predict where science will go in order to know how to educate the next generation of scientists. Should we be focusing on teaching them domain knowledge, or on techniques?

We have at the heart of BIDS the very fundamental problem of free will. Bloom suggests that if we can predict the future, then we can train students in anticipation of it. He is an astrophysicist and studies stars; he can be forgiven for the assumption that bodies travel in robust orbits. This environment is a more complex one. How we choose to train students now will undoubtedly affect how science evolves, as the process of science is at once the process of learning and training new scientists. His descriptive question then falls back to the normative one: what science are we trying to build toward?

Carson was less heavy-handed than I would have been in her position. Instead, she asked Bloom how he got interested in data science. Bloom recalled his classical physics training, and the moment he discovered that to answer the kinds of questions he was asking, he would need new methods.

Franklin chimed in on the subject of education. He has heard it said that everyone in the next generation should learn to code. With marked humility for his discipline, he said he did not agree with this. But he said he did believe that everyone in the next generation should learn data literacy, echoing Norvig.

Bloom opened the discussion to questions from the audience.

The first was about the career paths for methodologists who write software instead of papers. How would BIDS serve them? It was a softball question which the panel hit out of the park. Bloom noted that the Moore and Sloan funders explicitly asked for the development of alternative metrics to measure the impact of methodologist contributions. Carson said that even with the development of metrics, as an administrator she knew it would be a long march through the institution to get those metrics recognized. There was much work to be done. “Universities got to change,” she rallied. “If we don’t change, Berkeley’s being great in the past won’t make it great in the future,” referring perhaps to the impressive history of research recounted by Chancellor Dirks. There was applause. Franklin pointed out that the open source community has its own metrics already. In some circles some of his students are more famous than he is for developing widely used software. Investors are often asking him when his students will graduate. The future, it seems, is bright for methodologists.

At this point I lost my Internet connection and had to stop livetweeting the panel; those tweets are the notes from which I am writing these reflections. Recalling from memory, there was one more question from Kristina Kangas, a PhD student in Integrative Biology. She cited research about how researchers interpreting data wind up reflecting back their own biases. What did this mean for data science?

Bloom gave Carson the last word. It is a social scientific fact, she said, that scientists interpret data in ways that fit their own views. So it’s possible that there is no such thing as “data literacy”. These are open questions that will need to be settled by debate. Indeed, what then is data science after all? Turning to Bloom, she said, “I told you I would be making trouble.”