Digifesto

Tag: bigbang

thinking about meritocracy in open source communities

There has been a trend in open source development culture over the past ten years or so. It is the rejection of ‘meritocracy’. Just now, I saw this Post-Meritocracy Manifesto, originally created by Coraline Ada Ehmke. It is exactly what it sounds like: an explicit rejection of meritocracy, specifically in open source development. It captures a recent progressive wing of software development culture. It is attracting signatories.

I believe this is a “trend” because I noticed a more subtle expression of similar ideas a few months ago. This came up when we were coming up with a Code of Conduct for BigBang. We wound up picking the Contributor Covenant Code of Conduct, though there are still some open questions about how to integrate it with our Governance policy.

This Contributor Covenant is widely adopted and the language of it seems good to me. I was surprised though when I found the rationale for it specifically mentioned meritocracy as a problem the code of conduct was trying to avoid:

Marginalized people also suffer some of the unintended consequences of dogmatic insistence on meritocratic principles of governance. Studies have shown that organizational cultures that value meritocracy often result in greater inequality. People with “merit” are often excused for their bad behavior in public spaces based on the value of their technical contributions. Meritocracy also naively assumes a level playing field, in which everyone has access to the same resources, free time, and common life experiences to draw upon. These factors and more make contributing to open source a daunting prospect for many people, especially women and other underrepresented people.

If it looks familiar, it may be because it was written by the same author, Coraline Ada Ehmke.

I have to admit that though I’m quite glad that we have a Code of Conduct now in BigBang, I’m uncomfortable with the ideological presumptions of its rationale and the rejection of ‘meritocracy’. There is a lot packed into this paragraph that is open to productive disagreement and which is not necessary for a commitment to the general point that harassment is bad for an open source community.

Perhaps this would be easier for me to ignore if this political framing did not mirror so many other political tensions today, and if open source governance were not something I’ve been so invested in understanding. I’ve taught a course on open source management, and BigBang spun out of that effort as an experiment in scientific analysis of open source communities. I am, I believe, deep in on this topic.

So what’s the problem? The problem is that I think there’s something painfully misaligned about criticism of meritocracy in culture at large and open source development, which is a very particular kind of organizational form. There is also perhaps a misalignment between the progressive politics of inclusion expressed in these manifestos and what many open source communities are really trying to accomplish. Surely there must be some kind of merit that is not in scare quotes, or else there would not be any good open source software to use or raise a fuss about.

Though it does not directly address the issue, I’m reminded of an old email discussion on the Numpy mailing list that I found when I was trying to do ethnographic work on the Scientific Python community. It was written by John Hunter, the creator of Matplotlib, in response to concerns about corporate control over NumPy that were raised when Travis Oliphant, the leader of NumPy, started Continuum Analytics. Hunter quite thoughtfully, in my opinion, debunked the idea that open source governance should be a ‘democracy’, like many people assume institutions ought to be by default. After a long discussion about how Travis had great merit as a leader, he argued:

Democracy is something that many of us have grown up by default to consider as the right solution to many, if not most or, problems of governance. I believe it is a solution to a specific problem of governance. I do not believe democracy is a panacea or an ideal solution for most problems: rather it is the right solution for which the consequences of failure are too high. In a state (by which I mean a government with a power to subject its people to its will by force of arms) where the consequences of failure to submit include the death, dismemberment, or imprisonment of dissenters, democracy is a safeguard against the excesses of the powerful. Generally, there is no reason to believe that the simple majority of people polled is the “best” or “right” answer, but there is also no reason to believe that those who hold power will rule beneficiently. The democratic ability of the people to check to the rule of the few and powerful is essential to insure the survival of the minority.

In open source software development, we face none of these problems. Our power to fork is precisely the power the minority in a tyranical democracy lacks: noone will kill us for going off the reservation. We are free to use the product or not, to modify it or not, to enhance it or not.

The power to fork is not abstract: it is essential. matplotlib, and chaco, both rely *heavily* on agg, the Antigrain C++ rendering library. At some point many years ago, Maxim, the author of Agg, decided to change the license of Agg (circa version 2.5) to GPL rather than BSD. Obviously, this was a non-starter for projects like mpl, scipy and chaco which assumed BSD licensing terms. Unfortunately, Maxim had a new employer which appeared to us to be dictating the terms and our best arguments fell on deaf ears. No matter: mpl and Enthought chaco have continued to ship agg 2.4, pre-GPL, and I think that less than 1% of our users have even noticed. Yes, we forked the project, and yes, noone has noticed. To me this is the ultimate reason why governance of open source, free projects does not need to be democratic. As painful as a fork may be, it is the ultimate antidote to a leader who may not have your interests in mind. It is an antidote that we citizens in a state government may not have.

It is true that numpy exists in a privileged position in a way that matplotlib or scipy does not. Numpy is the core. Yes, Continuum is different than STScI because Travis is both the lead of Numpy and the lead of the company sponsoring numpy. These are important differences. In the worst cases, we might imagine that these differences will negatively impact numpy and associated tools. But these worst case scenarios that we imagine will most likely simply distract us from what is going on: Travis, one of the most prolific and valuable contributers to the scientific python community, has decided to refocus his efforts to do more. And that is a very happy moment for all of us.

This is a nice articulation of how forking, not voting, is the most powerful governance mechanism in open source development, and how it changes what our default assumptions about leadership ought to be. A critical but I think unacknowledged question is how the possibility of forking interacts with the critique of meritocracy in organizations in general, and specifically what that means for community inclusiveness as a goal in open source communities. I don’t think it’s straightforward.

Note: Nick Doty has written a nice response to this on his blog.

moved BigBang core repository to DATACTIVE organization

I made a small change this evening which I feel really, really good about.

I transferred the BigBang project from my personal GitHub account to the datactive organization.

I’m very grateful for DATACTIVE’s interest in BigBang and am excited to turn over the project infrastructure to their stewardship.

The node.js fork — something new to think about

For Classics we are reading Albert Hirschman’s Exit, Voice, and Loyalty. Oddly, though normally I hear about ‘voice’ as an action from within an organization, the first few chapters of the book (including the introduction of the Voice concept itself) are preoccupied with elaborations on the neoclassical market mechanism. Not what I expected.

I’m looking for interesting research use cases for BigBang, which is about analyzing the sociotechnical dynamics of collaboration. I’m building it to better understand open source software development communities, primarily. This is because I want to create a harmonious sociotechnical superintelligence to take over the world.

For a while I’ve been interested in Hadoop’s interesting case of being one software project with two companies working together to build it. This is reminiscent (for me) of when we started GeoExt at OpenGeo and Camp2Camp. The economics of shared capital are fascinating and there are interesting questions about how human resources get organized in that sort of situation. In my experience, a tension emerges between the needs of firms to differentiate their products and make good on their contracts and the needs of the developer community whose collective value is ultimately tied to the robustness of their technology.

Unfortunately, building out BigBang to integrate with various email, version control, and issue tracking backends is a lot of work, and there’s only one of me right now to build the infrastructure, do the research, and train new collaborators (who are starting to do some awesome work, so this is paying off). While integrating with Apache’s infrastructure would have been a smart first move, instead I chose to focus on Mailman archives and git repositories. Google Groups and whatever Apache is using for their email lists do not publish their archives in .mbox format, which is a pain for me. But luckily Google Takeout does export data from folks’ online inboxes in .mbox format. This is great for BigBang because it means we can investigate email data from any project for which we know an insider willing to share their records.
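
Conveniently, Python’s standard library can parse .mbox files directly, so a Takeout export drops straight into the pipeline. Here’s a minimal sketch using the stdlib `mailbox` module; the sample messages and filename are made up for illustration, standing in for a real Takeout export:

```python
import mailbox

# Write a tiny two-message mbox to illustrate the format
# (in practice this file would be a real Takeout export).
sample = (
    "From alice@example.org Thu Jan  1 00:00:00 2015\n"
    "From: alice@example.org\n"
    "Subject: hello\n"
    "\n"
    "First message body.\n"
    "\n"
    "From bob@example.org Thu Jan  1 00:01:00 2015\n"
    "From: bob@example.org\n"
    "Subject: re: hello\n"
    "\n"
    "Second message body.\n"
)
with open("sample.mbox", "w") as f:
    f.write(sample)

# mailbox.mbox parses the file into Message objects with header access
mb = mailbox.mbox("sample.mbox")
senders = [msg["From"] for msg in mb]
print(senders)  # ['alice@example.org', 'bob@example.org']
```

The nice thing is that the same few lines work whether the archive came from a public Mailman pipermail dump or from somebody’s private inbox export.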

Does a research ethics issue arise when you start working with email that is openly archived in a difficult format, then exported from somebody’s private email? Technically you get header information that wasn’t open before–perhaps it was ‘private’. But arguably this header information isn’t personal information. I think I’m still in the clear. Plus, IRB will be irrelevant when the robots take over.

All of this is a long way of getting around to talking about a new thing I’m wondering about, the Node.js fork. It’s interesting to think about open source software forks in light of Hirschman’s concepts of Exit and Voice since so much of the activity of open source development is open, virtual communication. While you might at first think a software fork is definitely a kind of Exit, it sounds like IO.js was perhaps a friendly fork by somebody who just wanted to hack around. In theory, code can be shared between forks–in fact this was the principle that GitHub’s forking system was founded on. So there are open questions (to me, who isn’t involved in the Node.js community at all and is just now beginning to wonder about it) along the lines of to what extent a fork is a real event in the history of the project, vs. to what extent it’s mythological, vs. to what extent it’s a reification of something that was already implicit in the project’s sociotechnical structure. There are probably other great questions here as well.

A friend on the inside tells me all the action on this happened (is happening?) on the GitHub issue tracker, which is definitely data we want to get BigBang connected with. Blissfully, there appear to be well supported Python libraries for working with the GitHub API. I expect the first big hurdle we hit here will be rate limiting.

Though we haven’t been able to make integration work yet, I’m still hoping there’s some way we can work with MetricsGrimoire. They’ve been a super inviting community so far. But our software stacks and architecture are just different enough, and the layers we’ve built so far thin enough, that it’s hard to see how to do the merge. A major difference is that while MetricsGrimoire tools are built to provide application interfaces around a MySQL data backend, since BigBang is foremost about scientific analysis, our whole data pipeline is built to get things into Pandas dataframes. Both projects are in Python. This too is a weird microcosm of the larger sociotechnical ecosystem of software production, of which the “open” side is only one (important) part.
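
Concretely, the dataframe-first pipeline means that once messages are parsed, everything becomes tabular and analysis is a one-liner away. A minimal sketch, with made-up message records standing in for parsed .mbox data:

```python
import pandas as pd

# Hypothetical parsed messages; in BigBang these would come
# from .mbox archives rather than literals.
messages = [
    {"From": "alice@example.org", "Date": "2015-01-01", "Subject": "hello"},
    {"From": "bob@example.org", "Date": "2015-01-02", "Subject": "re: hello"},
]

df = pd.DataFrame(messages)
df["Date"] = pd.to_datetime(df["Date"])

# A typical analysis step: message counts per sender
counts = df.groupby("From").size()
```

Bridging that to a MySQL-backed toolchain would mean either teaching our loaders to read their schema or dumping their tables into dataframes, neither of which is trivial with the thin layers we have now.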

picking a data backend for representing email in #python

I’m at a difficult crossroads with BigBang where I need to pick an appropriate data storage backend for my preprocessed mailing list data.

There are a lot of different aspects to this problem.

The first and most important consideration is speed. If you know anything about computer science, you know that it exists to quickly execute complex tasks that would take too long to do by hand. It’s odd writing that sentence since computational complexity considerations are so fundamental to algorithm design that this can go unspoken in most technical contexts. But since coming to grad school I’ve found myself writing for a more diverse audience, so…

The problem I’m facing is that in doing exploratory data analysis, I do not know all the questions I am going to ask yet. But any particular question will be impractical to ask unless I tune the underlying infrastructure to answer it. This chicken-and-egg problem means that the process of inquiry is necessarily constrained by the engineering options that are available.

This is not new in scientific practice. Notoriously, the field of economics in the 20th century was shaped by what was analytically tractable as formal, mathematical results. The nuance of contemporary modeling of complex systems is due largely to the fact that we now have computers to do this work for us. That means we can still have the intersubjectively verified rigor that comes with mathematization without trying to fit square pegs into round holes. (Side note: something mathematicians acknowledge that others tend to miss is that mathematics is based on dialectic proof and intersubjective agreement. This makes it much closer epistemologically to something like history as a discipline than it is to technical fields dedicated to prediction and control, like chemistry or structural engineering. Computer science is in many ways an extension of mathematics. Obviously, these formalizations are then applied to great effect. Their power comes from their deep intersubjective validity–in other words, their truth. Disciplines that have dispensed with intersubjective validity as a grounds for truth claims in favor of a more nebulous sense of diverse truths in a manifold of interpretation have difficulty understanding this and so are likely to see the institutional gains of computer scientists to be a result of political manipulation, as opposed to something more basic: mastery of nature, or more provocatively, use of force. This disciplinary dysfunction is one reason why these groups see their influence erode.)

For example, I have determined that in order to implement a certain query on the data efficiently, it would be best if another query were constant time. One way to do this is to use a database with an index.

However, setting up a database is something that requires extra work on the part of the programmer and so makes it harder to reproduce results. So far I have been keeping my processed email data “in memory” after it is pulled from files on the file system. This means that I have access to the data within the programming environment I’m most comfortable with, without depending on an external or parallel process. Fewer moving parts means that it is simpler to do my work.
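
One middle path, a sketch of what the in-memory approach actually looks like: build a plain dict keyed on the field you need to query, which gives you the constant-time lookup a database index would, with no external process. The message records here are hypothetical placeholders:

```python
# A minimal in-memory index, standing in for a database index:
# build it once in O(n), then lookups by sender are constant time.
messages = [
    {"id": 1, "from": "alice@example.org"},
    {"id": 2, "from": "bob@example.org"},
    {"id": 3, "from": "alice@example.org"},
]

by_sender = {}
for msg in messages:
    by_sender.setdefault(msg["from"], []).append(msg["id"])

print(by_sender["alice@example.org"])  # [1, 3]
```

The catch, of course, is that the index has to be rebuilt every run and has to fit in memory, which is exactly the tradeoff at issue.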

So there is a tradeoff between the computational time of the software as it executes and the time and attention it takes me (and others who want to reproduce my results) to set up the environment in which the software runs. Since I am running this as an open source project and hope others will build on my work, I have every reason to be lazy, in a certain sense. Every inconvenience I suffer is one that will be suffered by everyone that follows me. There is a Kantian categorical imperative to keep things as simple as possible for people, to take any complex procedure and replace it with a script, so that others can do original creative thinking, solve the next problem. This is the imperative that those of us embedded in this culture have internalized. (G. Coleman notes that there are many cultures of hacking; I don’t know how prevalent these norms are, to be honest; I’m speaking from my experience.) It is what makes this social process of developing our software infrastructure a social one with a modernist sense of progress. We are part of something that is being built out.

There are also social and political considerations. I am building this project intentionally in a way that is embedded within the Scientific Python ecosystem, as they are also my object of study. Certain projects are trendy right now, and for good reason. At the Python Worker’s Party at Berkeley last Friday, I saw a great presentation of Blaze. Blaze is a project that allows programmers experienced with older idioms of scientific Python programming to transfer their skills to systems that can handle more data, like Spark. This is exciting for the Python community. In such a fast moving field with multiple interoperating ecosystems, there is always the anxiety that one’s skills are no longer the best skills to have. Has your expertise been made obsolete? So there is a huge demand for tools that adapt one way of thinking to a new system. As more data has become available, people have engineered new sophisticated processing backends. Often these are not done in Python, which has a reputation for being very usable and accessible but slow to run in operation. Getting the usable programming interface to interoperate with the carefully engineered data backends is hard work, work that Matt Rocklin is doing while being paid by Continuum Analytics. That is sweet.

I’m eager to try out Blaze. But as I think through the questions I am trying to ask about open source projects, I’m realizing that they don’t fit easily into the kind of data processing that Blaze currently supports. Perhaps this is dense on my part. If I knew better what I was asking, I could maybe figure out how to make it fit. But probably, what I’m looking at is data that is not “big”, that does not need the kind of power that these new tools provide. Currently my data fits on my laptop. It even fits in memory! Shouldn’t I build something that works well for what I need it for, and not worry about scaling at this point?

But I’m also trying to think long-term. What happens if and when it does scale up? What if I want to analyze ALL the mailing list data? Is that “big” data?

“Premature optimization is the root of all evil.” – Donald Knuth
