Category: open source software

Sample UC Berkeley School of Information Preliminary Exam

I’m in the PhD program at UC Berkeley’s School of Information. Today, I had to turn in my Preliminary Exam, a 24-hour open book, open note examination on the chosen subject areas of my coursework. I got to pick an exam committee of three faculty members, one for each area of speciality. My committee consisted of: Doug Tygar, examining me on Information System Design; John Chuang, the committee chair, examining me on Information Economics and Policy; and Coye Cheshire, examining me on Social Aspects of Information. Each asked me a question corresponding to their domain; generously, they targeted their questions at my interests.

In keeping with my personal policy of keeping my research open, because I learned of the unveiling of @horse_ebooks while taking the exam and couldn’t resist working it into my answers, and because somebody enrolled in or thinking about applying to our PhD program might find it interesting, I’m posting my examination here (with some webifying of links).

At the time of this posting, I don’t yet know if I have passed.

1. Some e-mail spam detectors use statistical machine learning methods to continuously retrain a classifier based on user input (marking messages as spam or ham). These systems have been criticized for being vulnerable to mistraining by a skilled adversary who sends “tricky spam” that causes the classifier to be poisoned. Exam question: Propose tests that can determine how vulnerable a spam detector is to such manipulation. (Please limit your answer to two pages.)

Tests for classifier poisoning vulnerability in statistical spam filtering systems can consist of simulating particular attacks that would exploit these vulnerabilities. Many of these tests are described in Graham-Cumming, “Does Bayesian Poisoning exist?”, 2006 [pdf.gz], including:

  • For classifiers trained on a “natural” training data set D and a modified training data set D’ that has been generated to include more common words in messages labeled as spam, compare the specificity, sensitivity, or, more generally, the ROC plots of the two for performance. This simulates an attack that aims to increase the false positive rate by making words common in hammy messages be evaluated as spammy.
  • Same as above, but construct D’ to include many spam messages with unique words. This exploits a tendency in some Bayesian spam filters to measure the spamminess of a word by the percentage of spam messages that contain it. If successful, the attack dilutes the classifier’s sensitivity to spam over a variety of nonsense features, allowing more mundane spam to get through the filter as false negatives.

These two tests depend on increasing the number of spam messages in the data set in a way that strategically biases the classifier. This is the most common form of mistraining attack. Interestingly, these attacks assume that users will correctly label the poisoning messages as spam. So these attacks depend on weaknesses in the filter’s feature model and improper calibration to feature frequency.
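The first of these tests can be sketched with a toy Graham-style word-probability filter. This is a minimal illustration, not any production filter’s actual model: the messages, the poison payload, and the zero decision threshold are all invented for the example.

```python
import math
from collections import Counter

def train(messages):
    """messages: list of (text, label) pairs, label 'spam' or 'ham'.
    Returns a Graham-style per-word spam probability table."""
    spam_counts, ham_counts = Counter(), Counter()
    n_spam = n_ham = 0
    for text, label in messages:
        words = set(text.split())
        if label == "spam":
            n_spam += 1
            spam_counts.update(words)
        else:
            n_ham += 1
            ham_counts.update(words)
    probs = {}
    for w in set(spam_counts) | set(ham_counts):
        ps = spam_counts[w] / max(n_spam, 1)  # fraction of spam containing w
        ph = ham_counts[w] / max(n_ham, 1)    # fraction of ham containing w
        probs[w] = ps / (ps + ph) if ps + ph else 0.5
    return probs

def spamminess(text, probs):
    """Combine per-word probabilities in log-odds space; > 0 reads as 'spam'."""
    score = 0.0
    for w in set(text.split()):
        p = min(max(probs.get(w, 0.5), 0.01), 0.99)  # clip to avoid infinities
        score += math.log(p / (1 - p))
    return score

def false_positive_rate(ham_messages, probs):
    flagged = sum(1 for m in ham_messages if spamminess(m, probs) > 0)
    return flagged / len(ham_messages)

# Natural training set D, and D' padded with "tricky spam" that repeats hammy words.
D = [("viagra pills cheap", "spam"), ("cheap pills offer", "spam"),
     ("meeting lunch tomorrow", "ham"), ("budget report notes", "ham")]
poison = [("meeting lunch cheap", "spam")] * 4
held_out_ham = ["meeting lunch", "lunch meeting"]

fpr_clean = false_positive_rate(held_out_ham, train(D))
fpr_poisoned = false_positive_rate(held_out_ham, train(D + poison))
```

On this toy data, the poisoned filter starts flagging legitimate “meeting lunch” mail, raising the false positive rate exactly as the attack intends; a real test would sweep the decision threshold and compare full ROC curves for D and D’.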

A more devious attack of this kind would depend on deceiving the users of the filtering system into mislabeling spam as ham or, more dramatically, into labeling as ham messages that drive up the hamminess of features normally found in spam.

An example of an attack of this kind (though perhaps not intended as an attack per se) is @Horse_ebooks, a Twitter account that gained popularity while posting randomly chosen bits of prose and, only occasionally, links to purchase low quality self-help ebooks. Allegedly, it was originally a spam bot engaged in a poisoning/evasion attack, but developed a cult following who appreciated its absurdist poetic style. Its success (which only grew after the account was purchased by New York based performance artist Jacob Bakkila in 2011) inspired an imitative style of Twitter activity.

Assuming Twitter is retraining on this data, this behavior could be seen as a kind of poisoning attack, albeit one waged by the filter’s users against the system itself. Since it may benefit some Twitter users to have an inflated number of “followers” to project an exaggerated image of their own importance, it’s not clear whether it is in the interests of the users to assist in spam detection, or to sabotage it.

Whatever the interests involved, testing for vulnerability to this “tricky ham” attack can be conducted in a similar way to the other attacks: by padding the modified data set D’ with additional samples with abnormal statistical properties (e.g., noisy words and syntax), this time labeled as ham, and comparing the classifiers along the usual performance metrics.

2. Analytical models of cascading behavior in networks, e.g., threshold-based or contagion-based models, are well-suited for analyzing the social dynamics in open collaboration and peer production systems. Discuss.

Cascading behavior models are well-suited to modeling information and innovation diffusion over a network. They are well-suited to analyzing peer production systems to the extent that their dynamics consist of such diffusion over non-trivial networks. This is the case when production is highly decentralized. Whether we see peer production as centralized or not depends largely on the scale of analysis.

Narrowing in, consider the problem of recruiting new participants to an ongoing collaboration around a particular digital good, such as an open source software product or free encyclopedia. We should expect the usual cascading models to be informative about the awareness and adoption of the good. But in most cases awareness and adoption are necessary but not sufficient conditions for active participation in production. This is because, for example, contribution may involve incurring additional costs and so be subject to different constraints than merely consuming or spreading the word about a digital good.

Though threshold and contagion models could be adapted to capture some of this reluctance through higher thresholds or lower contagion rates, these models fail to closely capture the dynamics of complex collaboration because they represent the cascading behavior as homogeneous. In many open collaborative projects, contributions (and the individual costs of providing them) are specialized. Recruited participants come equipped with their unique backgrounds. (von Krogh, G., Spaeth, S. & Lakhani, K. R. “Community, joining, and specialization in open source software innovation: a case study.” (2003)) So adapting behavior cascade models to this environment would require, at minimum, parameterization of per-node capacities for project contribution. The participants in complex collaboration fulfill ecological niches more than they reflect the dynamics of large networked populations.

Furthermore, at the level of a closely collaborative on-line community, network structure is often trivial. Projects may be centralized around a mailing list, source code repository, or public forum that effectively makes the communication network a large clique of all participants. Cascading behavior models will not help with analysis of these cases.

On the other hand, if we zoom out to look at open collaboration as a decentralized process–say, of all open source software developers, or of distributed joke production on Weird Twitter–then network structure becomes important again, and the effects of diffusion may dominate the internal dynamics of innovation itself. Whether or not a software developer chooses to code in Python or Ruby, for example, may well depend on a threshold of the developer’s neighbors in a communication network. These choices allow for contagious adoption of new libraries and code.
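This kind of neighbor-threshold adoption is easy to sketch. Below is a minimal linear threshold model; the developer network, thresholds, and seed set are made up for illustration, and per-node thresholds are one simple way to admit the heterogeneity discussed above.

```python
def threshold_cascade(neighbors, thresholds, seeds):
    """Linear threshold model: a node adopts once the fraction of its
    neighbors that have already adopted reaches its own threshold."""
    adopted = set(seeds)
    changed = True
    while changed:  # iterate until no node changes state
        changed = False
        for node, nbrs in neighbors.items():
            if node in adopted or not nbrs:
                continue
            frac = sum(1 for n in nbrs if n in adopted) / len(nbrs)
            if frac >= thresholds[node]:
                adopted.add(node)
                changed = True
    return adopted

# Toy network: a chain of developers; "dev1" adopts Python first.
neighbors = {"dev1": ["dev2"], "dev2": ["dev1", "dev3"],
             "dev3": ["dev2", "dev4"], "dev4": ["dev3"]}
easy = {d: 0.5 for d in neighbors}      # low thresholds everywhere
reluctant = dict(easy, dev2=0.6)        # one reluctant node in the chain

full = threshold_cascade(neighbors, easy, {"dev1"})        # cascade completes
stalled = threshold_cascade(neighbors, reluctant, {"dev1"})  # cascade blocked
```

With uniform low thresholds the adoption cascades down the whole chain; raising a single node’s threshold blocks everything downstream, which is exactly the sensitivity to network position that makes these models informative at the decentralized scale.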

We could imagine a distributed innovation system in which every node maintained its own repository of changes, some of which it developed on its own and others it adapted from its neighbors. Maybe the network of human innovators, each drawing from their experiences and skills while developing new ones in the company of others, is like this. This view highlights the emergent social behavior of open innovation, putting the technical architecture (which may affect network structure but could otherwise be considered exogenous) in the background. (See next exam question).

My opinion is that while cascading behavior models may, in decentralized conditions, capture important aspects of the dynamics of peer production, the basic models will fall short because they don’t consider the interdependence of behaviors. Digital products are often designed for penetration in different networks. For example, the choice of programming language in which to implement one’s project influences its potential for early adoption and recruitment. Analytic modeling of these diffusion patterns with cascade models could gain from augmenting the model with representations of technical dependency.

3. Online communities present many challenges for governance and collective behavior, especially in common pool and peer-production contexts. Discuss the relative importance and role of both (1) site architectures and (2) emergent social behaviors in online common pool and/or peer-production contexts. Your answer should draw from more than one real-world example and make specific note of key theoretical perspectives to inform your response. Your response should take approximately 2 pages.

This question requires some unpacking. The sociotechnical systems we are discussing are composed of both technical architecture (often accessed as a web site, i.e. a “location” accessed through HTTP via a web browser) and human agents interacting socially with each other in a way mediated by the architecture (though not exclusively, cf. Coleman’s work on in-person meetings in hacker communities). If technology is “a man-made means to an end” (Heidegger, The Question Concerning Technology), then we can ask of the technical architecture: which man, whose end? So questioning the roles of on-line architecture and emergent behaviors brings us to look at how the technology itself was the result of emergent social behavior of its architects. For we can consider “importance” from either the perspective of the users or that of the architects. These perspectives reflect different interests and so will have different standards for evaluating the importance of the system’s components. (cf. Habermas, Knowledge and Human Interests)

Let us consider socio-technical systems along a spectrum between two extremes. At one extreme are certain prominent systems–e.g. Yelp and Amazon cultivating common pools of reviews–for which the architects and the users are distinct. The site architecture is a means to the ends of the architects, effected through the stimulation of user activity.

Architects acting on users through technology

Drawing on Winner (“Do artifacts have politics?”), we can see that this socio-technical arrangement establishes a particular pattern of power and authority. Architects have direct control over the technology, which enables user activity within the limits of its affordances. Users can influence architects through the information their activity generates (often collected through the medium of the technical architecture itself), but have no direct coercive control. Rather, architects design the technology to motivate certain desirable activity using inter-user feedback mechanisms such as ways of expressing gratitude or comparing one’s performance with others. (see Cheshire and Antin, “The Social Psychological Effects of Feedback on the Production of Internet Information Pools”, 2008) In such a system, users can only gain control of their technical environment by exploiting vulnerabilities in the architecture in adversarial moves that look a bit like security breaches. (See the first exam question for an example of user-driven information sabotage.) More likely, the vast majority of users will choose to free ride on any common pool resources made available and exit the system when inconvenienced, as the environment is ultimately a transactional one of service provider and consumer.

In these circumstances, it is only by design that social behaviors lead to peer production and common pools of resources. Technology, as an expression of the interests of the architects, plays a more important role than social emergence. To clarify the point, I’d argue that Facebook, despite hosting enormous amounts of social activity, does not enable significant peer production because its main design goals are to drive the creation of proprietary user data and ad clicks. Twitter, in contrast, has from the beginning been designed as a more open platform. The information shared on it is often less personal, so activity more easily crosses the boundary from private to public, enabling collective action (see Bimber et al., “Reconceptualizing Collective Action in the Contemporary Media Environment”, 2005). It has facilitated (with varying consistency) the creation of third party clients, as well as applications that interact with its data but can be hosted as separate sites.

This open architecture is necessary but not sufficient for emergent common pool behavior. But the design for open possibilities is significant: it enables the development of novel, intersecting architectures to support the creation of new common pools. Taking Weird Twitter, framed as a peer production community for high quality tweets, as an example, we can see how the service Favstar (which aggregates and ranks tweets that have been highly “starred” and retweeted, and awards congratulatory tweets as prizes) provides historical reminders and relative rankings of tweet quality, thereby facilitating a culture of production. Once formed, such a culture can spread and make use of other available architecture as well. Weird Twitter has inspired Twitter: The Comic, a Tumblr account illustrating “the greatest tweets of our generation.”

Consider another extreme case, the free software community that Kelty identifies as the recursive public. (Two Bits: The Cultural Significance of Free Software) In an idealized model, we could say that in this socio-technical system the architects and the users are the same.

Recursive public diagram

The artifacts of the recursive public have a different politics than those at the other end of our spectrum, because the coercive aspects of the architectural design are the consequences of the emergent social behavior of those affected by it. Consequently, technology created in this way is rarely restrictive of productive potential, but on the contrary is designed to further empower the collaborative communities that produced it. The history of Unix, Mozilla, Emacs, version control systems, issue tracking software, Wikimedia, and the rest can be read as the historical unfolding of the human interest in an alternative, emancipated form of production. Here, the emergent social behavior claims its importance over and above the particulars of the technology itself.

Reinventing wheels with Dissertron

I’ve found a vehicle for working on the Dissertron through the website of the course I’ll be co-teaching this Fall on Open Collaboration and Peer Production.

In the end, I went with Pelican, not Hyde. It was a difficult decision (because I did like the idea of supporting an open source project coming out of Chennai, especially after reading Coding Places), but I had it on good authority that Pelican was more featureful, with cleaner code. So here I am.

The features I have in mind are crystallizing as I explore the landscape of existing tools more. This is my new list:

  • Automatically include academic metadata on each Dissertron page so it’s easy to slurp it into Zotero.
  • Include the Hypothes.is widget for annotations. I think Hypothes.is will be better for commenting than Disqus because it does annotations in-line, as opposed to comments in the footer. It also uses the emerging W3C Open Annotation standard. I’d like this to be as standards-based as possible.
  • Use citeproc-js to render citations in the browser cleanly. I think this handles the issue of in-line linked academic citations without requiring a lot of manual work. citeproc-js looks like it has come out of the Zotero project as well. Since Elsevier bought Mendeley, Zotero seems like the more reliable ally to pick for independent scholarship.
  • Trickiest is going to be porting a lot of features from jekyll-scholar into a Pelican plug-in. I really want jekyll-scholar’s bibliographic management. But I’m a little worried that Pelican isn’t well-designed for that sort of flexibility in theming. More soon.
  • I’m interested in trying to get the HTML output of Dissertron as close as possible to emerging de facto standards on what on-line scholarship is like. I’ve asked about what PLOS ONE does about this. The answer sounds way complicated: a tool chain that goes from LaTeX to Word docs to NLM 3.0 XML (which I didn’t even know was a thing), and at last into HTML. I’m trying to start from Markdown because I think it’s a simple markup language for the future, but I’m not deep enough in that tool chain to understand how to replicate its idiosyncrasies.

If I could have all these nice things, and maybe a pony, then I would be happy and have no more excuses for not actually doing research, as opposed to obsessing about the tooling around independent publishing.

Planning the Dissertron

In my PhD program, I’ve recently finished my coursework and am meant to start focusing on research for my dissertation. Maybe because of the hubbub around open access research, maybe because I still see myself as a ‘hacker’, maybe because it’s somehow recursively tied into my research agenda, or because I’m an open source dogmatist, I’ve been fantasizing about the tools and technology of publication that I want to work on my dissertation with.

For this project, which I call the Dissertron, I’ve got a loose bundle of requirements feature creeping its way into outer space:

  1. Incremental publishing of research and scholarship results openly to the web.
  2. Version control.
  3. Mathematical rendering a la LaTeX.
  4. Code highlighting a la the hacker blogs.
  5. In browser rendering of data visualizations with d3, where appropriate.
  6. Site is statically generated from elements on the file system, wherever possible.
  7. Machine readable metadata on the logical structure of the dissertation argument, which gets translated into static site navigation elements.
  8. Easily generated glossary with links for looking up difficult terms in-line (or maybe in-margin).
  9. A citation system that takes advantage of hyperlinking between resources wherever possible.
  10. Somehow, enable commenting. But more along the lines of marginalia comments (comments on particular lines or fragments of text) rather than blog comments. “Blog” style comments should be facilitated as notes on separately hosted dissertrons, or maybe a dissertron hub that aggregates and coordinates pollination of content between dissertrons.

This is a lot, and arguably just a huge distraction from working on my dissertation. However, it seems like this or something like it is a necessary next step in the advance of science and I don’t see how I really have much choice in the matter.

Unfortunately, I’m traveling, so I’m going to miss the PLOS workshop on Markdown for Science tomorrow. That’s really too bad, because Scholarly Markdown would get me maybe 50% of the way to what I want.

Right now the best tool chain I can imagine for this involves Scholarly Markdown, run using Pandoc, which I just now figured out is developed by a philosophy professor at Berkeley. Backing it by a Git repository would allow for incremental changes and version control.

Static site generation and hosting is a bit trickier. I feel like GitHub’s support of Jekyll makes it a compelling choice, but hacking it to fit the academic frame I’m thinking in might be more trouble than it’s worth. While it’s a bit of an oversimplification to say this, my impression is that, at my university at least, there is a growing movement to adopt Python as the programming language of choice for scientific computing. The exceptions seem to be people in the Computer Science department who are backing Scala.

(I like both languages and so can’t complain, except that it makes it harder to do interdisciplinary research if there is a technical barrier in their toolsets. As more of scientific research becomes automated, it is bound to get more crucial that scientific processes (broadly speaking) inter-operate. I’m incidentally excited to be working on these problems this summer for Berkeley’s new Social Science Data Lab. A lot of interesting architectural design is being masterminded by Aaron Culich, who manages the EECS department’s computing infrastructure. I’ve been meaning to blog about our last meeting for a while…but I digress)

Problem is, neither Python nor Scala is Ruby, and Ruby is currently leading the game (in my estimation; somebody tell me if I’m wrong) in flexible and sexy smooth usable web design. And then there’s JavaScript, improbably leaking into the back end of the software stack after overflowing the client side.

So for the aspiring open access indie web hipster hacker science self-publisher, it’s hard to navigate the technical terrain. I’m tempted to string together my own rig depending mostly on Pandoc, but even that’s written in Haskell.

These implementation-level problems suggest that the problem needs to be pushed up a level of abstraction to the question of API and syntax standards around scientific web publishing. Scholarly Markdown can be a standard, hopefully with multiple implementations. Maybe there needs to be a standard around web citations as well (since in an open access world, we don’t need the same level of indirection between a document and the works it cites. Like blog posts, web publications can link directly to the content they derive from.)

POSSE homework: how to contribute to FOSS without coding

One of the assignments for the POSSE workshop is the question of how to contribute to FOSS when you aren’t a coder.

I find this an especially interesting topic because I think there’s a broader political significance to FOSS, but those who see FOSS as merely the domain of esoteric engineers can sometimes be a little freaked out by this idea. It also involves broader theoretical questions about whether or how open source jibes with participatory design.

In fact, the POSSE organizers have compiled a list of lists of ways to contribute to FOSS without coding: this, this, this, and this are provided in the POSSE syllabus.

Turning our attention from the question in the abstract, we’re meant to think about it in the context of our particular practices.

For our humanitarian FOSS project of choice, how are we interested in contributing? I’m fairly focused in my interests on open source participation these days: I’m very interested in the problem of community metrics and especially how innovation happens and diffuses within these communities. I would like to be able to build a system for evaluating that kind of thing that can be applied broadly to many projects. Ideally, it could do things like identify talented participants across multiple projects, or suggest interventions for making projects work better.

It’s an ambitious research project, but one for which there is plenty of data to investigate from the open source communities themselves.

What about teaching a course on such a thing? I anticipate that my students are likely to be interested in design as well as positioning their own projects within the larger open source ecosystem. Some of the people who I hope will take the class have been working on FuturePress, an open source e-book reading platform. As they grow the project and build the organization around it, they will want to be working with constituent technologies and devising a business model around their work. How can a course on Open Collaboration and Peer Production support that?

These concerns touch on so many issues outside of the consideration of software engineering narrowly (including industrial organization, communication, social network theory…) that it’s daunting to try to fit it all into one syllabus. But we’ve been working on one that has a significant hands-on component as well. Really I think the most valuable skill in the FOSS world is having the chutzpah to approach a digital community, propose what you are thinking, and take the criticism or responsibility that comes with that.

What concrete contribution a student uses to channel that energy should…well, I feel like it should be up to them. But is that enough direction? Maybe I’m not thinking concretely enough for this assignment myself.

POSSE and the FOSS field trip

I’m excited to be participating in POSSE — Professor’s Open Source Software Experience — this coming weekend in Philadelphia. It’s designed to train computer science professors how to use participation in open source communities to teach computer science. Somehow, they let me in as a grad student.

My goals are somewhat unusual for the program, I imagine. I’m not even in a computer science department. But I do have background in open source development, and this summer I’ll be co-teaching a course on Open Collaboration and Peer Production at Berkeley’s I School. We aren’t expecting all the students to be coders, so though there is a hands on component, it may be with other open collaborative projects like Wikipedia or OpenStreetMap. Though we aren’t filling a technical requirement (the class is aimed at Masters students), it will fulfill a management requirement. So you might say it’s a class on theory and practice of open collaborative community management.

My other education interest besides my own teaching is in the role of open source and computer science education in the larger economy. Lately there have been a lot of startups around programmer education. That makes sense, because demand for programming talent exceeds supply now, and so there’s an opportunity to train new developers. I’m curious whether it would be possible to build and market an on-line, MOOC-style programming course based on apprenticeship within open source communities.

One of our assignments to complete before the workshop is an open source community field trip. We’re supposed to check out a number of projects and get a sense of how to distinguish between strong and weak projects at a glance.

The first thing I noticed was that SourceForge is not keeping up with web user experience standards. That’s not so surprising, since as a FOSS practitioner I’m more used to projects hosted on GitHub and Google Code. Obviously, that hasn’t always been the case. But I’m beginning to think SourceForge’s role may now be mainly historic. Either that, or I have a heavy bias in my experience because I’ve been working with a lot of newer, “webbier” code. Maybe desktop projects still have a strong SourceForge presence.

I was looking up mailing server software, because I’m curious about mining and modding mailing lists as a research area. Fundamentally, they seem like one of the lightest weight and most robust forms of on-line community out there, and the data is super rich.

Mailing lists on SourceForge appear to be written mainly in either Java, PHP, or some blend of Python and C. There are a couple of enterprise solutions. Several of the projects have moved their source code, hosted version control, and issue tracking off of SourceForge. Though the projects were ranked from “Inactive” through “Beta” to “Mature” and “Production/Stable”, usage of the tags was a little inconsistent across projects. Projects with a lot of weekly downloads tended to be either Mature or Production/Stable or both.

I investigated Mailman in particular. It’s an impressive piece of software; who knows how many people use Mailman mailing lists? I’m probably on at least ten myself. But it’s a fairly humble project in terms of its self-presentation and what people have done with it.

Turns out it has a lead developer, Barry Warsaw, who works at Canonical, and a couple other core committers, in addition to other contributors. There appears to be a v3.0 in production, which suddenly I’m pretty excited about.

POSSE has a focus on humanitarian FOSS projects. I’m not sure exactly how they define humanitarian, beyond the syllabus’s gloss: “that is, projects for which the primary purpose is to provide some social benefit such as economic development, disaster relief, health care, ecology. Examples include Mifos, Sahana and OpenMRS.”

For the purpose of this workshop I plan to look into Ushahidi. I’ve heard so many good things about it, but frustratingly, even after working four years on open source geospatial software, including a couple crowdsourced mapping apps, I never took a solid look at Ushahidi. Maybe because it was in PHP. I’m proud to say the project I put the most effort into, GeoNode, also has a humanitarian purpose. (GeoNode also now has a very pretty new website, and a totally revamped user interface for its 2.0 release, now in alpha.) And though not precisely a software project, I’ve spent a lot of time admiring the intrepid Humanitarian OpenStreetMap Team for their use of open data as a humanitarian means and end.

But Ushahidi–there’s something you don’t see everyday.

We’re asked, on the POSSE field trip prompt, how we would decide whether our selected project was worth contributing to as an IT professional. The answer is: it depends on whether I could do it for my job. But I’ve asked around about the project and community, and it seems like great people and usable software. I’d be proud to contribute to it, though at this point I expect my comparative advantage would be on the data analysis end (both of the community that builds it and the data created by it) rather than in the core.

We were also asked to check out projects on Ohloh, which has also had a user interface revamp since I last looked carefully at it. Maybe significantly, we were asked to compare a number of different projects (two of them web browsers), but there was no feature on the website that provided a side-by-side comparison of the projects.

Also, one thing Ohloh doesn’t do yet is analytics on mailing lists. Which is odd, since that’s often where developers within a community get the most visceral sense of how large their community is (in my experience). Mailing lists wind up being the place where users, as participants, can affect software development, and where a lot of conflict resolution occurs. (Though there can be a lot of this in issue tracker discussions as well.)

This summer I hope to begin some more rigorous research into mailing list discussions and open source analytics. Seeing how Ohloh has moved forward reminds me I should be sharing my research with them. The focus of POSSE on project evaluation is encouraging–I’m curious to see where it goes next.

The recursive public as practice and imaginary

Chris Kelty’s Two Bits: The Cultural Significance of Free Software is one of the best synthetic histories of the Internet and Free Culture that I’ve encountered so far. Most exciting about it is his concept of the recursive public, the main insight of his extensive ethnographic work:

A recursive public is a public that is vitally concerned with the material and practical maintenance and modification of the technical, legal, practical, and conceptual means of its own existence as a public; it is a collective independent of other forms of constituted power and is capable of speaking to existing forms of power through the production of actually existing alternatives.

Speaking today about the book with Nick Doty and Ashwin Mathew, we found it somewhat difficult to tease out the boundaries of this concept. What publics aren’t recursive publics? And are the phenomena Kelty sometimes picks out by this concept (events in the history of Free Software) really examples of a public after all?

Just to jot down some thoughts:

  • If what makes the public is a social organization that contests other forms of institutional power (such as the state or the private sector), then there does seem to be an independence to the FOSS movement that makes the label appropriate. I believe this holds even when the organizations embodying this movement explicitly take part in state or commercial activities–as in resistance to SOPA, for example–though Ashwin seemed to think that was problematic.
  • I read recursion to refer to many aspects of this public. These include both the mutual reinforcement of its many components through time and the drive to extend its logic (e.g. the logic of open systems that originated in the IT sector in the 80’s) beyond its limits. If standards are open, then the source code should be next. If the source code is open, then the hardware is next. If the companies aren’t open, then they’re next. Etc.

I find the idea of the recursive public compelling because it labels something aspirational: a functional unit of society that is cohesive despite its internal ideological diversity. However, it can be hard to tell whether Kelty is describing what he thinks is already the case or what he aspires for it to be.

The question is whether the recursive public refers to the social imaginary of the FOSS movement or to its concrete practices (which he lists: arguing about licenses, sharing source code, conceiving of the open, and coordinating collaboration). He does brilliant work in showing how the contemporary FOSS movement is a convergence of the latter. Misusing a term of Piaget’s, I’m tempted to call this an operational synthesis, analogous to how a child’s concept of time is synthesized through action from multiple phenomenological modalities. Perhaps it’s not irresponsible to refer to the social synthesis of a unified practice from varied origins with the same term.

Naming these practices, then, is a way of making them conscious and providing the imaginary with a new understanding of its situation.

Saskia Sassen in Territory, Authority, Rights notes that in global activism, action and community organization is highly local; what is global is the imagined movement in which one participates. Manuel Castells refers to this as the power of identity in social movements; the deliberate “reprogramming of networks” (of people) with new identities is a form of communication power that can exert political change.

It’s difficult for me to read Two Bits and not suspect Kelty of deliberately proposing the idea of a recursive public as an intellectual contribution to the self-understanding of the FOSS movement in a way that is inclusive of those that vehemently deny that FOSS is a movement. By identifying a certain set of shared practices as a powerful social force with its own logic in spite of and even because of its own internal ideological cacophony (libertarian or socialist? freedom or openness? fun, profit, or justice?), he is giving people engaged in those practices a kind of class consciousness–if they read his book.

That is good, because the recursive public is only one of many powers tussling over control of the Internet, and it’s a force for justice.

on courage in the face of failure developing bluestocking

It would be easy to be discouraged by early experiments with bluestocking.

sb@lebenswelt:~/dev/bluestocking$ python factchecker.py "Courage is what makes us. Courage is what divides us. Courage is what drives us. Courage is what stops us. Courage creates news. Courage demands more. Courage creates blame. Courage brings shame. Courage shows in school. Courage determines the cool. Courage divides the weak. Courage pours out like a leak. Courage puts us on a knee. Courage makes us free. Courage makes us plea. Courage helps us flee. Corey Fauchon"
Looking up Fauchon
Lookup failed
Looking up shame
Looking up news
Looking up puts
Lookup failed
Looking up leak
Lookup failed
Looking up stops
Lookup failed
Looking up Courage
Looking up helps
Lookup failed
Looking up divides
Lookup failed
Looking up shows
Lookup failed
Looking up demands
Lookup failed
Looking up pours
Lookup failed
Looking up brings
Lookup failed
Looking up weak
Lookup failed
Looking up drives
Lookup failed
Looking up free
Looking up blame
Lookup failed
Looking up Corey
Lookup failed
Looking up plea
Lookup failed
Looking up knee
Looking up flee
Lookup failed
Looking up cool
Looking up school
Looking up determines
Lookup failed
Looking up like
Looking up us
Lookup failed
Looking up creates
Lookup failed
Looking up makes
Lookup failed
Building knowledge base
Querying knowledge base with original document
Consistency: 0
Contradictions: []
Supported: []
Novel: [(True, 'helps', 'flee'), (True, 'helps', 'us'), (True, 'determines', 'cool'), (True, 'like', 'leak'), (True, 'puts', 'knee'), (True, 'puts', 'us'), (True, 'pours', 'leak'), (True, 'pours', 'like'), (True, 'brings', 'shame'), (True, 'drives', 'us'), (True, 'stops', 'us'), (True, 'creates', 'blame'), (True, 'creates', 'news'), (True, 'Courage', 'shame'), (True, 'Courage', 'news'), (True, 'Courage', 'puts'), (True, 'Courage', 'leak'), (True, 'Courage', 'stops'), (True, 'Courage', 'helps'), (True, 'Courage', 'divides'), (True, 'Courage', 'shows'), (True, 'Courage', 'demands'), (True, 'Courage', 'pours'), (True, 'Courage', 'brings'), (True, 'Courage', 'weak'), (True, 'Courage', 'drives'), (True, 'Courage', 'free'), (True, 'Courage', 'blame'), (True, 'Courage', 'plea'), (True, 'Courage', 'knee'), (True, 'Courage', 'flee'), (True, 'Courage', 'cool'), (True, 'Courage', 'school'), (True, 'Courage', 'determines'), (True, 'Courage', 'like'), (True, 'Courage', 'us'), (True, 'Courage', 'creates'), (True, 'Courage', 'makes'), (True, 'us', 'knee'), (True, 'us', 'flee'), (True, 'us', 'plea'), (True, 'us', 'free'), (True, 'Corey', 'Fauchon'), (True, 'makes', 'plea'), (True, 'makes', 'free'), (True, 'makes', 'us'), (True, 'divides', 'weak'), (True, 'divides', 'us'), (True, 'shows', 'school')]
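For readers puzzled by those tuples: the output suggests each sentence is being reduced to ordered pairs of content words, stored as (polarity, word, word) “facts.” Here is a hypothetical sketch of that kind of naive extraction (the stopword list and pair format are my assumptions for illustration, not bluestocking’s actual parsing code):

```python
import re

# A crude stand-in for real parsing: drop a few stopwords, then emit every
# ordered pair of remaining words in a sentence as a (polarity, a, b) "fact".
STOPWORDS = {"is", "what", "a", "the", "out", "on", "in"}

def extract_pairs(text):
    facts = []
    for sentence in re.split(r"[.!?]", text):
        words = [w for w in re.findall(r"[A-Za-z]+", sentence)
                 if w.lower() not in STOPWORDS]
        for i, a in enumerate(words):
            for b in words[i + 1:]:
                facts.append((True, a, b))
    return facts

print(extract_pairs("Courage is what makes us."))
# [(True, 'Courage', 'makes'), (True, 'Courage', 'us'), (True, 'makes', 'us')]
```

With extraction this shallow, a poem about courage naturally yields a pile of “novel facts” and no supported ones, which is exactly what the run above shows.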

But, then again, our ambitions are outlandish. Nevertheless, there is a silver lining:

sb@lebenswelt:~/dev/bluestocking$ python factchecker.py "The sky is not blue."
Looking up blue
Looking up sky
Building knowledge base
Querying knowledge base with original document
Consistency: -1
Contradictions: [(True, 'sky', 'blue')]
Supported: []
Novel: []
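The silver lining is easier to see with the scoring spelled out. A minimal sketch of the kind of consistency check involved, under my own assumptions about the fact format (not the actual bluestocking code): a query fact contradicts the knowledge base when the same (subject, object) pair appears with opposite polarity, and the consistency score is supported facts minus contradictions.

```python
def check(knowledge_base, query_facts):
    """Compare query facts against a knowledge base of (polarity, a, b) triples."""
    kb = {(a, b): polarity for polarity, a, b in knowledge_base}
    contradictions, supported, novel = [], [], []
    for polarity, a, b in query_facts:
        if (a, b) not in kb:
            novel.append((polarity, a, b))
        elif kb[(a, b)] == polarity:
            supported.append((polarity, a, b))
        else:
            contradictions.append((kb[(a, b)], a, b))
    consistency = len(supported) - len(contradictions)
    return consistency, contradictions, supported, novel

# The knowledge base believes the sky is blue; the query denies it.
kb = [(True, "sky", "blue")]
query = [(False, "sky", "blue")]
print(check(kb, query))
# (-1, [(True, 'sky', 'blue')], [], [])
```

Detecting that “The sky is not blue” contradicts a belief that the sky is blue is a small thing, but it is the core loop the whole project builds on.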

Why federally funded software should be open source

Recently, open access to government funded research has gained attention and traction. Britain and Europe have both announced that they will make the research they fund open access. In the United States, a community-driven effort has pushed a White House petition to the Obama administration for a similar policy. We may be experiencing a sea change.

Perhaps on the coattails of this movement, Open Source for America has launched a petition asking for a similar policy regarding federally funded software development: share all government-developed software under an open source license.

This is a really good idea.

Unfortunately, software development and government IT procurement are so widely misunderstood that this is not likely to excite those who aren’t somehow directly affected by the issue. That is too bad, because every American stands to benefit from this sort of change. That makes it important for those of us who do understand to act.

I’ll try to illustrate why this is important with a story, or really a template of a story. This is a story told in countless cases of government software procurement:

ACRNM, a federal agency, has realized that its database management system and its user interface have not been updated since the late 90’s, because building it the last time was such a headache. It never really worked the way they wanted it to, and the vendor who built it for them has since vanished off the face of the earth. Desperate and beleaguered, ACRNM finally gets the budget together to build a new system, and puts out a bid.

Vendors that have navigated the prerequisite bureaucratic maze flock to this bid, knowing victory will be lucrative. Among them is FUBAR Enterprise Solutions. They know that whatever they build, they have a revenue stream for life. Not only does ACRNM have an enormous internal incentive to declare the new system a success to justify their budget, but they also have nobody to turn to for help with their software when it inevitably fails but FUBAR. FUBAR can continue extorting ACRNM for cash until ACRNM gives up, and the cycle continues.

What is wrong with this picture? Let’s count the problems:

  • FUBAR has ACRNM by the (pardon me, there’s really no other way to put this) balls. The term is vendor lock-in. The second ACRNM installs their system, FUBAR becomes a parasite on the government leeching taxpayer money. This is because the software is proprietary. No other company is legally allowed to fix or modify FUBAR’s proprietary system, so FUBAR faces no competition and so can charge through the nose. If the software were open source, ACRNM could turn to other contractors to repair their system, lowering total costs.
  • ACRNM has to do its work with worse software. Remember, this is a government agency that we pay taxes to for their services. With so much government activity boiling down to bureaucratic information processing, and so much innovation in software engineering and design, and so much budgetary pressure, you would think that the federal government would leap at technological innovation. But proprietary contracting causes the government to cripple itself instead.
  • Today, government agencies like ACRNM are wising up and turning to open source solutions. But it’s a slow, slow process. This is partly because FUBAR and its buddy companies, after so many years of this relationship with government, are now an entrenched lobby that will sow Fear, Uncertainty, and Doubt about open source alternatives if they can get away with it. In recent years, since open source has become more mainstream, these companies have begun admitting the viability of open source compatibility and mixed solutions. They see the writing on the wall. They will of course fight an open source purchasing mandate with everything they have.
  • Few governmental problems are unique. If ACRNM is paying for a new custom software solution, there are likely many other agencies–at the federal, state, or local level–with a similar problem. Civic Commons has already jumped on this opportunity by trying to facilitate technology reuse across city governments. If ACRNM invests in an open source solution, then other agencies can seek out that solution and adapt it to their needs, reducing government IT costs overall.
  • As we’ve discussed, open source software creates a competitive market for services. That makes an open source mandate a job creation program. Every new open technology is an opportunity for several small businesses to open. These are businesses that share the fixed costs of market entry and add value through technology consulting and custom development. Jobs customizing existing open source solutions can be well-paid with even an entry-level programming skill set, and are a good way to build a lasting career in the technology sector. Federal investment in open source software builds our national supply of technology skill faster than proprietary investment does.
  • Lastly, but certainly not least, is the possible reuse of open source technology by the private sector. Just as federally funded research contributes to growth in America’s scientific industry, federal investment in software provides a foundation for stronger tech companies. Openness in both cases expands the impact of the funding.

So, to recap: if this sort of policy passes, the winners are government employees, taxpayers, entry-level workers with a minimum of technical skills, and the tech industry in general. The losers (in the short term) are those existing companies that have the federal government locked into custom proprietary software contracts.

I want to make a point clear: I am talking specifically about new software development in this post. Purchasing licenses for existing proprietary software is a different story.

Brian Carver, professor at UC Berkeley School of Information, has offered this clarification of what an open source mandate could look like:

  1. An unambiguous policy and awareness that all software created by
    federal employees as part of their job duties is not subject to copyright
    at all and is born in the public domain, and therefore not subject to any
    license terms at all, including a FOSS license.
  2. Given 1, the federal government should either just use github/bitbucket
    or set up a similar repository to share all such federal government
    software that is in the public domain.
  3. When the federal government contracts with developers for software,
    there should be an unambiguous policy that all such software must be
    licensed under a FOSS license unless subject to a specifically-requested
    exemption (national security, military, etc.)

A central election issue is the size and role of government in the economy. Politicians on the right advocate for smaller government and a strong private sector with competitive markets. Politicians on the left advocate for government’s active investment in the economy.

Proprietary government-developed software is the worst of both worlds: inefficient government spending to create parasitic, uncompetitive companies that don’t invest their technology back into the economy. An open source mandate would give us the best of both worlds: efficient government spending that shrinks government (by easing overhead) while investing in new technology and competitive businesses.

The movement for open access to government funded research is strong and winning victories around the world. Maybe we can do the same for government funded software development.

We need help naming a software project

Speaking of computational argumentation, Dave Kush and I are starting a software project and we need a name for it.

The purpose of the software is to extract information from a large number of documents, and then merge this information together into a knowledge base. We think this could be pretty great because it would support:

  • Conflict detection and resolution. In the process of combining information from many sources into a single knowledge base, the system should be able to mark conflicts of information. That would indicate an inconsistency or controversy between the documents, which could be flagged for further investigation.
  • Naturally queryable aggregate knowledge. We anticipate being able to build a query interface that is a natural extension of this system: just run the query through the extraction process and compare the result for consistency with the knowledge base. This would make the system into a “dissonance engine,” useful for opposition research or the popping of filter bubbles.
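The conflict-detection idea above can be sketched concretely. Assuming a crude (polarity, subject, object) fact format (an assumption for illustration; the real extraction will be richer), merging per-document fact sets and flagging pairs that different sources assert with opposite polarity might look like:

```python
from collections import defaultdict

def merge(documents):
    """Merge per-document fact lists; flag (a, b) pairs asserted with
    opposite polarity by different sources."""
    seen = defaultdict(set)       # (a, b) -> set of polarities observed
    sources = defaultdict(list)   # (a, b) -> documents asserting the pair
    for name, facts in documents.items():
        for polarity, a, b in facts:
            seen[(a, b)].add(polarity)
            sources[(a, b)].append(name)
    conflicts = {pair: sources[pair] for pair, pols in seen.items()
                 if len(pols) > 1}
    return seen, conflicts

docs = {
    "doc1": [(True, "sky", "blue")],
    "doc2": [(False, "sky", "blue"), (True, "grass", "green")],
}
merged, conflicts = merge(docs)
print(conflicts)   # {('sky', 'blue'): ['doc1', 'doc2']}
```

Flagged pairs like this are exactly the inconsistencies or controversies between documents that would merit further investigation.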

I should say that neither of us knows exactly what we are doing. But Dave’s almost got his PhD in human syntax so I think we’ve got a shot at building a sweet parser. What’s more, we’ve got the will and plan. It will be open source, of course, and we’re eager for collaborators.

We have one problem:

We don’t know what to call it.

I can’t even make the GitHub account for our code until we have a good name. And until then we’ll be sending Python scripts to each other as email attachments and that will never get anywhere.

Please help us. Tell us what to name our project. If we use your name, we’ll do something awesome for you some day.

Scratch that. We’re calling it Bluestocking. The GitHub repo is here.

Academia vs. FOSS: The Good, The Bad, and the Ugly

Mel Chua has been pushing forward on the theme of FOSS culture in academia, and has gotten a lot of wonderful comments, many about why it’s not so simple to just port one culture over to the other. I want to try to compile items from Mel, comments on that post, and a few other sources. The question is: what are the salient differences between FOSS and academia?

I will proceed using the now-standard Spaghetti Western classification schema.

The Good

  • Universities tend to be more proactive about identifying and aiding newcomers who are struggling, as opposed to many FOSS projects that have high failure-and-dropout rates due to poorly designed scaffolding.
  • Academia is much more demographically inclusive. FOSS communities are notoriously imbalanced in terms of gender and race.

The Bad

  • The academic fear of having one’s results scooped or stolen leads to redundant, secretive, and lonely effort. FOSS communities get around this by having good systems for attribution of incremental progress.
  • Despite scientific ideals, academic scientific research is getting less reproducible, and therefore less robust, because of closed code and data. FOSS work is often more reproducible (though not if it’s poorly documented).
  • Closed access academic journals hold many disciplines hostage by holding a monopoly on prestige. This is changing with the push for open access research, but it is still a significant issue. FOSS communities may care about community prestige, but often that prestige comes from helpfulness or stake in a project. If metrics are used, they are often implicit ones extractable from the code repository itself, as Ohloh does. Altmetrics are a solution to this problem.
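Those implicit repository metrics are cheap to compute. A sketch of a contribution tally from git author output (the log text is inlined here for illustration; in practice it would come from something like `git log --format='%ae'`):

```python
from collections import Counter

# Hypothetical output of: git log --format='%ae'
log = """\
alice@example.org
bob@example.org
alice@example.org
alice@example.org
"""

def commits_per_author(log_text):
    """Count commits per author email, one address per line of log output."""
    return Counter(line.strip() for line in log_text.splitlines() if line.strip())

print(commits_per_author(log))
```

Unlike citation counts, this kind of metric falls out of the normal work itself, which is part of why prestige in FOSS tracks contribution so directly.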

The Ugly

  • In both FOSS and academia, a community of collaborators needs to form around shared interests and skills. But FOSS has come to exemplify the power of distributed collaboration towards pragmatic goals. One is judged more by one’s contributions than by one’s academic pedigree, which means that FOSS does not have as much institutional gatekeeping.
  • Tenure committees look at papers published, not software developed. So there is little incentive for making robust software as part of the research process, however much that might allow reproducibility and encourage collaboration.
  • Since academics are often focused on “the frontier”, they don’t pay much attention to “building blocks”. Academic research culture tends to encourage this because it’s a race for discovery. FOSS regards care of the building blocks as a virtue and rewards the effort with stronger communities built on top of those blocks.
  • One reason for the difference between academia and FOSS is bandwidth. Since publications have page limits and are also the main means of academic communication, one wants to dedicate as much space as possible to juicy results at the expense of process documentation that would aid reproducibility. Since FOSS developed using digital communication tools with fewer constraints, it doesn’t have this problem. But academia doesn’t yet value contributions to this amorphous digital wealth of knowledge.

Have I left anything out?