Category: open source software

Reinventing wheels with Dissertron

I’ve found a vehicle for working on the Dissertron through the website of the course I’ll be co-teaching this Fall on Open Collaboration and Peer Production.

In the end, I went with Pelican, not Hyde. It was a difficult decision, because I did like the idea of supporting an open source project coming out of Chennai, especially after reading Coding Places. But I had it on good authority that Pelican was more featureful with cleaner code. So here I am.

The features I have in mind are crystalizing as I explore the landscape of existing tools more. This is my new list:

  • Automatically include academic metadata on each Dissertron page so it’s easy to slurp it into Zotero.
  • Include the Hypothes.is widget for annotations. I think Hypothes.is will be better for commenting than Disqus because it does annotations in-line, as opposed to comments in the footer. It also uses the emerging W3C Open Annotation standard. I’d like this to be as standards-based as possible.
  • Use citeproc-js to render citations in the browser cleanly. I think this handles the issue of in-line linked academic citations without requiring a lot of manual work. The citeproc-js looks like it’s come out of Zotero as well. Since Elsevier bought Mendeley, Zotero seems like the more reliable ally to pick for independent scholarship.
  • Trickiest is going to be porting a lot of features from jekyll-scholar into a Pelican plug-in. I really want jekyll-scholar’s bibliographic management. But I’m a little worried that Pelican isn’t well-designed for that sort of flexibility in theming. More soon.
  • I’m interested in trying to get the HTML output of Dissertron as close as possible to emerging de facto standards on what on-line scholarship is like. I’ve asked about what PLOS ONE does about this. The answer sounds way complicated: a tool chain that goes from LaTeX to Word Docs to NLM 3.0 XML (which I didn’t even know was a thing), and at last into HTML. I’m trying to start from Markdown because I think it’s a simple markup language for the future, but I’m not deep enough in that tool chain to understand how to replicate its idiosyncrasies.
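
The first item on this list can be sketched in a few lines. This is only a rough illustration, assuming the Highwire Press style `citation_*` meta tags that reference managers commonly read from a page header; the function name and the sample values are hypothetical, and nothing here is taken from Pelican or Zotero code.

```python
from html import escape


def academic_meta_tags(title, authors, date):
    """Return HTML <meta> tags in the Highwire Press 'citation_*' style.

    Assumption: the page template drops this string into <head>.
    """
    pairs = [("citation_title", title)]
    # One citation_author tag per author, in order.
    pairs += [("citation_author", name) for name in authors]
    pairs.append(("citation_publication_date", date))
    return "\n".join(
        '<meta name="{}" content="{}">'.format(name, escape(value, quote=True))
        for name, value in pairs
    )


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    print(academic_meta_tags("An Example Chapter", ["Example Author"], "2013/01/01"))
```

A static site generator theme could call something like this once per page, so the metadata stays in sync with the source file's own header fields.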

If I could have all these nice things, and maybe a pony, then I would be happy and have no more excuses for not actually doing research, as opposed to obsessing about the tooling around independent publishing.

Planning the Dissertron

In my PhD program, I’ve recently finished my coursework and am meant to start focusing on research for my dissertation. Maybe because of the hubbub around open access research, maybe because I still see myself as a ‘hacker’, maybe because it’s somehow recursively tied into my research agenda, or because I’m an open source dogmatic, I’ve been fantasizing about the tools and technology of publication that I want to work on my dissertation with.

For this project, which I call the Dissertron, I’ve got a loose bundle of requirements feature creeping its way into outer space:

  1. Incremental publishing of research and scholarship results openly to the web.
  2. Version control.
  3. Mathematical rendering a la LaTeX.
  4. Code highlighting a la the hacker blogs.
  5. In browser rendering of data visualizations with d3, where appropriate.
  6. Site is statically generated from elements on the file system, wherever possible.
  7. Machine readable metadata on the logical structure of the dissertation argument, which gets translated into static site navigation elements.
  8. Easily generated glossary with links for looking up difficult terms in-line (or maybe in-margin)
  9. A citation system that takes advantage of hyperlinking between resources wherever possible.
  10. Somehow, enable commenting. But more along the lines of marginalia comments (comments on particular lines or fragments of text) rather than blog comments. “Blog” style comments should be facilitated as notes on separately hosted dissertrons, or maybe a dissertron hub that aggregates and coordinates pollination of content between dissertrons.

This is a lot, and arguably just a huge distraction from working on my dissertation. However, it seems like this or something like it is a necessary next step in the advance of science and I don’t see how I really have much choice in the matter.

Unfortunately, I’m traveling, so I’m going to miss the PLOS workshop on Markdown for Science tomorrow. That’s really too bad, because Scholarly Markdown would get me maybe 50% of the way to what I want.

Right now the best tool chain I can imagine for this involves Scholarly Markdown, run using Pandoc, which I just now figured out is developed by a philosophy professor at Berkeley. Backing it by a Git repository would allow for incremental changes and version control.
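
As a rough sketch of that tool chain, assuming Pandoc is installed and the chapter lives in a Markdown file tracked by Git, a small Python wrapper could look like the following. The file names are hypothetical, plain Pandoc Markdown stands in for Scholarly Markdown, and the Git step is left to the usual commit workflow.

```python
import subprocess


def pandoc_command(source, target):
    """Build a pandoc invocation: Markdown in, standalone HTML with MathJax out."""
    return [
        "pandoc", source,
        "--from", "markdown",
        "--to", "html",
        "--standalone",   # emit a complete HTML document, not a fragment
        "--mathjax",      # render LaTeX math in the browser via MathJax
        "--output", target,
    ]


def build(source, target):
    """Run pandoc; raises CalledProcessError if the conversion fails."""
    subprocess.run(pandoc_command(source, target), check=True)
```

Calling `build("chapter.md", "chapter.html")` after each commit would give incremental publishing, with the repository history serving as the version record.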

Static site generation and hosting is a bit trickier. I feel like GitHub’s support of Jekyll makes it a compelling choice, but hacking it to make it fit into the academic frame I’m thinking in might be more trouble than it’s worth. While it’s a bit of an oversimplification to say this, my impression is that at my university at least there is a growing movement to adopt Python as the programming language of choice for scientific computing. The exceptions seem to be people in the Computer Science department that are backing Scala.

(I like both languages and so can’t complain, except that it makes it harder to do interdisciplinary research if there is a technical barrier in their toolsets. As more of scientific research becomes automated, it is bound to get more crucial that scientific processes (broadly speaking) inter-operate. I’m incidentally excited to be working on these problems this summer for Berkeley’s new Social Science Data Lab. A lot of interesting architectural design is being masterminded by Aaron Culich, who manages the EECS department’s computing infrastructure. I’ve been meaning to blog about our last meeting for a while…but I digress)

Problem is, neither Python nor Scala is Ruby, and Ruby is currently leading the game (in my estimate, somebody tell me if I’m wrong) in flexible and sexy smooth usable web design. And then there’s JavaScript, improbably leaking into the back end of the software stack after overflowing the client side.

So for the aspiring open access indie web hipster hacker science self-publisher, it’s hard to navigate the technical terrain. I’m tempted to string together my own rig depending mostly on Pandoc, but even that’s written in Haskell.

These implementation-level problems suggest that the problem needs to be pushed up a level of abstraction to the question of API and syntax standards around scientific web publishing. Scholarly Markdown can be a standard, hopefully with multiple implementations. Maybe there needs to be a standard around web citations as well (since in an open access world, we don’t need the same level of indirection between a document and the works it cites; like blog posts, web publications can link directly to the content they derive from).

POSSE homework: how to contribute to FOSS without coding

One of the assignments for the POSSE workshop is the question of how to contribute to FOSS when you aren’t a coder.

I find this an especially interesting topic because I think there’s a broader political significance to FOSS, but those that see FOSS as merely the domain of esoteric engineers can sometimes be a little freaked out by this idea. It also involves broader theoretical questions about whether or how open source jibes with participatory design.

In fact, they have compiled a list of lists of ways to contribute to FOSS without coding: this, this, this, and this are provided in the POSSE syllabus.

Turning our attention from the question in the abstract, we’re meant to think about it in the context of our particular practices.

For our humanitarian FOSS project of choice, how are we interested in contributing? I’m fairly focused in my interests on open source participation these days: I’m very interested in the problem of community metrics and especially how innovation happens and diffuses within these communities. I would like to be able to build a system for evaluating that kind of thing that can be applied broadly to many projects. Ideally, it could do things like identify talented participants across multiple projects, or suggest interventions for making projects work better.

It’s an ambitious research project, but one for which there is plenty of data to investigate from the open source communities themselves.

What about teaching a course on such a thing? I anticipate that my students are likely to be interested in design as well as positioning their own projects within the larger open source ecosystem. Some of the people who I hope will take the class have been working on FuturePress, an open source e-book reading platform. As they grow the project and build the organization around it, they will want to be working with constituent technologies and devising a business model around their work. How can a course on Open Collaboration and Peer Production support that?

These concerns touch on so many issues outside of the consideration of software engineering narrowly (including industrial organization, communication, social network theory…) that it’s daunting to try to fit it all into one syllabus. But we’ve been working on one that has a significant hands-on component as well. Really I think the most valuable skill in the FOSS world is having the chutzpah to approach a digital community, propose what you are thinking, and take the criticism or responsibility that comes with that.

What concrete contribution a student uses to channel that energy should…well, I feel like it should be up to them. But is that enough direction? Maybe I’m not thinking concretely enough for this assignment myself.

POSSE and the FOSS field trip

I’m excited to be participating in POSSE (Professors’ Open Source Software Experience) this coming weekend in Philadelphia. It’s designed to train computer science professors how to use participation in open source communities to teach computer science. Somehow, they let me in as a grad student.

My goals are somewhat unusual for the program, I imagine. I’m not even in a computer science department. But I do have background in open source development, and this summer I’ll be co-teaching a course on Open Collaboration and Peer Production at Berkeley’s I School. We aren’t expecting all the students to be coders, so though there is a hands-on component, it may be with other open collaborative projects like Wikipedia or OpenStreetMap. Though we aren’t filling a technical requirement (the class is aimed at Masters students), it will fulfill a management requirement. So you might say it’s a class on theory and practice of open collaborative community management.

My other education interest besides my own teaching is in the role of open source and computer science education in the larger economy. Lately there have been a lot of startups around programmer education. That makes sense, because demand for programming talent exceeds supply now, and so there’s an opportunity to train new developers. I’m curious whether it would be possible to build and market an on-line, MOOC-style programming course based on apprenticeship within open source communities.

One of our assignments to complete before the workshop is an open source community field trip. We’re supposed to check out a number of projects and get a sense of how to distinguish between strong and weak projects at a glance.

The first thing I noticed was that SourceForge is not keeping up with web user experience standards. That’s not so surprising, since as a FOSS practitioner I’m more used to projects hosted on GitHub and Google Code. Obviously, that hasn’t always been the case. But I’m beginning to think SourceForge’s role may now be mainly historic. Either that, or I have a heavy bias in my experience because I’ve been working with a lot of newer, “webbier” code. Maybe desktop projects still have a strong SourceForge presence.

I was looking up mailing server software, because I’m curious about mining and modding mailing lists as a research area. Fundamentally, they seem like one of the lightest weight and most robust forms of on-line community out there, and the data is super rich.

Mailing lists on SourceForge appear to be mainly in either Java, PHP, or some blend of Python and C. There are a couple Enterprise solutions. Several of the projects have moved their source code, hosted version control, and issue tracking off of SourceForge. Though the projects were ranked from “Inactive” through “Beta” to “Mature” and “Production/Stable”, usage of the tags was a little inconsistent across projects. Projects with a lot of weekly downloads tended to be either Mature or Production/Stable or both.

I investigated Mailman in particular. It’s an impressive piece of software; who knows how many people use Mailman mailing lists? I’m probably on at least ten myself. But it’s a fairly humble project in terms of its self-presentation and what people have done with it.

Turns out it has a lead developer, Barry Warsaw, who works at Canonical, and a couple other core committers, in addition to other contributors. There appears to be a v3.0 in production, which suddenly I’m pretty excited about.

POSSE has a focus on humanitarian FOSS projects. I’m not sure exactly how they define humanitarian, but the description they give is: “that is, projects for which the primary purpose is to provide some social benefit such as economic development, disaster relief, health care, ecology. Examples include Mifos, Sahana and OpenMRS.”

For the purpose of this workshop I plan to look into Ushahidi. I’ve heard so many good things about it, but frustratingly even after working four years on open source geospatial software, including a couple crowdsourced mapping apps, I never took a solid look at Ushahidi. Maybe because it was in PHP. I’m proud to say the project I put the most effort into, GeoNode, also has a humanitarian purpose. (GeoNode also now has a very pretty new website, and totally revamped user interface for its 2.0 release, now in alpha.) And though not precisely a software project, I’ve spent a lot of time admiring the intrepid Humanitarian OpenStreetMap Team for their use of open data as a humanitarian means and end.

But Ushahidi–there’s something you don’t see everyday.

We’re asked, on the POSSE field trip prompt, how we would decide whether our selected project was worth contributing to as an IT professional. The answer is: it depends on if I could do it for my job, but I’ve asked around about the project and community some and it seems like great people and usable software. I’d be proud to contribute to it, though at this point I expect my comparative advantage would be on the data analysis end (both of the community that builds it and data created by it) rather than to the core.

We were also asked to check out projects on Ohloh, which has also had a user interface revamp since I last looked carefully at it. Maybe significantly, we were asked to compare a number of different projects (two of them web browsers), but there was no feature on the website that provided a side-by-side comparison of the projects.

Also, one thing Ohloh doesn’t do yet is analytics on mailing lists. Which is odd, since that’s often where developers within a community get the most visceral sense of how large their community is (in my experience). Mailing lists wind up being the place where users can, as participants, affect software development, and where a lot of conflict resolution occurs. (Though there can be a lot of this on issue tracker discussions as well.)
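
To give a sense of how little code a first pass at mailing list analytics needs, here is a minimal sketch using only the Python standard library. It assumes the list archive is available as an mbox file (the file name below is hypothetical) and counts posts per sender address; it is an illustration, not a description of anything Ohloh or Mailman provides.

```python
from collections import Counter
from email.message import Message
from email.utils import parseaddr


def posts_per_sender(messages):
    """Count messages per sender address (lower-cased), skipping blank senders."""
    counts = Counter()
    for message in messages:
        # parseaddr splits "Name <addr>" into (name, addr).
        _, address = parseaddr(message.get("From", ""))
        if address:
            counts[address.lower()] += 1
    return counts


if __name__ == "__main__":
    # In-memory stand-ins for archive messages; the addresses are made up.
    demo = []
    for sender in ["Ann <ann@example.org>", "ann@example.org", "Bo <bo@example.org>"]:
        msg = Message()
        msg["From"] = sender
        demo.append(msg)
    print(posts_per_sender(demo).most_common())
```

For a real archive, `posts_per_sender(mailbox.mbox("list-archive.mbox"))` should work the same way after `import mailbox`, since iterating an mbox yields message objects.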

This summer I hope to begin some more rigorous research into mailing list discussions and open source analytics. Seeing how Ohloh has moved forward reminds me I should be sharing my research with them. The focus of POSSE on project evaluation is encouraging–I’m curious to see where it goes next.

The recursive public as practice and imaginary

Chris Kelty’s Two Bits: The Cultural Significance of Free Software is one of the best synthetic histories of the Internet and Free Culture that I’ve encountered so far. Most exciting about it is his concept of the recursive public, the main insight of his extensive ethnographic work:

A recursive public is a public that is vitally concerned with the material and practical maintenance and modification of the technical, legal, practical, and conceptual means of its own existence as a public; it is a collective independent of other forms of constituted power and is capable of speaking to existing forms of power through the production of actually existing alternatives.

Speaking today about the book with Nick Doty and Ashwin Mathew, we found it somewhat difficult to tease out the boundaries of this concept. What publics aren’t recursive publics? And are the phenomena Kelty sometimes picks out by this concept (events in the history of Free Software) really examples of a public after all?

Just to jot down some thoughts:

  • If what makes the public is a social organization that contests other forms of institutional power (such as the state or the private sector), then there does seem to be an independence to the FOSS movement that makes the label appropriate. I believe this holds even when the organizations embodying this movement explicitly take part in state or commercial activities–as in resistance to SOPA, for example–though Ashwin seemed to think that was problematic.
  • I read recursion to refer to many aspects of this public. These include both the mutual reinforcement of its many components through time and the drive to extend its logic (e.g. the logic of open systems that originated in the IT sector in the 80’s) beyond its limits. If standards are open, then the source code should be next. If the source code is open, then the hardware is next. If the companies aren’t open, then they’re next. Etc.

I find the idea of the recursive public compelling because it labels something aspirational: a functional unit of society that is cohesive despite its internal ideological diversity. However, it can be hard to tell whether Kelty is describing what he thinks is already the case or what he aspires for it to be.

The question is whether the recursive public is referring to the social imaginary of the FOSS movement or its concrete practices (which he lists: arguing about license, sharing source code, conceiving of the open, and coordinating collaboration). He does brilliant work in showing how the contemporary FOSS movement is a convergence of the latter. Misusing a term of Piaget’s, I’m tempted to call this an operational synthesis, analogous to how a child’s concept of time is synthesized through action from multiple phenomenological modalities. Perhaps it’s not irresponsible to refer to the social synthesis of a unified practice from varied origins with the same term.

Naming these practices, then, is a way of making them conscious and providing the imaginary with a new understanding of its situation.

Saskia Sassen in Territory, Authority, Rights notes that in global activism, action and community organization is highly local; what is global is the imagined movement in which one participates. Manuel Castells refers to this as the power of identity in social movements; the deliberate “reprogramming of networks” (of people) with new identities is a form of communication power that can exert political change.

It’s difficult for me to read Two Bits and not suspect Kelty of deliberately proposing the idea of a recursive public as an intellectual contribution to the self-understanding of the FOSS movement in a way that is inclusive of those that vehemently deny that FOSS is a movement. By identifying a certain set of shared practices as a powerful social force with its own logic in spite of and even because of its own internal ideological cacophony (libertarian or socialist? freedom or openness? fun, profit, or justice?), he is giving people engaged in those practices a kind of class consciousness–if they read his book.

That is good, because the recursive public is only one of many powers tussling over control of the Internet, and it’s a force for justice.

on courage in the face of failure developing bluestocking

It would be easy to be discouraged by early experiments with bluestocking.

sb@lebenswelt:~/dev/bluestocking$ python factchecker.py "Courage is what makes us. Courage is what divides us. Courage is what drives us. Courage is what stops us. Courage creates news. Courage demands more. Courage creates blame. Courage brings shame. Courage shows in school. Courage determines the cool. Courage divides the weak. Courage pours out like a leak. Courage puts us on a knee. Courage makes us free. Courage makes us plea. Courage helps us flee. Corey Fauchon"
Looking up Fauchon
Lookup failed
Looking up shame
Looking up news
Looking up puts
Lookup failed
Looking up leak
Lookup failed
Looking up stops
Lookup failed
Looking up Courage
Looking up helps
Lookup failed
Looking up divides
Lookup failed
Looking up shows
Lookup failed
Looking up demands
Lookup failed
Looking up pours
Lookup failed
Looking up brings
Lookup failed
Looking up weak
Lookup failed
Looking up drives
Lookup failed
Looking up free
Looking up blame
Lookup failed
Looking up Corey
Lookup failed
Looking up plea
Lookup failed
Looking up knee
Looking up flee
Lookup failed
Looking up cool
Looking up school
Looking up determines
Lookup failed
Looking up like
Looking up us
Lookup failed
Looking up creates
Lookup failed
Looking up makes
Lookup failed
Building knowledge base
Querying knowledge base with original document
Consistency: 0
Contradictions: []
Supported: []
Novel: [(True, 'helps', 'flee'), (True, 'helps', 'us'), (True, 'determines', 'cool'), (True, 'like', 'leak'), (True, 'puts', 'knee'), (True, 'puts', 'us'), (True, 'pours', 'leak'), (True, 'pours', 'like'), (True, 'brings', 'shame'), (True, 'drives', 'us'), (True, 'stops', 'us'), (True, 'creates', 'blame'), (True, 'creates', 'news'), (True, 'Courage', 'shame'), (True, 'Courage', 'news'), (True, 'Courage', 'puts'), (True, 'Courage', 'leak'), (True, 'Courage', 'stops'), (True, 'Courage', 'helps'), (True, 'Courage', 'divides'), (True, 'Courage', 'shows'), (True, 'Courage', 'demands'), (True, 'Courage', 'pours'), (True, 'Courage', 'brings'), (True, 'Courage', 'weak'), (True, 'Courage', 'drives'), (True, 'Courage', 'free'), (True, 'Courage', 'blame'), (True, 'Courage', 'plea'), (True, 'Courage', 'knee'), (True, 'Courage', 'flee'), (True, 'Courage', 'cool'), (True, 'Courage', 'school'), (True, 'Courage', 'determines'), (True, 'Courage', 'like'), (True, 'Courage', 'us'), (True, 'Courage', 'creates'), (True, 'Courage', 'makes'), (True, 'us', 'knee'), (True, 'us', 'flee'), (True, 'us', 'plea'), (True, 'us', 'free'), (True, 'Corey', 'Fauchon'), (True, 'makes', 'plea'), (True, 'makes', 'free'), (True, 'makes', 'us'), (True, 'divides', 'weak'), (True, 'divides', 'us'), (True, 'shows', 'school')]

But, then again, our ambitions are outlandish. Nevertheless, there is a silver lining:

sb@lebenswelt:~/dev/bluestocking$ python factchecker.py "The sky is not blue."
Looking up blue
Looking up sky
Building knowledge base
Querying knowledge base with original document
Consistency: -1
Contradictions: [(True, 'sky', 'blue')]
Supported: []
Novel: []
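
One way to read the two runs above is as a comparison between claims extracted from the input and facts already in the knowledge base. The sketch below is a guess at that logic, not bluestocking’s actual code: it assumes each fact is a (polarity, subject, object) triple as printed above, and that the consistency score is supported claims minus contradictions. The function name is hypothetical.

```python
def check(claims, knowledge_base):
    """Sort claims into supported, contradicted, and novel against a set of known facts."""
    supported, contradictions, novel = [], [], []
    for polarity, subject, obj in claims:
        if (polarity, subject, obj) in knowledge_base:
            supported.append((polarity, subject, obj))
        elif (not polarity, subject, obj) in knowledge_base:
            # Record the known fact that the claim conflicts with.
            contradictions.append((not polarity, subject, obj))
        else:
            novel.append((polarity, subject, obj))
    # Assumed scoring: +1 per supported claim, -1 per contradiction.
    consistency = len(supported) - len(contradictions)
    return consistency, contradictions, supported, novel


if __name__ == "__main__":
    kb = {(True, "sky", "blue")}
    # "The sky is not blue." becomes a negative claim about sky and blue.
    print(check([(False, "sky", "blue")], kb))
    # → (-1, [(True, 'sky', 'blue')], [], [])
```

Under these assumptions, a document whose lookups all fail simply yields a list of novel triples and a score of 0, which matches the first run.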

Why federally funded software should be open source

Recently, open access to government funded research has gained attention and traction. Britain and Europe have both announced that they will make research they fund open access. In the United States, a community-driven effort has pushed a White House petition to the Obama administration for a similar policy. We may be experiencing a sea change.

Perhaps on the coattails of this movement, Open Source for America has launched a petition asking for a similar policy regarding federally funded software development: share all government-developed software under an open source license.

This is a really good idea.

Unfortunately, software development and government IT procurement are so misunderstood that this is not likely to excite those who aren’t somehow directly affected by the issue. That is too bad, because every American stands to benefit from this sort of change. That makes it important for those of us who do understand to act.

I’ll try to illustrate why this is important with a story, or really a template of a story. This is a story told in countless cases of government software procurement:

ACRNM, a federal agency, has realized that its database management system and its user interface have not been updated since the late 90’s, because building it the last time was such a headache. It never really worked the way they wanted it, and the vendor who built it for them has since vanished off the face of the earth. Desperate and beleaguered, ACRNM finally gets the budget together to build a new system, and puts out a bid.

Vendors that have navigated the prerequisite bureaucratic maze flock to this bid, knowing victory will be lucrative. Among them is FUBAR Enterprise Solutions. They know that whatever they build, they have a revenue stream for life. Not only does ACRNM have an enormous internal incentive to declare the new system a success to justify their budget, but they also have nobody to turn to for help with their software when it inevitably fails but FUBAR. FUBAR can continue extorting ACRNM for cash until ACRNM gives up, and the cycle continues.

What is wrong with this picture? Let’s count the problems:

  • FUBAR has ACRNM by the (pardon me, there’s really no other way to put this) balls. The term is vendor lock-in. The second ACRNM installs their system, FUBAR becomes a parasite on the government leeching taxpayer money. This is because the software is proprietary. No other company is legally allowed to fix or modify FUBAR’s proprietary system, so FUBAR faces no competition and so can charge through the nose. If the software were open source, ACRNM could turn to other contractors to repair their system, lowering total costs.
  • ACRNM has to do its work with worse software. Remember, this is a government agency that we pay taxes to for their services. With so much government activity boiling down to bureaucratic information processing, and so much innovation in software engineering and design, and so much budgetary pressure, you would think that the federal government would leap at technological innovation. But proprietary contracting causes the government to cripple itself at a tipping point.
  • Today, government agencies like ACRNM are wising up and turning to open source solutions. But it’s a slow, slow process. This is partly because FUBAR and its buddy companies, after so many years of this relationship with government, are now an entrenched lobby that will sow Fear, Uncertainty, and Doubt about open source alternatives if they can get away with it. In recent years, since open source has become more mainstream, these companies are admitting the viability of open source compatibility and mixed solutions. They see the writing on the wall. They will of course fight an open source purchasing mandate with everything they have.
  • Few governmental problems are unique. If ACRNM is paying for a new custom software solution, there are likely many other agencies–at federal, state, or local level–with a similar problem. Civic Commons has already jumped on this opportunity by trying to facilitate technology reuse across city governments. If ACRNM invests in an open source solution, then other agencies can seek out that solution and adapt it to their needs, reducing government IT costs overall.
  • As we’ve discussed, open source software creates a competitive market for services. That makes an open source mandate a job creation program. Every new open technology is an opportunity for several small businesses to open. These are businesses that share fixed costs to market entry and add value through technologist consulting and custom development. Jobs customizing existing open source solutions can be well-paid with even an entry-level programming skill set, and are a good way to build a lasting career in the technology sector. Federal investment in open source software builds our national supply of technology skill faster than proprietary investment.
  • Lastly, but certainly not least, is the possible reuse of open source technology by the private sector. Just as federally funded research contributes to growth in America’s scientific industry, federal investment in software provides a foundation for stronger tech companies. Openness in both cases expands the impact of the funding.

So, to recap: if this sort of policy passes, the winners are government employees, taxpayers, entry-level workers with a minimum of technical skills, and the tech industry in general. The losers (in the short term) are those existing companies that have the federal government locked into custom proprietary software contracts.

I want to make a point clear: I am talking specifically about new software development in this post. Purchasing licenses for existing proprietary software is a different story.

Brian Carver, professor at UC Berkeley School of Information, has offered this clarification of what an open source mandate could look like:

  1. An unambiguous policy and awareness that all software created by
    federal employees as part of their job duties is not subject to copyright
    at all and is born in the public domain, and therefore not subject to any
    license terms at all, including a FOSS license.
  2. Given 1, the federal government should either just use github/bitbucket
    or set up a similar repository to share all such federal government
    software that is in the public domain.
  3. When the federal government contracts with developers for software,
    there should be an unambiguous policy that all such software must be
    licensed under a FOSS license unless subject to a specifically-requested
    exemption (national security, military, etc.)

A central election issue is the size and role of government in the economy. Politicians on the right advocate for smaller government and a strong private sector with competitive markets. Politicians on the left advocate for government’s active investment in the economy.

Proprietary government-developed software is the worst of both worlds: inefficient government spending to create parasitic, uncompetitive companies that don’t invest their technology back into the economy. An open source mandate would give us the best of both worlds: efficient government spending that shrinks government (by easing overhead) while investing in new technology and competitive businesses.

The movement for open access to government funded research is strong and winning victories around the world. Maybe we can do the same for government funded software development.

We need help naming a software project

Speaking of computational argumentation, Dave Kush and I are starting a software project and we need a name for it.

The purpose of the software is to extract information from a large number of documents, and then merge this information together into a knowledge base. We think this could be pretty great because it would support:

  • Conflict detection and resolution. In the process of combining information from many sources into a single knowledge base, the system should be able to mark conflicts of information. That would indicate an inconsistency or controversy between the documents, which could be flagged for further investigation.
  • Naturally queryable aggregate knowledge. We anticipate being able to build a query interface that is a natural extension of this system: just run the query through the extraction process and compare the result for consistency with the knowledge base. This would make the system into a “dissonance engine,” useful for opposition research or the popping of filter bubbles.

I should say that neither of us knows exactly what we are doing. But Dave’s almost got his PhD in human syntax, so I think we’ve got a shot at building a sweet parser. What’s more, we’ve got the will and a plan. It will be open source, of course, and we’re eager for collaborators.

We have one problem:

We don’t know what to call it.

I can’t even make the GitHub account for our code until we have a good name. And until then we’ll be sending Python scripts to each other as email attachments and that will never get anywhere.

Please help us. Tell us what to name our project. If we use your name, we’ll do something awesome for you some day.

Scratch that. We’re calling it Bluestocking. The GitHub repo is here.

Academia vs. FOSS: The Good, The Bad, and the Ugly

Mel Chua has been pushing forward on the theme of FOSS culture in academia, and has gotten a lot of wonderful comments, many about why it’s not so simple to just port one culture over to the other. I want to try to compile items from Mel, comments on that post, and a few other sources. The question is: what are the salient differences between FOSS and academia?

I will proceed using the now-standard Spaghetti Western classification schema.

The Good

  • Universities tend to be more proactive about identifying and aiding newcomers that are struggling, as opposed to many FOSS projects that have high failure-and-dropout rates due to poorly designed scaffolding.
  • Academia is much more demographically inclusive. FOSS communities are notoriously imbalanced in terms of gender and race.

The Bad

  • The academic fear of having one’s results scooped or stolen leads to redundancy, secrecy, and lonely effort. FOSS communities get around this by having good systems for attribution of incremental progress.
  • Despite scientific ideals, academic scientific research is getting less reproducible, and therefore less robust, because of closed code and data. FOSS work is often more reproducible (though not if it’s poorly documented).
  • Closed access academic journals hold many disciplines hostage by holding a monopoly on prestige. This is changing with the push for open access research, but this is still a significant issue. FOSS communities may care about community prestige, but often that prestige comes from community helpfulness or stake in a project. If metrics are used, they are often implicit ones extractable from the code repository itself, like Ohloh. Altmetrics are a solution to this problem.

The Ugly

  • In both FOSS and academia, a community of collaborators needs to form around shared interests and skills. But FOSS has come to exemplify the power of distributed collaboration towards pragmatic goals. One is judged more by one’s contributions than by one’s academic pedigree, which means that FOSS does not have as much institutional gatekeeping.
  • Tenure committees look at papers published, not software developed. So there is little incentive for making robust software as part of the research process, however much that might allow reproducibility and encourage collaboration.
  • Since academics are often focused on “the frontier”, they don’t pay much attention to “building blocks”. Academic research culture tends to encourage this because it’s a race for discovery. FOSS regards care of the building blocks as a virtue and rewards the effort with stronger communities built on top of those blocks.
  • One reason for the difference between academia and FOSS is bandwidth. Since publications have page limits and are also the main means of academic communication, one wants to dedicate as much space as possible to juicy results at the expense of process documentation that would aid reproducibility. Since FOSS developed using digital communication tools with fewer constraints, it doesn’t have this problem. But academia doesn’t yet value contributions to this amorphous digital wealth of knowledge.

Have I left anything out?

Don’t use Venn diagrams like this

Today I saw this whitepaper by Esri about their use of open source software. It’s old, but still kept my attention.

There are several reasons why this paper is interesting. One is that it reflects the trend of companies that once used FUD tactics around open source software now singing a soothing song of compatibilism. It makes an admirable effort to explain the differences between open source, proprietary software, and open standards to its enterprise client audience. That is the good news.

The bad news is that since this new compatibilism is just bending to market pressure after the rise of successful open source software complements, it lacks an understanding of why the open source development process has caused those market successes. Of course, proprietary companies have good reason to blur these lines, because otherwise they would need to acknowledge the existence of open source substitutes. In Esri’s case, that would mean products like the OpenGeo Suite.

I probably wouldn’t have written this post if it were not for this Venn diagram, which is presented with the caption A hybrid relationship:

I don’t think there is a way to interpret this diagram that makes sense. It correctly identifies that Closed Source, Open Source, and Open Standards are different. But what do the overlapping regions represent? Presumably they are meant to indicate that a system may both be open source and use open standards, or have open standards and be closed, or…be both open and closed?

It’s a subtle point, but the semantics of set containment implied by the Venn diagram really don’t apply here. A system that’s a ‘hybrid’ between closed and open software is not “both” closed and open in the same way that closed software using open standards is “both” closed and open. Rather, the hybrid system is just that, a hybrid, which means that its architecture is going to suffer tradeoffs as different components have different properties.

I don’t think that the author of this whitepaper was trying to deliberately obscure this idea. But I think that they didn’t know or care about it. That’s a problem, because it’s marketing material like this that clouds the picture about the value of open source. At a pointy-haired managerial level, one can answer the question “why aren’t you using more open source software” with a glib, “oh, we’re using a hybrid model, tailored to our needs.” But unless you actually understand what you’re talking about, your technical stack may still be full of buggy and unaccountable software, without you even knowing it.