open source software | Digifesto

February 22, 2015

Hirschman, Nigerian railroads, and poor open source user interfaces

Hirschman says he got the idea for Exit, Voice, and Loyalty when studying the failure of the Nigerian railroad system to improve quality despite the availability of trucking as a substitute for long-range shipping. Conventional wisdom among economists at the time was that the quality of a good would suffer when it was provisioned by a monopoly. But why would a business that faced healthy competition not undergo the management changes needed to improve quality?

Hirschman’s answer is that because the trucking option was so readily available as an alternative, there wasn’t a need for consumers to develop their capacity for voice. The railroads weren’t hearing the complaints about their service, they were just seeing a decline in use as their customers exited. Meanwhile, because it was a monopoly, loss in revenue wasn’t “of utmost gravity” to the railway managers either.

The upshot of this is that it’s only when customers are locked in that voice plays a critical role in the recuperation mechanism.

This is interesting for me because I’m interested in the role of lock-in in software development. In particular, one argument made in favor of open source software is that because it is not technology held by a single firm, users of the software are not locked-in. Their switching costs are reduced, making the market more liquid and, in theory favorable.

You can contrast this with proprietary enterprise software, where vendor lock-in is a principle part of the business model as this establishes the “installed base” and customer support armies are necessary for managing disgruntled customer voice. Or, in the case of social media such as Facebook, network effects create a kind of perceived consumer lock-in and consumer voice gets articulated by everybody from Twitter activists to journalists to high-profile academics.

As much as it pains me to admit it, this is one good explanation for why the user interfaces of a lot of open source software projects are so bad specifically if you combine this mechanism with the idea that user-centered design is important for user interfaces. Open source projects generally make it easy to complain about the software. If they know what they are doing at all, they make it clear how to engage the developers as a user. There is a kind of rumor out there that open source developers are unfriendly towards users and this is perhaps true when users are used to the kind of customer support that’s available on a product for which there is customer lock-in. It’s precisely this difference between exit culture and voice culture, driven by the fundamental economics of the industry, that creates this perception. Enterprise open source business models (I’m thinking about models like the Pentaho ‘beekeeper’) theoretically provide a corrective to this by being an intermediary between consumer voice and developer exit.

A testable hypothesis is whether and to what extent a software project’s responsiveness to tickets scales with the number of downstream dependent projects. In software development, technical architecture is a reasonable proxy for industrial organization. A widely used project has network effects that increasing switching costs for its downstream users. How do exit and voice work in this context?

Leave a comment

February 21, 2015

The node.js fork — something new to think about

For Classics we are reading Albert Hirschman’s Exit, Voice, and Loyalty. Oddly, though normally I hear about ‘voice’ as an action from within an organization, the first few chapters of the book (including the introduction of the Voice concept itselt), are preoccupied with elaborations on the neoclassical market mechanism. Not what I expected.

I’m looking for interesting research use cases for BigBang, which is about analyzing the sociotechnical dynamics of collaboration. I’m building it to better understand open source software development communities, primarily. This is because I want to create a harmonious sociotechnical superintelligence to take over the world.

For a while I’ve been interested in Hadoop’s interesting case of being one software project with two companies working together to build it. This is reminiscent (for me) of when we started GeoExt at OpenGeo and Camp2Camp. The economics of shared capital are fascinating and there are interesting questions about how human resources get organized in that sort of situation. In my experience, there becomes a tension between the needs of firms to differentiate their products and make good on their contracts and the needs of the developer community whose collective value is ultimately tied to the robustness of their technology.

Unfortunately, building out BigBang to integrate with various email, version control, and issue tracking backends is a lot of work and there’s only one of me right now to both build the infrastructure, do the research, and train new collaborators (who are starting to do some awesome work, so this is paying off.) While integrating with Apache’s infrastructure would have been a smart first move, instead I chose to focus on Mailman archives and git repositories. Google Groups and whatever Apache is using for their email lists do not publish their archives in .mbox format, which is pain for me. But luckily Google Takeout does export data from folks’ on-line inbox in .mbox format. This is great for BigBang because it means we can investigate email data from any project for which we know an insider willing to share their records.

Does a research ethics issue arise when you start working with email that is openly archived in a difficult format, then exported from somebody’s private email? Technically you get header information that wasn’t open before–perhaps it was ‘private’. But arguably this header information isn’t personal information. I think I’m still in the clear. Plus, IRB will be irrelevent when the robots take over.

All of this is a long way of getting around to talking about a new thing I’m wondering about, the Node.js fork. It’s interesting to think about open source software forks in light of Hirschman’s concepts of Exit and Voice since so much of the activity of open source development is open, virtual communication. While you might at first think a software fork is definitely a kind of Exit, it sounds like IO.js was perhaps a friendly fork of just somebody who wanted to hack around. In theory, code can be shared between forks–in fact this was the principle that GitHub’s forking system was founded on. So there are open questions (to me, who isn’t involved in the Node.js community at all and is just now beginning to wonder about it) along the lines of to what extent a fork is a real event in the history of the project, vs. to what extent it’s mythological, vs. to what extent it’s a reification of something that was already implicit in the project’s sociotechnical structure. There are probably other great questions here as well.

A friend on the inside tells me all the action on this happened (is happening?) on the GitHub issue tracker, which is definitely data we want to get BigBang connected with. Blissfully, there appear to be well supported Python libraries for working with the GitHub API. I expect the first big hurdle we hit here will be rate limiting.

Though we haven’t been able to make integration work yet, I’m still hoping there’s some way we can work with MetricsGrimoire. They’ve been a super inviting community so far. But our software stacks and architecture are just different enough, and the layers we’ve built so far thin enough, that it’s hard to see how to do the merge. A major difference is that while MetricsGrimoire tools are built to provide application interfaces around a MySQL data backend, since BigBang is foremost about scientific analysis our whole data pipeline is built to get things into Pandas dataframes. Both projects are in Python. This too is a weird microcosm of the larger sociotechnical ecosystem of software production, of which the “open” side is only one (important) part.

Leave a comment

December 21, 2014

Imre Lakatos and programming as dialectic

My dissertation is about the role of software in scholarly communication. Specifically, I’m interested in the way software code is itself a kind of scholarly communication, and how the informal communications around software production represent and constitute communities of scientists. I see science as a cognitive task accomplished by the sociotechnical system of science, including both scientists and their infrastructure. Looking particularly at scientist’s use of communications infrastructure such as email, issue trackers, and version control, I hope to study the mechanisms of the scientific process much like a neuroscientist studies the mechanisms of the mind by studying neural architecture and brainwave activity.

To get a grip on this problem I’ve been building BigBang, a tool for collecting data from open source projects and readying it for scientific analysis.

I have also been reading background literature to give my dissertation work theoretical heft and to procrastinate from coding. This is why I have been reading Imre Lakatos’ Proofs and Refutations (1976).

Proofs and Refutations is a brilliantly written book about the history of mathematical proof. In particular, it is an analysis of informal mathematics through an investigation of the letters written by mathematicians working on proofs about the Euler characteristic of polyhedra in the 18th and 19th centuries.

Whereas in the early 20th century, based on the work of Russel and Whitehead and others, formal logic was axiomatized, prior to this mathematical argumentation had less formal grounding. As a result, mathematicians would argue not just substantively about the theorem they were trying to prove or disprove, but also about what constitutes a proof, a conjecture, or a theorem in the first place. Lakatos demonstrates this by condensing 200+ years of scholarly communication into a fictional, impassioned classroom dialog where characters representing mathematicians throughout history banter about polyhedra and proof techniques.

What’s fascinating is how convincingly Lakatos presents the progress of mathematical understanding as an example of dialectical logic. Though he doesn’t use the word “dialectical” as far as I’m aware, he tells the story of the informal logic of pre-Russellian mathematics through dialog. But this dialog is designed to capture the timeless logic behind what’s been said before. It takes the reader through the thought process of mathematical discovery in abbreviated form.

I’ve had conversations with serious historians and ethnographers of science who would object strongly to the idea of a history of a scientific discipline reflecting a “timeless logic”. Historians are apt to think that nothing is timeless. I’m inclined to think that the objectivity of logic persists over time much the same way that it persists over space and between subjects, even illogical ones, hence its power. These are perhaps theological questions.

What I’d like to argue (but am not sure how) is that the process of informal mathematics presented by Lakatos is strikingly similar to that used by software engineers. The process of selecting a conjecture, then of writing a proof (which for Lakatos is a logical argument whether or not it is sound or valid), then having it critiqued with counterexamples, which may either be global (counter to the original conjecture) or local (counter to a lemma), then modifying the proof, then perhaps starting from scratch based on a new insight… all this reads uncannily like the process of debugging source code.

The argument for this correspondence is strengthened by later work in theory of computation and complexity theory. I learned this theory so long ago I forget who to attribute it to, but much of the foundational work in computer science was the establishment of a correspondence between classes of formal logic and classes of programming languages. So in a sense its uncontroversial within computer science to consider programs to be proofs.

As I write I am unsure whether I’m simply restating what’s obvious to computer scientists in an antiquated philosophical language (a danger I feel every time I read a book, lately) or if I’m capturing something that could be an interesting synthesis. But my point is this: that if programming language design and the construction of progressively more powerful software libraries is akin to the expanding of formal mathematical knowledge from axiomatic grounds, then the act of programming itself is much more like the informal mathematics of pre-Russellian mathematics. Specifically, in that it is unaxiomatic and proofs are in play without necessarily being sound. When we use a software system, we are depending necessarily on a system of imperfected proofs that we fix iteratively through discovered counterexamples (bugs).

Is it fair to say, then, that whereas the logic of software is formal, deductive logic, the logic of programming is dialectical logic?

Bear with me; let’s presume it is. That’s a foundational idea of my dissertation work. Proving or disproving it may or may not be out of scope of the dissertation itself, but it’s where it’s ultimately headed.

The question is whether it is possible to develop a formal understanding of dialectical logic through a scientific analysis of the software collaboration. (see a mathematical model of collective creativity). If this could be done, then we could then build better software or protocols to assist this dialectical process.

Leave a comment

September 17, 2014

technical work

Dipping into Julian Orr’s Talking about Machines, an ethnography of Xerox photocopier technicians, has set off some light bulbs for me.

First, there’s Orr’s story: Orr dropped out of college and got drafted, then worked as a technician in the military before returning to school. He paid the bills doing technical repair work, and found it convenient to do his dissertation on those doing photocopy repair.

Orr’s story reminds me of my grandfather and great-uncle, both of whom were technicians–radio operators–during WWII. Their civilian careers were as carpenters, building houses.

My own dissertation research is motivated by my work background as an open source engineer, and my own desire to maintain and improve my technical chops. I’d like to learn to be a data scientist; I’m also studying data scientists at work.

Further fascinating was Orr’s discussion of the Xerox technician’s identity as technicians as opposed to customers:

The distinction between technician and customer is a critical division of this population, but for technicians at work, all nontechnicians are in some category of other, including the corporation that employs the technicians, which is seen as alien, distant, and only sometimes an ally.

It’s interesting to read about this distinction between technicians and others in the context of Xerox photocopiers when I’ve been so affected lately by the distinction between tech folk and others and data scientists and others. This distinction between those who do technical work and those who they serve is a deep historical one that transcends the contemporary and over-computed world.

I recall my earlier work experience. I was a decent engineer and engineering project manager. I was a horrible account manager. My customer service skills were abysmal, because I did not empathize with the client. The open source context contributes to this attitude, because it makes a different set of demands on its users than consumer technology does. One gets assistance with consumer grade technology by hiring a technician who treats you as a customer. You get assistance with open source technology by joining the community of practice as a technician. Commercial open source software, according to the Pentaho beekeeper model, is about providing, at cost, that customer support.

I’ve been thinking about customer service and reflecting on my failures at it a lot lately. It keeps coming up. Mary Gray’s piece, When Science, Customer Service, and Human Subjects Research Collide explicitly makes the connection between commercial data science at Facebook and customer service. The ugly dispute between Gratipay (formerly Gittip) and Shanley Kane was, I realized after the fact, a similar crisis between the expectations of customers/customer service people and the expectations of open source communities. When “free” (gratis) web services display a similar disregard for their users as open source communities do, it’s harder to justify in the same way that FOSS does. But there are similar tensions, perhaps. It’s hard for technicians to empathize with non-technicians about their technical problems, because their lived experience is so different.

It’s alarming how much is being hinged on the professional distinction between technical worker and non-technical worker. The intra-technology industry debates are thick with confusions along these lines. What about marketing people in the tech context? Sales? Are the “tech folks” responsible for distributional justice today? Are they in the throws of an ideology? I was reading a paper the other day suggesting that software engineers should be held ethically accountable for the implicit moral implications of their algorithms. Specifically the engineers; for some reason not the designers or product managers or corporate shareholders, who were not mentioned. An interesting proposal.

Meanwhile, at the D-Lab, where I work, I’m in the process of navigating my relationship between two teams, the Technical Team, and the Services Team. I have been on the Technical team in the past. Our work has been to stay on top of and assist people with data science software and infrastructure. Early on, we abolished regular meetings as a waste of time. Naturally, there was a suspicion expressed to me at one point that we were unaccountable and didn’t do as much work as others on the Services team, which dealt directly with the people-facing component of the lab–scheduling workshops, managing the undergraduate work-study staff. Sitting in on Services meetings for the first time this semester, I’ve been struck by how much work the other team does. By and large, it’s information work: calendering, scheduling, entering into spreadsheets, documenting processes in case of turnover, sending emails out, responding to emails. All important work.

This is exactly the work that information technicians want to automate away. If there is a way to reduce the amount of calendering and entering into spreadsheets, programmers will find a way. The whole purpose of computer science is to automate tasks that would otherwise be tedious.

Eric S. Raymond’s classic (2001) essay How to Become a Hacker characterizes the Hacker Attitude, in five points:

The world is full of fascinating problems waiting to be solved.
No problem should ever have to be solved twice.
Boredom and drudgery are evil.
Freedom is good.
Attitude is no substitute for competence.

There is no better articulation of the “ideology” of “tech folks” than this, in my opinion, yet Raymond is not used much as a source for understanding the idiosyncracies of the technical industry today. Of course, not all “hackers” are well characterized by Raymond (I’m reminded of Coleman’s injunction to speak of “cultures of hacking”) and not all software engineers are hackers (I’m sure my sister, a software engineer, is not a hacker. For example, based on my conversations with her, it’s clear that she does not see all the unsolved problems with the world to be intrinsically fascinating. Rather, she finds problems that pertain to some human interest, like children’s education, to be most motivating. I have no doubt that she is a much better software engineer than I am–she has worked full time at it for many years and now works for a top tech company. As somebody closer to the Raymond Hacker ethic, I recognize that my own attitude is no substitute for that competence, and hold my sister’s abilities in very high esteem.)

As usual, I appear to have forgotten where I was going with this.

Leave a comment

June 24, 2014

Protected: some ruminations regarding ‘openness’

Enter your password to view comments.

December 16, 2013

Reflections on the Berkeley Institute for Data Science (BIDS) Launch

Last week was the launch of the Berkeley Institute for Data Science.

Whatever might actually happen as a result of the launch, what was said at the launch was epic.

Vice Chancellor of research Graham Flemming introduced Chancellor Nicholas Dirks for the welcoming remarks. Dirks is UC Berkeley’s 10th Chancellor. He succeeded Robert Birgeneau, who resigned gracefully shortly after coming under heavy criticism for his handling of Occupy Cal, the Berkeley campus’ chapter of the Occupy movement. He was distinctly unsympathetic to the protesters, and there was a widely circulated petition declaring a lack of confidence in his leadership. Birgeneau is a physicist. Dirks is an anthropologist who has championed postcolonial approaches. Within the politics of the university, which are a microcosm of politics at large, this signalling is clear. Dirks’ appointment was meant to satisfy the left wing protesters, most of whom have been trained in softer social sciences themselves. Critical reflection on power dynamics and engagement in activism–which is often associated with leftist politics–are, at least formally, accepted by the university administration as legitimate. Birgeneau would subsequently receive awards for his leadership in drawing more women into the sciences and aiding undocumented students.

Dirks’ welcoming remarks were about the great accomplishments of UC Berkeley as a research institution and the vague but extraordinary potential of BIDS. He is grateful, as we all are, for the funding from the Moore and Sloan foundations. I found his remarks unspecific, and I couldn’t help but wonder what his true thoughts were about data science in the university. Surely he must have an opinion. As an anthropologist, can he consistently believe that data science, especially in the social sciences, is the future?

Vicki Chandler, Chief Program Officer from the Moore Foundation, was more lively. Pulling no punches, she explained that the purpose of BIDS is to shake up scientific culture. Having hung out in Berkeley in the 60’s and attended it as an undergraduate in the 70’s, she believes we are up for it. She spoke again and again of “revolution”. There is ambiguity in this. In my experience, faculty are divided on whether they see the proposed “open science” changes as imminent or hype, as desirable or dangerous. More and more I see faculty acknowledge that we are witnessing the collapse of the ivory tower. It is possible that the BIDS launch is a tipping point. What next? “Let the fun begin!” concluded Chandler.

Saul Perlmutter, Nobel laureate physicist and front man of the BIDS co-PI super group, gave his now practiced and condensed pitch for the new Institute. He hit all the high points, pointing out not only the potential of data science but the importance of changing the institutions themselves. Rethinking the peer-review journal from scratch, he said, we should focus more on code reuse. Software can be a valid research output. As much as open science is popular among the new generation of scientists, this is a bold statement for somebody with such credibility within the university. He even said that the success of open source software is what gives us hope for the revolutionary new kind of science BIDS is beginning. Two years ago, this was a fringe idea. Perlmutter may have just made it mainstream.

Notably, he also engaged with the touchy academic politics, saying that data science could bring diversity to the sciences (though he was unspecific about the mechanism for this). He expounded on the important role of ethnography in evaluating the Institute to identify the bottlenecks to its unlocking its potential.

The man has won at physics and is undoubtedly a scientist par excellance. Perhaps Perlmutter sees the next part of his legacy as the bringing of the university system into the 21st century.

David Culler, Chair of the Electrical Engineering and Computer Science department, then introduced a number of academic scientists, each with impressive demonstrations about how data science could be applied to important problems like climate change and disaster reduction. Much of this research depends on using the proliferation of hand-held mobile devices as sensors. University science, I realized while watching this, is at its best when doing basic research about how to save humanity from nature or ourselves.

But for me the most interesting speakers in the first half of the launch were luminaries Peter Norvig and Tim O’Reilly, each giants in their own right and welcome guests to the university.

Culler introduced Norvig, Director of Research at Google, by crediting him as one of the inventors of the MOOC. I know his name mainly as a co-author of “Artificial Intelligence: A Modern Approach,” which I learned and taught from as an undergraduate. Amazingly, Norvig’s main message is about the economics of the digital economy. Marginal production is cheap, cost of communication is cheap, and this leads to an accumulation of wealth. Fifty percent of jobs are predicted to be automated away in the coming decades. He is worried about the 99%–freely using Occupy rhetoric. What will become of them? Norvig’s solution, perhaps stated tongue in cheek, is that everyone needs to become a data scientist. More concretely, he has high hopes for hybrid teams of people and machines, that all professions will become like this. By defining what academic data science looks like and training the next generation of researchers, BIDS will have a role in steering the balance of power between humanity and the machines–and the elite few who own them.

His remarks hit home. He touched on anxieties that are as old as the Industrial Revolution: is somebody getting immensely rich off of these transformations, but not me? What will my role be in this transformed reality? Will I find work? These are real problems and Norvig was brave to bring them up. The academics in the room were not immune from these anxieties either, as they watch the ivory tower crumble around them. This would come up again later in the day.

I admire him for bringing up the point, and I believe he is sincere. I’d heard him make the same points when he was on a panel with Neil Stephenson and Jaron Lanier a month or so earlier. I can’t help but be critical of Norvig’s remarks. Is he covering his back? Many university professors are seeing MOOCs themselves as threatening to their own careers. It is encouraging that he sees the importance of hybrid human/machine teams. If the machines are built on Google infrastructure, doesn’t this contribute to the same inequality he laments, shifting power away from teachers to the 1% at Google? Or does he foresee a MOOC-based educational boom?

He did not raise the possibility that human/machine hybridity is already the status quo–that, for example, all information workers tap away at these machines and communicate with each other through a vast technical network. If he had acknowledged that we are all cyborgs already, he would have had to admit that hybrid teams of humans and machines are as much the cause of as solution to economic inequality. Indeed, this relationship between human labor and mechanical capital is precisely the same as the one that created economic inequality in the Industrial Revolution. When the capital is privately owned, the systems of hybrid human/machine productivity favor the owner of the machines.

I have high hopes that BIDS will address through its research Norvig’s political concern. It is certainly on the mind of some of its co-PI’s, as later discussion would show. But to address the problem seriously, it will have to look at the problem in a rigorous way that doesn’t shy away from criticism of the status quo.

The next speaker, Tim O’Reilly, is a figure who fascinates me. Culler introduced him as a “God of the Open Source Field,” which is poetically accurate. Before coming to academia, I worked on Web 2.0 open source software platforms for open government. My career was defined by a string of terms invented and popularized by O’Reilly, and to a large extent I’m still a devotee of his ideas. But as a practitioner and researcher, I’ve developed a nuanced view of the field that I’ve tried to convey in the course on Open Collaboration and Peer Production I’ve co-instructed with Thomas Maillart this semeser.

O’Reilly came under criticism earlier this year from Evgeny Morozov, who attacked him for marketing politically unctuous ideas while claiming to be revolutionary. He focuses on his promotion of ‘open source’ over and against Richard Stallman’s explicitly ethical and therefore contentious term ‘free software‘. Morozov accuses O’Reilly of what Tom Scocca has recently defined as rhetorical smarm–dodging specific criticism by denying the appropriateness of criticism in general. O’Reilly has disputed the Morozov piece. Elsewhere he has presented his strategy as a ‘marketer of big ideas‘, and his deliberate promoting of more business-friendly ‘open source’ rhetoric. This ideological debate is itself quite interesting. Geek anthropologist Chris Kelty observes that it is participation in this debate, more so than an adherence to any particular view in it, that characterizes the larger “movement,” which he names the recursive public.

Despite his significance to me, with an open source software background, I was originally surprised when I heard Tim O’Reilly would be speaking at the BIDS launch. O’Reilly had promoted ‘open source’ and ‘Web 2.0’ and ‘open government’, but what did that have to do with ‘data science’?

So I was amused when Norvig introduced O’Reilly by saying that he didn’t know he was a data scientist until the latter wrote an article in Forbes (in November 2011) naming him one of “The World’s 7 Most Powerful Data Scientists.” Looking at the Google Trends data, we can see that November 2011 just about marks the rise of ‘data science’ from obscurity to popularity. Is Tim O’Reilly responsible for the rise of ‘data science’?

Perhaps. O’Reilly’s explained that he got into data science by thinking about the end game for open source. As open source software becomes commodified (which for him I think means something like ‘subject to competitive market pressure), what becomes valuable is the data. And so he has been promoting data science in industry and government, and believes that the university can learn important lessons from those fields as well. He held up his Moto X phone, explained how it is ‘always listening’ and so can facilitate services like Google Now. All this would go towards a system with greater collective intelligence, a self-regulating system that would make regulators obsolete.

Looking at the progression of the use of maps, from paper to digital to being embedded in services and products like self-driving cars, O’Reilly agrees with Norvig about the importance of human-machine interaction. In particular, he believes that data scientists will need to know how to ask the right questions about data, and that this is the future of science. “Others will be left behind,” he said, not intending to sound foreboding.

I thought O’Reilly presented the combination of insight and boosterism I expected. To me, his presence at the BIDS launch meant to me that O’Reilly’s significance as a public intellectual has progressed from business through governance and now to scientific thinking itself. This is wonderful for him but means that his writings and influence should be put under the scrutiny we would have for an academic peer. It is appropriate to call him out for glossing over the privacy issues around a mobile phone that is “always listening,” or the moral implications of the obsolescence of regulators for equality and justice. Is his objectivity compromised by the fact that he runs a publishing company that sells complementary goods to the vast supply of publicly available software and data? Does his business agenda incentivize him to obscure the subtle differences between various segements of his market? Are we in the university victims of that obscurity as we grapple with multiple conflated meanings of “openness” in software and science (open to scrutiny and accountability, vs. open for appropriation by business, vs. open to meritocratic contribution)? As we ask these questions, we can be grateful to O’Reilly for getting us this far.

I’ve emphasized the talks given by Norvig and O’Reilly because they exposed what I think are some of the most interesting aspects of BIDS. One way or another, it will be revolutionary. Its funders will be very disappointed if it is not. But exactly how it is revolutionary is undetermined. The fact that BIDS is based in Berkeley, and not in Google or Microsoft or Stanford, guarantees that the revolution will not be an insipid or smarmy one which brushes aside political conflict or morality. Rather, it promises to be the site of fecund political conflict. “Let the fun begin!” said Chandler.

The opening remarks concluded and we broke for lunch and poster sessions–the Data Science Faire (named after O’Reilly’s Maker Faire…

@sbenthall And that in turn was a reference to the Renaissance Faire. Cultural references get buried like potsherds at ancient Troy.

— Tim O'Reilly (@timoreilly) December 12, 2013

What followed was a fascinating panel discussion led by astrophysicist Josh Bloom, historian and university administrator Cathryn Carson, computer science professor and AMP Lab director Michael Franklin, and Deb Agrawal, a staff computer scientist for Lawrence Berkeley National Lab.

Bloom introduced the discussion jokingly as “just being among us scientists…and whoever is watching out there on the Internet,” perhaps nodding to the fact that the scientific community is not yet fully conscious that their expectations of privileged communication are being challenged by a world and culture of mobile devices that are “always listening.”

The conversation was about the role of people in data science.

Carson spoke as a domain scientist–a social scientist who studies scientists. Noting that social scientists tend to work in small teams lead by graduate students motivated by their particular questions, she said her emphasis was on the people asking questions. Agrawal noted that the number of people needed to analyze a data set does not scale with the size of data, but the complexity of data–a practical point. (I’d argue that theoretically we might want to consider “size” of data in terms of its compressibility–which would reflect its complexity. This ignores a number of operational challenges.) For Franklin, people are a computational resource that can be part of a crowd-sourced process. In that context, the number of people needed does indeed scale with the use of people as data processors and sensors.

Perhaps to follow through on Norvig’s line of reasoning, Bloom then asked pointedly if machines would ever be able to do the asking of questions better than human beings. In effect: Would data science make data scientists obsolete?

Nobody wanted to be the first to answer this question. Bloom had to repeat it.

Agrawal took a first stab at it. The science does not come from the data; the scientist chooses models and tests them. This is the work of people. Franklin agreed and elaborated–the wrong data too early can ruin the science. Agrawal noted that computers might find spurious signals in the noise.

Personally, I find these unconvincing answers to Bloom’s question. Algorithms can generate, compare, and test alternative models against the evidence. Noise can, with enough data, be filtered away from the signal. To do so pushes the theoretical limits of computing and information theory, but if Franklin is correct in his earlier point that people are part of the computational process, then there is no reason in principle why these tasks too might not be performed if not assisted by computers.

Carson, who had been holding back her answer to listen to the others, had a bolder proposal: rather than try to predict the future of science, why not focus on the task of building that future?

In another universe, at that moment someone might have asked the one question no computer could have answered. “If we are building the new future of science, what should we build? What should it look like? And how do we get there?” But this is the sort of question disciplined scientists are trained not to ask.

Instead, Bloom brought things back to practicality: we need to predict where science will go in order to know how to educate the next generation of scientists. Should we be focusing on teaching them domain knowledge, or on techniques?

We have at the heart of BIDS the very fundamental problem of free will. Bloom suggests that if we can predict the future, then we can train students in anticipation of it. He is an astrophysics and studies stars; he can be forgiven for the assumption that bodies travel in robust orbits. This environment is a more complex one. How we choose to train students now will undoubtedly affect how science evolves, as the process of science is at once the process of learning and training new scientists. His descriptive question then falls back to the normative one: what science are we trying to build toward?

Carson was less heavy-handed than I would have been in her position. Instead, she asked Bloom how he got interested in data science. Bloom recalled his classical physics training, and the moment he discovered that to answer the kinds of questions he was asking, he would need new methods.

Franklin chimed in on the subject of education. He has heard it said that everyone in the next generation should learn to code. With marked humility for his discipline, he said he did not agree with this. But he said he did believe that everyone in the next generation should learn data literacy, echoing Norvig.

Bloom opened the discussion to questions from the audience.

The first was about the career paths for methodologists who write software instead of papers. How would BIDS serve them? It was a soft ball question which the panel hit out of the park. Bloom noted that the Moore and Sloan funders explicitly asked for the development of alternative metrics to measure the impact of methodologist contributions. Carson said that even with the development of metrics, as an administrator she knew it would be a long march through the institution to get those metrics recognized. There was much work to be done. “Universities got to change,” she rallied. “If we don’t change, Berkeley’s being great in the past won’t make it great in the future,” referring perhaps to the impressive history of research recounted by Chancellor Dirks. There was applause. Franklin pointed out that the open source community has its own metrics already. In some circles some of his students are more famous than he is for developing widely used software. Investors are often asking him when his students will graduate. The future, it seems, is bright for methodologists.

At this point I lost my Internet connection and had to stop livetweeting the panel; those tweets are the notes from which I am writing these reflections. Recalling from memory, there was one more question from Kristina Kangas, a PhD student in Integrative Biology. She cited research about how researchers interpreting data wind up reflecting back their own biases. What did this mean for data science?

Bloom gave Carson the last word. It is a social scientific fact, she said, that scientists interpret data in ways that fit their own views. So it’s possible that there is no such thing as “data literacy”. These are open questions that will need to be settled by debate. Indeed, what then is data science after all? Turning to Bloom, she said, “I told you I would be making trouble.”

Leave a comment

September 25, 2013

Sample UC Berkeley School of Information Preliminary Exam

I’m in the PhD program at UC Berkeley’s School of Information. Today, I had to turn in my Preliminary Exam, a 24-hour open book, open note examination on the chosen subject areas of my coursework. I got to pick an exam committee of three faculty members, one for each area of speciality. My committee consisted of: Doug Tygar, examining me on Information System Design; John Chuang, the committee chair, examining me on Information Economics and Policy; and Coye Cheshire, examining me on Social Aspects of Information. Each asked me a question corresponding to their domain; generously, they targeted their questions at my interests.

In keeping with my personal policy of keeping my research open, and because I learned while taking the exam the unvielling of @horse_ebooks and couldn’t resist working it into the exam, and because maybe somebody enrolled in or thinking about applying for our PhD program might find it interesting, I’m posting my examination here (with some webifying of links).

At the time of this posting, I don’t yet know if I have passed.

1. Some e-mail spam detectors use statistical machine learning methods to continuously retrain a classifier based on user input (marking messages as spam or ham). These systems have been criticized for being vulnerable to mistraining by a skilled adversary who sends “tricky spam” that causes the classifier to be poisoned. Exam question: Propose tests that can determine how vulnerable a spam detector is to such manipulation. (Please limit your answer to two pages.)

Tests for classifier poisoning vulnerability in statistical spam filtering systems can consist of simulating particular attacks that would exploit these vulnerabilities. Many of these tests are described in Graham-Cumming, “Does Bayesian Poisoning exist?”, 2006 [pdf.gz], including:

For classifiers trained on a “natural” training data set D and a modified training data set D’ that has been generated to include more common words in messages labeled as spam, compare specificity, sensitivity, or more generally the ROC plots of each for performance. This simulates an attack that aims to increase the false positive rate by making words common to hammy messages be evaluated as spammy.
Same as above, but construct D’ to include many spam messages with unique words. This exploits a tendency in some Bayesian spam filters to measure the spamminess of a word by the percentage of spam messages that contain it. If successful, the attack dilutes the classifier’s sensitivity to spam over a variety of nonsense features, allowing more mundane spam to get through the filter as false negatives.

These two tests depend on increasing the number of spam messages in the data set in a way that strategically biases the classifier. This is the most common form of mistraining attack. Interestingly, these attacks assume that users will correctly label the poisoning messages as spam. So these attacks depend on weaknesses in the filter’s feature model and improper calibration to feature frequency.

A more devious attack of this kind would depend on deceiving the users of the filtering system to mislabel spam as ham or, more dramatically, acknowledge true ham that drives up the hamminess of features normally found in spam.

An example of an attack of this kind (though perhaps not intended as an attack per se) is @Horse_ebooks, a Twitter account that gained popularity while posting randomly chosen bits of prose and, only occasionally, links to purchase low quality self-help ebooks. Allegedly, it was originally a spam bot engaged in a poisoning/evasion attack, but developed a cult following who appreciated its absurdist poetic style. Its success (which only grew after the account was purchased by New York based performance artist Jacob Bakkila in 2011) inspired an imitative style of Twitter activity.

Assuming Twitter is retraining on this data, this behavior could be seen as a kind of poisoning attack, albeit by filter’s users against the system itself. Since it may benefit some Twitter users to have an inflated number of “followers” to project an exaggerated image of their own importance, it’s not clear whether it is in the interests of the users to assist in spam detection, or to sabotage it.

Whatever the interests involved, testing for this kind of vulnerability to this “tricky ham” attack can be conducted in a similar way to the other attacks: by padding the modified data set D’ with additional samples with abnormal statistical properties (e.g noisy words and syntax), this time labeled as ham, and comparing the classifiers along normal performance metrics.

2. Analytical models of cascading behavior in networks, e.g., threshold-based or contagion-based models, are well-suited for analyzing the social dynamics in open collaboration and peer production systems. Discuss.

Cascading behavior models are well-suited to modeling information and innovation diffusion over a network. They are well-suited to analyzing peer production systems to the extent that their dynamics consist of such diffusion over a non-trivial networks. This is the case when production is highly decentralized. Whether we see peer production as centralized or not depends largely on the scale of analysis.

Narrowing in, consider the problem of recruiting new participants to an ongoing collaboration around a particular digital good, such as an open source software product or free encyclopedia. We should expect the usual cascading models to be informative about the awareness and adoption of the good. But in most cases awareness and adoption are only necessary not sufficient conditions for active participation in production. This is because, for example, contribution may involve incurring additional costs and so be subject to different constraints than merely consuming or spreading the word about a digital good.

Though threshold and contagion models could be adapted to capture some of this reluctance through higher thresholds or lower contagion rates, these models fail to closely capture the dynamics of complex collaboration because they represent the cascading behavior as homogeneous. In many open collaborative projects, contributions (and the individual costs of providing them) are specialized. Recruited participants come equipped with their unique backgrounds. (von Krogh, G., Spaeth, S. & Lakhani, K. R. “Community, joining, and specialization in open source software innovation: a case study.” (2003)) So adapting behavior cascade models to this environment would require, at minimum, parameterization of per node capacities for project contribution. The participants in complex collaboration fulfil ecological niches more than they reflect the dynamics of large networked populations.

Furthermore, at the level of a closely collaborative on-line community, network structure is often trivial. Projects may be centralized around a mailing list, source code repository, or public forum that effectively makes the communication network a large clique of all participants. Cascading behavior models will not help with analysis of these cases.

On the other hand, if we zoom out to look at open collaboration as a decentralized process–say, of all open source software developers, or of distributed joke production on Weird Twitter–then network structure becomes important again, and the effects of diffusion may dominate the internal dynamics of innovation itself. Whether or not a software developer chooses to code in Python or Ruby, for example, may well depend on a threshold of the developer’s neighbors in a communication network. These choices allow for contagious adoption of new libraries and code.

We could imagine a distributed innovation system in which every node maintained its own repository of changes, some of which it developed on its own and others it adapted from its neighbors. Maybe the network of human innovators, each drawing from their experiences and skills while developing new ones in the company of others, is like this. This view highlights the emergent social behavior of open innovation, putting the technical architecture (which may affect network structure but could otherwise be considered exogenous) in the background. (See next exam question).

My opinion is that while cascading behavior models may in decentralized conditions capture important aspects of the dynamics of peer production, the basic models will fall short because they don’t consider the interdependence of behaviors. Digital products are often designed for penetration in different networks. For example, the choice of programming language in which to implement ones project influences its potential for early adoption and recruitment. Analytic modeling of these diffusion patterns with cascade models could gain from augmenting the model with representations of technical dependency.

3. Online communities present many challenges for governance and collective behavior, especially in common pool and peer-production contexts. Discuss the relative importance and role of both (1) site architectures and (2) emergent social behaviors in online common pool and/or peer-production contexts. Your answer should draw from more than one real-world example and make specific note of key theoretical perspectives to inform your response. Your response should take approximately 2 pages.

This question requires some unpacking. The sociotechnical systems we are discussing are composed of both technical architecture (often accessed as a web site, i.e. a “location” accessed through HTTP via a web browser) and human agents interacting socially with each other in a way mediated by the architecture (though not exclusively, c.f. Coleman’s work on in person meetings in hacker communities). If technology is “a man-made means to an end” (Heidegger, Question Concerning Technology), then we can ask of the technical architecture: which man, whose end? So questioning the roles of on-line architecture and emergent behaviors brings us to look at how the technology itself was the result of emergent social behavior of its architects. For we can consider “importance” from either the perspective of the users or that of the architects. These perspectives reflect different interests and so will have different standards for evaluating the importance of its components. (c.f. Habermas, Knowledge and Human Interests)

Let us consider socio-technical systems along a spectrum between two extremes. At one extreme are certain prominent systems–e.g. Yelp and Amazon cultivating common pools of reviews–for which the architects and the users are distinct. The site architecture is a means to the ends of the architects, effected through the stimulation of user activity.

Drawing on Winner (“Do artifacts have politics?”), we can see that this socio-technical arrangement establishes a particular pattern of power and authority. Architects have direct control over the technology, which enables to the limits of its affordances user activity. Users can influence architects through the information their activity generates (often collected through the medium of the technical architecture itself), but have no direct coercive control. Rather, architects design the technology to motivate certain desirable activity using inter-user feedback mechanisms such as ways of expressing gratitude or comparing one’s performance with others. (see Cheshire and Antin, “The Social Psychological Effects of Feedback on the Production of Internet Information Pools”, 2008) In such a system, users can only gain control of their technical environment by exploiting vulnerabilities in the architecture in adversarial moves that looks a bit like security breaches. (See the first exam question for an example of user-driven information sabotage.) More likely, the vast majority of users will choose to free ride on any common pool resources made available and exit the system when inconvenienced, as the environment is ultimately a transactional one of service provider and consumer.

In these circumstances, it is only by design that social behaviors lead to peer production and common pools of resources. Technology, as an expression of the interests of the architects, plays a more important role than social emergence. To clarify the point, I’d argue that Facebook, despite hosting enormous amounts of social activity, does not enable significant peer production because its main design goals are to drive the creation of proprietary user data and ad clicks. Twitter, in contrast, has from the beginning been designed as a more open platform. The information shared on it is often less personal, so activity more easily crosses the boundary from private to public, enabling collective action (see Bimber et al., “Reconceptuaizing Collective Action in the Contemporary Media Environment”, 2005) It has facilitated (with varying consistency) the creation of third party clients, as well as applications that interact with its data but can be hosted as separate sites.

This open architecture is necessary but not sufficient for emergent common pool behavior. But the design for open possibilities is significant. It enables the development of novel, intersecting architectures to support the creation of new common pools. Taking Weird Twitter, framed as a peer production community for high quality tweets, as an example, we can see how the service Favstar (which aggregates and ranks tweets that have been highly “starred” and retweeted, and awards congratulatory tweets as prizes) provides historical reminders and relative rankings of tweet quality. Thereby facilitates a culture of production. Once formed, such a culture can spread and make use of other available architecture as well. Weird Twitter has inspired Twitter: The Comic, a Tumblr account illustrating “the greatest tweets of our generation.”

Consider another extreme case, the free software community that Kelty identifies as the recursive public. (Two Bits: The Cultural Significance of Free Software) In an idealized model, we could say that in this socio-technical system the architects and the users are the same.

The artifacts of the recursive public have a different politics than those at the other end of our spectrum, because the coercive aspects of the architectural design are the consequences of the emergent social behavior of those affected by it. Consequently, technology created in this way is rarely restrictive of productive potential, but on the contrary is designed to further empower the collaborative communities that produced it. The history of Unix, Mozilla, Emacs, version control systems, issue tracking software, Wikimedia, and the rest can be read as the historical unfolding of the human interest in an alternative, emancipated form of production. Here, the emergent social behavior claims its importance over and above the particulars of the technology itself.

Leave a comment

August 5, 2013

Reinventing wheels with Dissertron

I’ve found a vehicle for working on the Dissertron through the website of the course I’ll be co-teaching this Fall on Open Collaboration and Peer Production.

In the end, I went with Pelican, not Hyde. It was a difficult decision (because I did like the idea of supporting an open source project coming out of Chennai, especially after reading Coding Places. But I had it on good authority that Pelican was more featureful with cleaner code. So here I am.

The features I have in mind are crystalizing as I explore the landscape of existing tools more. This is my new list:

Automatically include academic metadata on each Dissertron page so it’s easy to slurp it into Zotero.
Include the Hypothes.is widget for annotations. I think Hypothes.is will be better for commenting that Disqus because it does annotations in-line, as opposed to comments in the footer. It also uses the emerging W3C Open Annotation standard. I’d like this to be as standards based as possible.
Use citeproc-js to render citations in the browser cleanly. I think this handles the issue of in-line linked academic citations without requiring a lot of manual work. The citeproc-js looks like it’s come out of Zotero as well. Since Elsevier bought Mendeley, Zotero seems like the more reliable ally to pick for independent scholarship.
Trickiest is going to be porting a lot of features from jekyll-scholar into a Pelican plug-in. I really want jekyll-scholar‘s bibliographic management. But I’m a little worried that Pelican isn’t well-designed for that sort of flexibility in theming. More soon.
I’m interested in trying to get the HTML output of Dissertron as close as possible to emerging de facto standards on what on-line scholarship is like. I’ve asked about what PLOS ONE does about this. The answer sounds way complicated: a tool chain the goes from Latex to Word Docs to NLM 3.0 XML (which I didn’t even know was a thing), and at last into HTML. I’m trying to start from Markdown because I think it’s a simple markup language for the future, but I’m not deep enough in that tool chain to understand how to replicate its idiosyncracies.

If I could have all these nice things, and maybe a pony, then I would be happy and have no more excuses for not actually doing research, as opposed to obsessing about the tooling around independent publishing.

Leave a comment

June 7, 2013

Planning the Dissertron

In my PhD program, I’ve recently finished my coursework and am meant to start focusing on research for my dissertation. Maybe because of the hubbub around open access research, maybe because I still see myself as a ‘hacker’, maybe because it’s somehow recursively tied into my research agenda, or because I’m an open source dogmatic, I’ve been fantasizing about the tools and technology of publication that I want to work on my dissertation with.

For this project, which I call the Dissertron, I’ve got a loose bundle of requirements feature creeping its way into outer space:

Incremental publishing of research and scholarship results openly to the web.
Version control.
Mathematical rendering a la LaTeX.
Code highlighting a la the hacker blogs.
In browser rendering of data visualizations with d3, where appropriate.
Site is statically generated from elements on the file system, wherever possible.
Machine readable metadata on the logical structure of the dissertation argument, which gets translated into static site navigation elements.
Easily generated glossary with links for looking up difficult terms in-line (or maybe in-margin)
A citation system that takes advantage of hyperlinking between resources wherever possible.
Somehow, enable commenting. But more along the lines of marginalia comments (comments on particular lines or fragments of text) rather than blog comments. “Blog” style comments should be facilitated as notes on separately hosted dissertrons, or maybe a dissertron hub that aggregates and coordinates pollination of content between dissertrons.

This is a lot, and arguably just a huge distraction from working on my dissertation. However, it seems like this or something like it is a necessary next step in the advance of science and I don’t see how I really have much choice in the matter.

Unfortunately, I’m traveling, so I’m going to miss the PLOS workshop on Markdown for Science tomorrow. That’s really too bad, because Scholarly Markdown would get me maybe 50% of the way to what I want.

Right now the best tool chain I can imagine for this involves Scholarly Markdown, run using Pandoc, which I just now figured out is developed by a philosophy professor at Berkeley. Backing it by a Git repository would allow for incremental changes and version control.

Static site generation and hosting is a bit trickier. I feel like GitHub’s support of Jekyll make it a compelling choice, but hacking it to make it fit into the academic frame I’m thinking in might be more trouble than its worth. While it’s a bit of an oversimplification to say this, my impression is that at my university at least there is a growing movement to adopt Python as the programming language of choice for scientific computing. The exceptions seem to be people in the Computer Science department that are backing Scala.

(I like both languages and so can’t complain, except that it makes it harder to do interdisciplinary research if there is a technical barrier in their toolsets. As more of scientific research becomes automated, it is bound to get more crucial that scientific processes (broadly speaking) inter-operate. I’m incidentally excited to be working on these problems this summer for Berkeley’s new Social Science Data Lab. A lot of interesting architectural design is being masterminded by Aaron Culich, who manages the EECS department’s computing infrastructure. I’ve been meaning to blog about our last meeting for a while…but I digress)

Problem is, neither Python or Scala is Ruby, and Ruby is currently leading the game (in my estimate, somebody tell me if I’m wrong) in flexible and sexy smooth usable web design. And then there’s JavaScript, improbably leaking into the back end of the software stack after overflowing the client side.

So for the aspiring open access indie web hipster hacker science self-publisher, it’s hard to navigate the technical terrain. I’m tempted to string together my own rig depending mostly on Pandoc, but even that’s written in Haskell.

These implementation-level problems suggest that the problem needs to be pushed up a level of abstraction to the question of API and syntax standards around scientific web publishing. Scholarly Markdown can be a standard, hopefully with multiple implementations. Maybe there needs to be a standard around web citations as well (since in an open access world, we don’t need the same level of indirection between a document and the works it cites. Like blog posts, web publications can link to the content it derives from directly.)

2 Comments

May 30, 2013

POSSE homework: how to contribute to FOSS without coding

One of the assignments for the POSSE workshop is the question of how to contribute to FOSS when you aren’t a coder.

I find this an especially interesting topic because I think there’s a broader political significance to FOSS, but those that see FOSS as merely the domain of esoteric engineers can sometimes be a little freaked out by this idea. It also involves broader theoretical questions about whether or how open source jives with participatory design.

In fact, they have compiled a list of lists of ways to contribute to FOSS without coding: this, this, this, and this are provided in the POSSE syllabus.

Turning our attention from the question in the abstract, we’re meant to think about it in the context of our particular practices.

For our humanitarian FOSS project of choice, how are we interested in contributing? I’m fairly focused in my interests on open source participation these days: I’m very interested in the problem of community metrics and especially how innovation happens and diffuses within these communities. I would like to be able to build a system for evaluating that kind of thing that can be applied broadly to many projects. Ideally, it could do things like identify talented participants across multiple projects, or suggest interventions for making projects work better.

It’s an ambitious research project, but one for which there is plenty of data to investigate from the open source communities themselves.

What about teaching a course on such a thing? I anticipate that my students are likely to be interested in design as well as positioning their own projects within the larger open source ecosystem. Some of the people who I hope will take the class have been working on FuturePress, an open source e-book reading platform. As they grow the project and build the organization around it, they will want to be working with constituent technologies and devising a business model around their work. How can a course on Open Collaboration and Peer Production support that?

These concerns touch on so many issues outside of the consideration of software engineering narrowly (including industrial organization, communication, social network theory…) that it’s daunting to try to fit it all into one syllabus. But we’ve been working on one that has a significant hands-on component as well. Really I think the most valuable skill in the FOSS world is having the chutzpah to approach a digital community, propose what you are thinking, and take the criticism or responsibility that comes with that.

What concrete contribution a student uses to channel that energy should…well, I feel like it should be up to them. But is that enough direction? Maybe I’m not thinking concretely enough for this assignment myself.

2 Comments

Category: open source software