Digifesto


Sample UC Berkeley School of Information Preliminary Exam

I’m in the PhD program at UC Berkeley’s School of Information. Today, I had to turn in my Preliminary Exam, a 24-hour open book, open note examination on the chosen subject areas of my coursework. I got to pick an exam committee of three faculty members, one for each area of speciality. My committee consisted of: Doug Tygar, examining me on Information System Design; John Chuang, the committee chair, examining me on Information Economics and Policy; and Coye Cheshire, examining me on Social Aspects of Information. Each asked me a question corresponding to their domain; generously, they targeted their questions at my interests.

In keeping with my personal policy of keeping my research open, because I learned of the unveiling of @horse_ebooks while taking the exam and couldn’t resist working it into my answers, and because somebody enrolled in or thinking about applying to our PhD program might find it interesting, I’m posting my examination here (with some webifying of links).

At the time of this posting, I don’t yet know if I have passed.

1. Some e-mail spam detectors use statistical machine learning methods to continuously retrain a classifier based on user input (marking messages as spam or ham). These systems have been criticized for being vulnerable to mistraining by a skilled adversary who sends “tricky spam” that causes the classifier to be poisoned. Exam question: Propose tests that can determine how vulnerable a spam detector is to such manipulation. (Please limit your answer to two pages.)

Tests for classifier poisoning vulnerability in statistical spam filtering systems can consist of simulating particular attacks that would exploit these vulnerabilities. Many of these tests are described in Graham-Cumming, “Does Bayesian Poisoning exist?”, 2006 [pdf.gz], including:

  • For classifiers trained on a “natural” training data set D and a modified training data set D’ that has been generated to include more common words in messages labeled as spam, compare the specificity, sensitivity, or, more generally, the ROC plots of the two classifiers. This simulates an attack that aims to increase the false positive rate by making words common in hammy messages be evaluated as spammy.
  • Same as above, but construct D’ to include many spam messages with unique words. This exploits a tendency in some Bayesian spam filters to measure the spamminess of a word by the percentage of spam messages that contain it. If successful, the attack dilutes the classifier’s sensitivity to spam over a variety of nonsense features, allowing more mundane spam to get through the filter as false negatives.

These two tests depend on increasing the number of spam messages in the data set in a way that strategically biases the classifier. This is the most common form of mistraining attack. Interestingly, these attacks assume that users will correctly label the poisoning messages as spam. So these attacks depend on weaknesses in the filter’s feature model and improper calibration to feature frequency.
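As a concrete illustration, here is a minimal sketch of the first test, assuming a bag-of-words naive Bayes filter: scikit-learn’s MultinomialNB and a synthetic vocabulary stand in for a production filter and real mail. It trains one classifier on a clean set D and another on a poisoned set D’ (spam padded with hammy words), then compares the average spam score each assigns to held-out ham:

```python
# Sketch of a poisoning-vulnerability test. All data here is synthetic;
# a real test would use an actual mail corpus and the filter under test.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
ham_words = ["meeting", "lunch", "report", "family", "thanks"]
spam_words = ["viagra", "winner", "free", "casino", "pills"]

def messages(words, n):
    """n messages of eight words drawn from the given vocabulary."""
    return [" ".join(rng.choice(words, size=8)) for _ in range(n)]

ham, spam = messages(ham_words, 500), messages(spam_words, 500)
# Poisoning samples: spam padded with hammy vocabulary, still labeled spam,
# which pushes common ham words toward spamminess in the trained model.
poison = [s + " " + " ".join(rng.choice(ham_words, size=8))
          for s in messages(spam_words, 500)]

def mean_spam_score(train_texts, train_labels, test_ham):
    """Average probability of 'spam' the trained filter assigns to ham."""
    vec = CountVectorizer()
    clf = MultinomialNB().fit(vec.fit_transform(train_texts), train_labels)
    return clf.predict_proba(vec.transform(test_ham))[:, 1].mean()

test_ham = messages(ham_words, 200)
print("spam score of held-out ham, trained on D: ",
      mean_spam_score(ham + spam, [0] * 500 + [1] * 500, test_ham))
print("spam score of held-out ham, trained on D':",
      mean_spam_score(ham + spam + poison, [0] * 500 + [1] * 1000, test_ham))
```

The same harness covers the second test, and the “tricky ham” variant discussed below, by swapping in differently constructed poisoning samples and labels.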

A more devious attack of this kind would depend on deceiving the users of the filtering system into mislabeling spam as ham or, more dramatically, into correctly labeling as ham messages crafted to drive up the hamminess of features normally found in spam.

An example of an attack of this kind (though perhaps not intended as an attack per se) is @Horse_ebooks, a Twitter account that gained popularity while posting randomly chosen bits of prose and, only occasionally, links to purchase low quality self-help ebooks. Allegedly, it was originally a spam bot engaged in a poisoning/evasion attack, but it developed a cult following who appreciated its absurdist poetic style. Its success (which only grew after the account was purchased by New York-based performance artist Jacob Bakkila in 2011) inspired an imitative style of Twitter activity.

Assuming Twitter retrains on this data, this behavior could be seen as a kind of poisoning attack, albeit one waged by the filter’s users against the system itself. Since it may benefit some Twitter users to have an inflated number of “followers” to project an exaggerated image of their own importance, it’s not clear whether it is in users’ interests to assist in spam detection or to sabotage it.

Whatever the interests involved, testing for vulnerability to this “tricky ham” attack can be conducted in the same way as the other attacks: by padding the modified data set D’ with additional samples with abnormal statistical properties (e.g., noisy words and syntax), this time labeled as ham, and comparing the classifiers along the usual performance metrics.

2. Analytical models of cascading behavior in networks, e.g., threshold-based or contagion-based models, are well-suited for analyzing the social dynamics in open collaboration and peer production systems. Discuss.

Cascading behavior models are well-suited to modeling information and innovation diffusion over a network. They are well-suited to analyzing peer production systems to the extent that the dynamics of those systems consist of such diffusion over a non-trivial network. This is the case when production is highly decentralized. Whether we see peer production as centralized or not depends largely on the scale of analysis.

Narrowing in, consider the problem of recruiting new participants to an ongoing collaboration around a particular digital good, such as an open source software product or a free encyclopedia. We should expect the usual cascading models to be informative about the awareness and adoption of the good. But in most cases awareness and adoption are necessary but not sufficient conditions for active participation in production. This is because, for example, contribution may involve incurring additional costs and so be subject to different constraints than merely consuming or spreading the word about a digital good.

Though threshold and contagion models could be adapted to capture some of this reluctance through higher thresholds or lower contagion rates, these models fail to closely capture the dynamics of complex collaboration because they represent the cascading behavior as homogeneous. In many open collaborative projects, contributions (and the individual costs of providing them) are specialized. Recruited participants come equipped with their unique backgrounds. (von Krogh, G., Spaeth, S. & Lakhani, K. R., “Community, joining, and specialization in open source software innovation: a case study,” 2003) So adapting behavior cascade models to this environment would require, at minimum, parameterization of per-node capacities for project contribution. The participants in complex collaboration fulfil ecological niches more than they reflect the dynamics of large networked populations.

Furthermore, at the level of a closely collaborative on-line community, network structure is often trivial. Projects may be centralized around a mailing list, source code repository, or public forum that effectively makes the communication network a large clique of all participants. Cascading behavior models will not help with analysis of these cases.

On the other hand, if we zoom out to look at open collaboration as a decentralized process–say, of all open source software developers, or of distributed joke production on Weird Twitter–then network structure becomes important again, and the effects of diffusion may dominate the internal dynamics of innovation itself. Whether a software developer chooses to code in Python or Ruby, for example, may well depend on whether the share of the developer’s neighbors in a communication network who have already made that choice crosses some threshold. These choices allow for contagious adoption of new libraries and code.
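To make this concrete, here is a toy linear threshold cascade in the spirit of Granovetter-style threshold models; the network, thresholds, and seed set are all hypothetical:

```python
# Toy linear threshold cascade: a developer adopts the new language once
# the fraction of adopting neighbors meets their personal threshold.
import random
import networkx as nx

random.seed(1)
G = nx.erdos_renyi_graph(n=200, p=0.05, seed=1)
threshold = {v: random.uniform(0.1, 0.5) for v in G}   # per-node thresholds
adopted = set(random.sample(list(G), 5))               # early adopters

changed = True
while changed:  # iterate until the cascade reaches a fixed point
    changed = False
    for v in G:
        if v in adopted:
            continue
        nbrs = list(G.neighbors(v))
        if nbrs and sum(n in adopted for n in nbrs) / len(nbrs) >= threshold[v]:
            adopted.add(v)
            changed = True

print(f"{len(adopted)} of {G.number_of_nodes()} developers adopted")
```

Sweeping the threshold distribution or the seed set size shows how adoption either fizzles near the seeds or saturates the network.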

We could imagine a distributed innovation system in which every node maintained its own repository of changes, some of which it developed on its own and others it adapted from its neighbors. Maybe the network of human innovators, each drawing from their experiences and skills while developing new ones in the company of others, is like this. This view highlights the emergent social behavior of open innovation, putting the technical architecture (which may affect network structure but could otherwise be considered exogenous) in the background. (See next exam question).
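A toy version of this node-level repository model (again, every parameter here is hypothetical) might look like this:

```python
# Distributed innovation: each node keeps its own set of "changes",
# sometimes inventing one locally and sometimes adapting one from a neighbor.
import random
import networkx as nx

random.seed(2)
G = nx.watts_strogatz_graph(n=100, k=6, p=0.1, seed=2)
repo = {v: set() for v in G}
next_id = 0  # identifier for the next newly invented change

for step in range(5000):
    v = random.choice(list(G))
    if random.random() < 0.2:   # invent a new change locally
        repo[v].add(next_id)
        next_id += 1
    else:                       # adapt an existing change from a neighbor
        u = random.choice(list(G.neighbors(v)))
        if repo[u]:
            repo[v].add(random.choice(sorted(repo[u])))

sizes = sorted(len(r) for r in repo.values())
print(f"median repository size: {sizes[len(sizes) // 2]}, largest: {sizes[-1]}")
```

Diffusion here is shaped by the network structure, but each node’s repository ends up unique, reflecting the specialization noted above.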

My opinion is that while cascading behavior models may capture important aspects of the dynamics of peer production under decentralized conditions, the basic models will fall short because they don’t consider the interdependence of behaviors. Digital products are often designed for penetration in different networks. For example, the choice of programming language in which to implement one’s project influences its potential for early adoption and recruitment. Analytic modeling of these diffusion patterns with cascade models could gain from augmenting the model with representations of technical dependency.

3. Online communities present many challenges for governance and collective behavior, especially in common pool and peer-production contexts. Discuss the relative importance and role of both (1) site architectures and (2) emergent social behaviors in online common pool and/or peer-production contexts. Your answer should draw from more than one real-world example and make specific note of key theoretical perspectives to inform your response. Your response should take approximately 2 pages.

This question requires some unpacking. The sociotechnical systems we are discussing are composed of both technical architecture (often accessed as a web site, i.e. a “location” accessed through HTTP via a web browser) and human agents interacting socially with each other in a way mediated by the architecture (though not exclusively; cf. Coleman’s work on in-person meetings in hacker communities). If technology is “a man-made means to an end” (Heidegger, The Question Concerning Technology), then we can ask of the technical architecture: which man, whose end? So questioning the roles of on-line architecture and emergent behaviors brings us to look at how the technology itself was the result of the emergent social behavior of its architects. For we can consider “importance” from either the perspective of the users or that of the architects. These perspectives reflect different interests and so will have different standards for evaluating the importance of the system’s components. (cf. Habermas, Knowledge and Human Interests)

Let us consider socio-technical systems along a spectrum between two extremes. At one extreme are certain prominent systems–e.g. Yelp and Amazon cultivating common pools of reviews–for which the architects and the users are distinct. The site architecture is a means to the ends of the architects, effected through the stimulation of user activity.

[Diagram: architects acting on users through technology]

Drawing on Winner (“Do Artifacts Have Politics?”), we can see that this socio-technical arrangement establishes a particular pattern of power and authority. Architects have direct control over the technology, which enables user activity within the limits of its affordances. Users can influence architects through the information their activity generates (often collected through the medium of the technical architecture itself), but have no direct coercive control. Rather, architects design the technology to motivate certain desirable activity using inter-user feedback mechanisms such as ways of expressing gratitude or comparing one’s performance with others. (see Cheshire and Antin, “The Social Psychological Effects of Feedback on the Production of Internet Information Pools,” 2008) In such a system, users can only gain control of their technical environment by exploiting vulnerabilities in the architecture, in adversarial moves that look a bit like security breaches. (See the first exam question for an example of user-driven information sabotage.) More likely, the vast majority of users will choose to free ride on any common pool resources made available and exit the system when inconvenienced, as the environment is ultimately a transactional one of service provider and consumer.

In these circumstances, it is only by design that social behaviors lead to peer production and common pools of resources. Technology, as an expression of the interests of the architects, plays a more important role than social emergence. To clarify the point, I’d argue that Facebook, despite hosting enormous amounts of social activity, does not enable significant peer production because its main design goals are to drive the creation of proprietary user data and ad clicks. Twitter, in contrast, has from the beginning been designed as a more open platform. The information shared on it is often less personal, so activity more easily crosses the boundary from private to public, enabling collective action. (see Bimber et al., “Reconceptualizing Collective Action in the Contemporary Media Environment,” 2005) It has facilitated (with varying consistency) the creation of third party clients, as well as applications that interact with its data but can be hosted as separate sites.

This open architecture is necessary but not sufficient for emergent common pool behavior. But the design for open possibilities is significant. It enables the development of novel, intersecting architectures to support the creation of new common pools. Taking Weird Twitter, framed as a peer production community for high quality tweets, as an example, we can see how the service Favstar (which aggregates and ranks tweets that have been highly “starred” and retweeted, and awards congratulatory tweets as prizes) provides historical reminders and relative rankings of tweet quality, thereby facilitating a culture of production. Once formed, such a culture can spread and make use of other available architecture as well. Weird Twitter has inspired Twitter: The Comic, a Tumblr account illustrating “the greatest tweets of our generation.”

Consider another extreme case, the free software community that Kelty identifies as the recursive public. (Two Bits: The Cultural Significance of Free Software) In an idealized model, we could say that in this socio-technical system the architects and the users are the same.

[Diagram: the recursive public, in which architects and users coincide]

The artifacts of the recursive public have a different politics than those at the other end of our spectrum, because the coercive aspects of the architectural design are the consequences of the emergent social behavior of those affected by it. Consequently, technology created in this way is rarely restrictive of productive potential, but on the contrary is designed to further empower the collaborative communities that produced it. The history of Unix, Mozilla, Emacs, version control systems, issue tracking software, Wikimedia, and the rest can be read as the historical unfolding of the human interest in an alternative, emancipated form of production. Here, the emergent social behavior claims its importance over and above the particulars of the technology itself.

Reinventing wheels with Dissertron

I’ve found a vehicle for working on the Dissertron through the website of the course I’ll be co-teaching this Fall on Open Collaboration and Peer Production.

In the end, I went with Pelican, not Hyde. It was a difficult decision (because I did like the idea of supporting an open source project coming out of Chennai, especially after reading Coding Places). But I had it on good authority that Pelican was more featureful, with cleaner code. So here I am.

The features I have in mind are crystallizing as I explore the landscape of existing tools more. This is my new list:

  • Automatically include academic metadata on each Dissertron page so it’s easy to slurp it into Zotero (see the sketch after this list).
  • Include the Hypothes.is widget for annotations. I think Hypothes.is will be better for commenting than Disqus because it does annotations in-line, as opposed to comments in the footer. It also uses the emerging W3C Open Annotation standard. I’d like this to be as standards-based as possible.
  • Use citeproc-js to render citations in the browser cleanly. I think this handles the issue of in-line linked academic citations without requiring a lot of manual work. citeproc-js looks like it has come out of the Zotero project as well. Since Elsevier bought Mendeley, Zotero seems like the more reliable ally to pick for independent scholarship.
  • Trickiest is going to be porting a lot of features from jekyll-scholar into a Pelican plug-in. I really want jekyll-scholar’s bibliographic management. But I’m a little worried that Pelican isn’t well-designed for that sort of flexibility in theming. More soon.
  • I’m interested in trying to get the HTML output of Dissertron as close as possible to emerging de facto standards on what on-line scholarship is like. I’ve asked about what PLOS ONE does about this. The answer sounds way complicated: a tool chain that goes from LaTeX to Word docs to NLM 3.0 XML (which I didn’t even know was a thing), and at last into HTML. I’m trying to start from Markdown because I think it’s a simple markup language for the future, but I’m not deep enough in that tool chain to understand how to replicate its idiosyncrasies.
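For the first item on that list, a sketch of what the embedded metadata could look like: Zotero’s embedded-metadata translator reads Highwire Press citation_* meta tags, so a Dissertron template could emit something like the following. The article dict here is a hypothetical stand-in, not Pelican’s actual API:

```python
# Emit Highwire Press citation_* meta tags for a page's <head>, which
# Zotero can slurp in. The article dict shape here is hypothetical.
from html import escape

def citation_meta_tags(article):
    tags = [("citation_title", article["title"]),
            ("citation_publication_date", article["date"])]
    tags += [("citation_author", name) for name in article["authors"]]
    return "\n".join(f'<meta name="{name}" content="{escape(value)}">'
                     for name, value in tags)

print(citation_meta_tags({
    "title": "Reinventing wheels with Dissertron",
    "date": "2013/08/12",
    "authors": ["Jane Scholar"],
}))
```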

If I could have all these nice things, and maybe a pony, then I would be happy and have no more excuses for not actually doing research, as opposed to obsessing about the tooling around independent publishing.

Another rant about academia and open source

A few weeks ago I went to a great talk by Victoria Stodden about how there’s a crisis of confidence in scientific research that depends on heavy computing. Long story short, because the data and code aren’t openly available, the results aren’t reproducible. That means there’s no check on prior research, and bad results can slip through and be the foundation for future work. This is bad.

Stodden’s solution was to push forward within the scientific community and possibly in legislation (i.e., as a requirement on state-funded research) for open data and code in research. Right on!

Then, something intriguing: somebody in the audience asked how this relates to open source development. Stodden, who just couldn’t stop saying amazing things that needed to be said that day, answered by saying that scientists have a lot to learn from the “open source world”, because they know how to build strong communities around their (open) work.

Looking around the room at this point, I saw several scientists toying with their laptops. I don’t think they were listening.

It’s a difficult thing coming from an open source background and entering academia, because the norms are close, but off.

The other day I wrote to an informal departmental mailing list with criticism and questions about a theorist with a lot of influence in the department, Bruno Latour. There were a lot of reactions to that thread, ranging pretty much all across the board, but one of the surprising ones was along the lines of “I’m not going to do your work for you by answering your question about Latour.” In other words, RTFM. Except, in this case, “the manual” was a book or two of dense academic literature in a field that I was just beginning to dip into.

I don’t want to make too much of this response, since there were a lot of extenuating circumstances, but it did strike me as an indication of one of the cultural divides between open source development and academic scholarship. In the former, you want as many people as possible to understand and use your cool new thing, because that enriches your community and makes you feel better about your contribution to the world. For some kinds of scholars, being the only one who understands a thing is a kind of distinction that gives you pride and job opportunities, so you don’t really want other people to know as much as you about it.

Similarly for computationally heavy sciences: if you think your job is to get grants to fund your research, you don’t really want anybody picking through it and telling you your methodology was busted. In an Internet Security course this semester, I’ve had the pleasure of reading John McHugh’s Testing Intrusion Detection Systems: A Critique of the 1998 and 1999 DARPA Off-line Intrusion Detection System Evaluation as Performed by Lincoln Laboratory. In this incredible paper, McHugh explains why a particular DARPA-funded Lincoln Labs Intrusion Detection research paper is BS, scientifically speaking.

In open source development, we would call McHugh’s paper a bug report. We would say, “McHugh is a great user of our research because he went through and tested for all these bugs, and even has recommendations about how to fix them. This is fantastic! The next release is going to be great.”

In the world of security research, Lincoln Labs complained to the publisher and got the article pulled.

Ok, so security research is a new field with a lot of tough phenomena to deal with and not a ton of time to read up on 300 years of epistemology, philosophy of science, statistical learning theory, or each other’s methodological critiques. I’m not faulting the research community at all. However, it does show some of the trouble that happens in a field that is born out of industry and military funding concerns without the pretensions or emphasis on reproducible truth-discovery that you get in, say, physics.

All of this, it so happens, is what Lyotard describes in his monograph, The Postmodern Condition (1979). Lyotard argues that because of cybernetics and information technologies, because of Wittgenstein, because of the “collapse of metanarratives” that would make anybody believe in anything silly like “truth”, there’s nothing left to legitimize knowledge except Winning.

You can win in two ways: you can research something that helps somebody beat somebody else up or consume more, so that they give you funding. Or you can win by not losing, by pulling some wild theoretical stunt that puts you out of range of everybody else so that they can’t come after you. You become good at critiquing things in ways that sound smart, and tell people who disagree with you that they haven’t read your canon. You hope that if they call your bluff and read it, they will be so converted by the experience that they will leave you alone.

Some, but certainly not all, of academia seems like this. You can still find people around who believe in epistemic standards: rational deduction, dialectical critique resolving to a consensus, sound statistical induction. Often people will see these as just a kind of meta-methodology in service to a purely pragmatic ideal of something that works well or looks pretty or makes you think in a new way, but that in itself isn’t so bad. Not everybody should be anal about methodology.

But these standards are in tension with the day to day of things, because almost nobody really believes that they are after true ideas any more. It’s so easy to be cynical or territorial.

What seems to be missing is a sense of common purpose in academic work. Maybe it’s the publication incentive structure, maybe it’s because academia is an ideological proxy for class or sex warfare, maybe it’s because of a lot of big egos, maybe it’s the collapse of meta-narratives.

In FOSS development, there’s a secret ethic that’s not particularly well articulated by either the Free Software Movement or the Open Source Initiative, but which I believe is shared by a lot of developers. It goes something like this:

I’m going to try to build a totally great new thing. It’s going to be a lot of work, but it will be worth it because it’s going to be so useful and cool. Gosh, it would be helpful if other people worked on it with me, because this is a lonely pursuit and having others work with me will help me know I’m not chasing after a windmill. If somebody wants to work on it with me, I’m going to try hard to give them what they need to work on it. But hell, even if somebody tells me they used it and found six problems in it, that’s motivating; that gives me something to strive for. It means I have (or had) a user. Users are awesome; they make my heart swell with pride. Also, bonus, having lots of users means people want to pay me for services or hire me or let me give talks. But it’s not like I’m trying to keep others out of this game, because there is just so much that I wish we could build and not enough time! Come on! Let’s build the future together!

I think this is the sort of ethic that leads to the kind of community building that Stodden was talking about. It requires a leap of faith: that your generosity will pay off and that the world won’t run out of problems to be solved. It requires self-confidence because you have to believe that you have something (even something small) to offer that will make you a respected part of an open community without walls to shelter you from criticism. But this ethic is the relentlessly spreading meme of the 21st century and it’s probably going to be victorious by the start of the 22nd. So if we want our academic work to have staying power we better get on this wagon early so we can benefit from the centrality effects in the growing openly collaborative academic network.

I heard David Weinberger give a talk last year on his new book Too Big to Know, in which he argued that “the next Darwin” was going to be actively involved in social media as a research methodology. Tracing their research notes will involve an examination of their inbox and Facebook feed to see what conversations were happening, because just so much knowledge transfer is happening socially and digitally, and it’s faster and more contextual than somebody spending a weekend alone reading books in a library. He’s right, except maybe for one thing, which is that this digital dialectic (or pluralectic) implies that “the next Darwin” isn’t just one dude, Darwin, with his own ‘-ism’ and pernicious Social Darwinist adherents. Rather, it means that the next great theory of the origin of species is going to be built by a massive collaborative effort in which lots of people will take an active part. The historical record will show their contributions not just with the clumsy granularity of conference publications and citations, but with the minute granularity of thousands of traced conversations. The theory itself will probably be too complicated for any one person to understand, but that’s OK, because it will be well architected and there will be plenty of domain experts to go to if anyone has problems with any particular part of it. And it will be growing all the time and maybe competing with a few other theories. For a while people might have to dual boot their brains until somebody figures out how to virtualize Foucauldian Quantum Mechanics on an Organic Data Splicing ideological platform, but one day some crazy scholar-hacker will find a way.

“Cool!” they will say, throwing a few bucks towards the Kickstarter project for a musical instrument that plays to the tune of the uncollapsed probabilistic power dynamics playing out between our collated heartbeats.

Does that future sound good? Good. Because it’s already starting. It’s just an evolution of the way things have always been, and I’m pretty sure, based on what I’ve been hearing, that it’s a way of doing things that’s picking up steam. It’s just not “normal” yet. Generation gap, maybe. That’s cool. At the rate things are changing, it will be here before you know it.