Digifesto

Reflections on the Berkeley Institute for Data Science (BIDS) Launch

Last week was the launch of the Berkeley Institute for Data Science.

Whatever might actually happen as a result of the launch, what was said at the launch was epic.

Vice Chancellor for Research Graham Fleming introduced Chancellor Nicholas Dirks for the welcoming remarks. Dirks is UC Berkeley’s 10th Chancellor. He succeeded Robert Birgeneau, who resigned gracefully shortly after coming under heavy criticism for his handling of Occupy Cal, the Berkeley campus’ chapter of the Occupy movement. Birgeneau was distinctly unsympathetic to the protesters, and a widely circulated petition declared a lack of confidence in his leadership. Birgeneau is a physicist. Dirks is an anthropologist who has championed postcolonial approaches. Within the politics of the university, which are a microcosm of politics at large, this signaling is clear. Dirks’ appointment was meant to satisfy the left-wing protesters, most of whom have been trained in the softer social sciences themselves. Critical reflection on power dynamics and engagement in activism–which is often associated with leftist politics–are, at least formally, accepted by the university administration as legitimate. Birgeneau would subsequently receive awards for his leadership in drawing more women into the sciences and aiding undocumented students.

Dirks’ welcoming remarks were about the great accomplishments of UC Berkeley as a research institution and the vague but extraordinary potential of BIDS. He is grateful, as we all are, for the funding from the Moore and Sloan foundations. I found his remarks unspecific, and I couldn’t help but wonder what his true thoughts about data science in the university were. Surely he must have an opinion. As an anthropologist, can he consistently believe that data science, especially in the social sciences, is the future?

Vicki Chandler, Chief Program Officer from the Moore Foundation, was more lively. Pulling no punches, she explained that the purpose of BIDS is to shake up scientific culture. Having hung out in Berkeley in the ’60s and attended it as an undergraduate in the ’70s, she believes we are up for it. She spoke again and again of “revolution”. There is ambiguity in this. In my experience, faculty are divided on whether they see the proposed “open science” changes as imminent or hype, as desirable or dangerous. More and more I see faculty acknowledge that we are witnessing the collapse of the ivory tower. It is possible that the BIDS launch is a tipping point. What next? “Let the fun begin!” concluded Chandler.

Saul Perlmutter, Nobel laureate physicist and front man of the BIDS co-PI super group, gave his now practiced and condensed pitch for the new Institute. He hit all the high points, pointing out not only the potential of data science but the importance of changing the institutions themselves. Rethinking the peer-reviewed journal from scratch, he said, we should focus more on code reuse. Software can be a valid research output. As popular as open science is among the new generation of scientists, this is a bold statement for somebody with such credibility within the university. He even said that the success of open source software is what gives us hope for the revolutionary new kind of science BIDS is beginning. Two years ago, this was a fringe idea. Perlmutter may have just made it mainstream.

Notably, he also engaged with the touchy academic politics, saying that data science could bring diversity to the sciences (though he was unspecific about the mechanism for this). He expounded on the important role of ethnography in evaluating the Institute, identifying the bottlenecks that keep it from unlocking its potential.

The man has won at physics and is undoubtedly a scientist par excellence. Perhaps Perlmutter sees the next part of his legacy as bringing the university system into the 21st century.

David Culler, Chair of the Electrical Engineering and Computer Science department, then introduced a number of academic scientists, each with impressive demonstrations about how data science could be applied to important problems like climate change and disaster reduction. Much of this research depends on using the proliferation of hand-held mobile devices as sensors. University science, I realized while watching this, is at its best when doing basic research about how to save humanity from nature or ourselves.

But for me the most interesting speakers in the first half of the launch were the luminaries Peter Norvig and Tim O’Reilly, each a giant in his own right and a welcome guest at the university.

Culler introduced Norvig, Director of Research at Google, by crediting him as one of the inventors of the MOOC. I know his name mainly as a co-author of “Artificial Intelligence: A Modern Approach,” which I learned and taught from as an undergraduate. Amazingly, Norvig’s main message was about the economics of the digital economy. Marginal production is cheap, communication is cheap, and this leads to an accumulation of wealth. Fifty percent of jobs are predicted to be automated away in the coming decades. He is worried about the 99%–freely using Occupy rhetoric. What will become of them? Norvig’s solution, perhaps stated tongue in cheek, is that everyone needs to become a data scientist. More concretely, he has high hopes for hybrid teams of people and machines, and he believes all professions will become like this. By defining what academic data science looks like and training the next generation of researchers, BIDS will have a role in steering the balance of power between humanity and the machines–and the elite few who own them.

His remarks hit home. He touched on anxieties that are as old as the Industrial Revolution: is somebody getting immensely rich off of these transformations, but not me? What will my role be in this transformed reality? Will I find work? These are real problems and Norvig was brave to bring them up. The academics in the room were not immune from these anxieties either, as they watch the ivory tower crumble around them. This would come up again later in the day.

I admire him for bringing up the point, and I believe he is sincere; I’d heard him make the same points on a panel with Neal Stephenson and Jaron Lanier a month or so earlier. Still, I can’t help but be critical of Norvig’s remarks. Is he covering his back? Many university professors see MOOCs as threatening to their own careers. It is encouraging that he sees the importance of hybrid human/machine teams. But if the machines are built on Google infrastructure, doesn’t this contribute to the same inequality he laments, shifting power away from teachers and toward the 1% at Google? Or does he foresee a MOOC-based educational boom?

He did not raise the possibility that human/machine hybridity is already the status quo–that, for example, all information workers tap away at these machines and communicate with each other through a vast technical network. If he had acknowledged that we are all cyborgs already, he would have had to admit that hybrid teams of humans and machines are as much the cause of economic inequality as its solution. Indeed, this relationship between human labor and mechanical capital is precisely the one that created economic inequality in the Industrial Revolution. When the capital is privately owned, systems of hybrid human/machine productivity favor the owner of the machines.

I have high hopes that BIDS will address Norvig’s political concern through its research. It is certainly on the minds of some of its co-PIs, as later discussion would show. But to address the problem seriously, it will have to examine it rigorously, in a way that doesn’t shy away from criticism of the status quo.

The next speaker, Tim O’Reilly, is a figure who fascinates me. Culler introduced him as a “God of the Open Source Field,” which is poetically accurate. Before coming to academia, I worked on Web 2.0 open source software platforms for open government. My career was defined by a string of terms invented and popularized by O’Reilly, and to a large extent I’m still a devotee of his ideas. But as a practitioner and researcher, I’ve developed a nuanced view of the field that I’ve tried to convey in the course on Open Collaboration and Peer Production I’ve co-instructed with Thomas Maillart this semester.

O’Reilly came under criticism earlier this year from Evgeny Morozov, who attacked him for marketing politically unctuous ideas while claiming to be revolutionary. Morozov focuses on O’Reilly’s promotion of ‘open source’ over and against Richard Stallman’s explicitly ethical and therefore contentious term ‘free software’. Morozov accuses O’Reilly of what Tom Scocca has recently defined as rhetorical smarm–dodging specific criticism by denying the appropriateness of criticism in general. O’Reilly has disputed the Morozov piece. Elsewhere he has presented his strategy as that of a ‘marketer of big ideas’ and described his deliberate promotion of the more business-friendly ‘open source’ rhetoric. This ideological debate is itself quite interesting. Geek anthropologist Chris Kelty observes that it is participation in this debate, more so than adherence to any particular view within it, that characterizes the larger “movement,” which he names the recursive public.

Given my open source software background, O’Reilly is a significant figure to me, so I was surprised when I heard he would be speaking at the BIDS launch. He had promoted ‘open source’ and ‘Web 2.0’ and ‘open government’, but what did any of that have to do with ‘data science’?

So I was amused when Norvig introduced O’Reilly by saying that he hadn’t known he himself was a data scientist until O’Reilly wrote an article in Forbes (in November 2011) naming him one of “The World’s 7 Most Powerful Data Scientists.” Looking at the Google Trends data, we can see that November 2011 just about marks the rise of ‘data science’ from obscurity to popularity. Is Tim O’Reilly responsible for the rise of ‘data science’?
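For what it’s worth, the Trends series can be pulled programmatically. Here is a minimal sketch using the third-party pytrends package, which is my assumption for illustration (it is an unofficial Google Trends client, not anything cited at the launch):

    # pip install pytrends  (unofficial, third-party Google Trends client)
    from pytrends.request import TrendReq

    pytrends = TrendReq()
    pytrends.build_payload(["data science"], timeframe="all")
    trend = pytrends.interest_over_time()

    print(trend["data science"].idxmax())  # date of peak interest
    print(trend.loc["2011-09":"2012-03"])  # the window around the Forbes piece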

Perhaps. O’Reilly explained that he got into data science by thinking about the end game for open source. As open source software becomes commodified (which for him, I think, means something like ‘subject to competitive market pressure’), what becomes valuable is the data. And so he has been promoting data science in industry and government, and he believes that the university can learn important lessons from those fields as well. He held up his Moto X phone and explained how it is ‘always listening’ and so can facilitate services like Google Now. All this would go towards a system with greater collective intelligence, a self-regulating system that would make regulators obsolete.

Looking at the progression of the use of maps, from paper to digital to being embedded in services and products like self-driving cars, O’Reilly agrees with Norvig about the importance of human-machine interaction. In particular, he believes that data scientists will need to know how to ask the right questions about data, and that this is the future of science. “Others will be left behind,” he said, not intending to sound foreboding.

I thought O’Reilly presented the combination of insight and boosterism I expected. To me, his presence at the BIDS launch meant that O’Reilly’s significance as a public intellectual has progressed from business through governance and now to scientific thinking itself. This is wonderful for him, but it means that his writings and influence should be put under the scrutiny we would apply to an academic peer. It is appropriate to call him out for glossing over the privacy issues around a mobile phone that is “always listening,” or the moral implications, for equality and justice, of the obsolescence of regulators. Is his objectivity compromised by the fact that he runs a publishing company that sells complementary goods to the vast supply of publicly available software and data? Does his business agenda incentivize him to obscure the subtle differences between various segments of his market? Are we in the university victims of that obscurity as we grapple with multiple conflated meanings of “openness” in software and science (open to scrutiny and accountability, vs. open for appropriation by business, vs. open to meritocratic contribution)? As we ask these questions, we can be grateful to O’Reilly for getting us this far.

I’ve emphasized the talks given by Norvig and O’Reilly because they exposed what I think are some of the most interesting aspects of BIDS. One way or another, it will be revolutionary. Its funders will be very disappointed if it is not. But exactly how it is revolutionary is undetermined. The fact that BIDS is based in Berkeley, and not in Google or Microsoft or Stanford, guarantees that the revolution will not be an insipid or smarmy one which brushes aside political conflict or morality. Rather, it promises to be the site of fecund political conflict. “Let the fun begin!” said Chandler.

The opening remarks concluded and we broke for lunch and poster sessions–the Data Science Faire (named after O’Reilly’s Maker Faire).

What followed was a fascinating panel discussion led by astrophysicist Josh Bloom, joined by historian and university administrator Cathryn Carson, computer science professor and AMP Lab director Michael Franklin, and Deb Agrawal, a staff computer scientist at Lawrence Berkeley National Lab.

Bloom introduced the discussion jokingly as “just being among us scientists…and whoever is watching out there on the Internet,” perhaps nodding to the fact that the scientific community is not yet fully conscious that their expectations of privileged communication are being challenged by a world and culture of mobile devices that are “always listening.”

The conversation was about the role of people in data science.

Carson spoke as a domain scientist–a social scientist who studies scientists. Noting that social scientists tend to work in small teams led by graduate students motivated by their particular questions, she said her emphasis was on the people asking questions. Agrawal noted that the number of people needed to analyze a data set scales not with the size of the data but with its complexity–a practical point. (I’d argue that theoretically we might want to measure the “size” of data by its compressibility, which would reflect its complexity; see the sketch below. This ignores a number of operational challenges.) For Franklin, people are a computational resource that can be part of a crowd-sourced process. In that context, the number of people needed does indeed scale with the use of people as data processors and sensors.
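To make the compressibility point concrete, here is a minimal sketch (the data is made up for illustration) using Python’s zlib: two byte strings of identical raw size can differ enormously in compressed size, and it is the incompressible one that would demand more analytic attention.

    import random
    import zlib

    random.seed(0)

    # Two "data sets" of identical raw size: one highly redundant, one noisy.
    redundant = b"temperature,22.5,OK;" * 5000
    noisy = bytes(random.randrange(256) for _ in range(len(redundant)))

    def compressed_size(data: bytes) -> int:
        """Size after DEFLATE compression: a crude proxy for complexity."""
        return len(zlib.compress(data, 9))

    print(len(redundant), compressed_size(redundant))  # 100000 raw, tiny compressed
    print(len(noisy), compressed_size(noisy))          # 100000 raw, barely shrinks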

Perhaps to follow through on Norvig’s line of reasoning, Bloom then asked pointedly if machines would ever be able to do the asking of questions better than human beings. In effect: Would data science make data scientists obsolete?

Nobody wanted to be the first to answer this question. Bloom had to repeat it.

Agrawal took a first stab at it. The science does not come from the data; the scientist chooses models and tests them. This is the work of people. Franklin agreed and elaborated–the wrong data too early can ruin the science. Agrawal noted that computers might find spurious signals in the noise.

Personally, I find these unconvincing answers to Bloom’s question. Algorithms can generate, compare, and test alternative models against the evidence. Noise can, with enough data, be filtered away from the signal. Doing so pushes the theoretical limits of computing and information theory, but if Franklin is correct in his earlier point that people are part of the computational process, then there is no reason in principle why these tasks too might not be performed, or at least assisted, by computers.

Carson, who had been holding back her answer to listen to the others, had a bolder proposal: rather than try to predict the future of science, why not focus on the task of building that future?

In another universe, at that moment someone might have asked the one question no computer could have answered. “If we are building the new future of science, what should we build? What should it look like? And how do we get there?” But this is the sort of question disciplined scientists are trained not to ask.

Instead, Bloom brought things back to practicality: we need to predict where science will go in order to know how to educate the next generation of scientists. Should we be focusing on teaching them domain knowledge, or on techniques?

We have at the heart of BIDS the very fundamental problem of free will. Bloom suggests that if we can predict the future, then we can train students in anticipation of it. He is an astrophysicist and studies stars; he can be forgiven for the assumption that bodies travel in robust orbits. This environment is a more complex one. How we choose to train students now will undoubtedly affect how science evolves, as the process of science is at once the process of learning and training new scientists. His descriptive question then falls back on the normative one: what science are we trying to build toward?

Carson was less heavy-handed than I would have been in her position. Instead, she asked Bloom how he got interested in data science. Bloom recalled his classical physics training, and the moment he discovered that to answer the kinds of questions he was asking, he would need new methods.

Franklin chimed in on the subject of education. He has heard it said that everyone in the next generation should learn to code. With marked humility for his discipline, he said he did not agree with this. But he said he did believe that everyone in the next generation should learn data literacy, echoing Norvig.

Bloom opened the discussion to questions from the audience.

The first was about the career paths for methodologists who write software instead of papers. How would BIDS serve them? It was a softball question which the panel hit out of the park. Bloom noted that the Moore and Sloan funders explicitly asked for the development of alternative metrics to measure the impact of methodologists’ contributions. Carson said that even with the development of metrics, as an administrator she knew it would be a long march through the institution to get those metrics recognized. There was much work to be done. “Universities got to change,” she rallied. “If we don’t change, Berkeley’s being great in the past won’t make it great in the future,” referring perhaps to the impressive history of research recounted by Chancellor Dirks. There was applause. Franklin pointed out that the open source community has its own metrics already. In some circles, some of his students are more famous than he is for developing widely used software. Investors are often asking him when his students will graduate. The future, it seems, is bright for methodologists.

At this point I lost my Internet connection and had to stop livetweeting the panel; those tweets are the notes from which I am writing these reflections. Recalling from memory, there was one more question from Kristina Kangas, a PhD student in Integrative Biology. She cited research about how researchers interpreting data wind up reflecting back their own biases. What did this mean for data science?

Bloom gave Carson the last word. It is a social scientific fact, she said, that scientists interpret data in ways that fit their own views. So it’s possible that there is no such thing as “data literacy”. These are open questions that will need to be settled by debate. Indeed, what then is data science after all? Turning to Bloom, she said, “I told you I would be making trouble.”

Reflexive data science

In anticipation of my dissertation research and in an attempt to start a conversation within the emerging data science community at Berkeley, I’m working on a series of blog posts about reflexive data science. I will update this post with an index of them and related pieces as they are published over time.

“Reflexive data science: an overview”, UC Berkeley D-Lab Blog.
Explaining how the stated goals of the Berkeley Institute for Data Science–open source, open science, alt-metrics, and empirical evaluation–imply the possibility of an iterative, scientific approach to incentivizing scientists.

notes on innovation in journalism

I’ve spent the better part of the past week thinking hard about journalism, due largely to two projects: further investigation into Weird Twitter, and consulting work I’ve been doing with the Center for Investigative Reporting. Journalism, the trope goes, is a presently disrupted industry. It’s fair to say it’s a growing research interest for me. So here’s the rundown on where things seem to be at.

Probably the most rewarding thing to come out of the fundamentally pointless task of studying Weird Twitter, besides hilarity, is getting a better sense of the digital journalism community. I’ve owed Ethnography Matters a part 2 for a while, and it seems like the meatiest bone to pick is still on the subject of attention economy. The @horse_ebooks/Buzzfeed connection drives that nail in deeper.

I find content farming pretty depressing and only got more depressed reading Dylan Love’s review of MobileWorks that he crowdsourced to crowdworkers using MobileWorks. I mean, can you think of a more dystopian world than one in which the press is dominated by mercenary crowdworkers pulling together plausible-sounding articles out of nowhere for the highest bidder?

I was feeling like the world was going to hell until somebody told me about Oximity, which is a citizen journalism platform, as opposed to a viral advertising platform. Naturally, this has a different flavor to it, though it is less monetized/usable/populated. Hmm.

I spend too much time on the Internet. That was obvious when attending CIR’s Dissection:Impact events on Wednesday and Thursday. CIR is a foundation-funded non-profit that actually goes and investigates things like prisons, migrant farm workers, and rehab clinics. The people there really turned my view of things around, as I realized that there are still people out there dedicated to using journalism to do good in the world.

There were three interesting presentations with divergent themes.

One was a presentation of ConText, a natural language and network processing toolkit for analyzing the discussion around media, led by Jana Diesner at the iSchool at Urbana-Champaign. Her dissertation work was on covert network analysis to detect white collar criminals. They have a thoroughly researched impact model, and while the software is currently unusable by ordinary humans, it combines best practices in text and network analysis. They intend to release it open source, as an academic tool for researchers.

Another was a presentation by the Harmony Institute, which has high-profile clients like MTV. Their lead designer walked us through a series of compelling mockups of ImpactSpace, an impact analysis tool that shows the discussion around an issue as “constellations” moving through different “solar systems” of ideas. Their project promises to identify how one can frame a story to target swing viewers. But they were not specific about how they would get and process the data. They intend to make demos of their service available on-line and market it as a product.

The third presentation was by CIR itself, which has hired a political science post-doc to come up with an analysis framework. They focused on a story, “Rape in the Fields”, about sexual abuse of migrant farm workers. These people tend not to be on Twitter, but the story was a huge success on Univision. Drawing mainly on qualitative data, the framework considers “micro”, “meso”, and “macro” impact. Micro interactions might be eager calls to the original journalist for more information, or powerful anecdotes of how somebody who had been hurt felt healed when they were able to tell their story to the world.

Each team has its disciplinary bias and its own strengths and weaknesses. But they are tackling the same problem: trying to evaluate the effectiveness of media. They know that data is powerful: CIR uses it all the time to find stories. They will sift through a large data set, look for anomalies, and then carefully investigate. But even when collaborative science, including “data science” components, is effectively used to do external-facing research, the story gets more difficult, intellectually and politically, when that kind of thinking is turned reflexively on the organization itself.

I think this story sounds a lot like the story of what’s happening in Berkeley. A disrupted research organization struggles to understand its role in a changing world under pressure to adapt to data that seems both ubiquitous and impoverished.
Does this make you buy into the connection between universities and journalism?

If it does, then I can tell you another story about how software ties in. If not, then I’ve got deeper problems.

There is an operational tie: D-Lab and CIR have been in conversation about how to join forces. With the dissolution of disciplines, investigative reporting is looking more and more like social science. But it’s the journalists who are masters of distribution and engagement. What can we learn about the impact of social science research from journalists? And how might the two be better operationally linked?

The New School sent some folks to the Dissection event to talk about the Open Journalism program they are starting soon.

I asked somebody at CIR what he thought about Buzzfeed. He explained that it’s the same business model as HuffPo–funding real journalism with the revenue from the crappy clickbait. I hope that’s true. I wonder if they would suffer as a business if they only put out clickbait. Is good journalism anything other than clickbait for the narrow segment of the population that has expensive taste in news?

The most interesting conversation I had was with Mike Corey at CIR, who explained that there are always lots of great stories, but the problem is that newspapers don’t have space to run them all; they are an information bottleneck. I found this striking because I don’t get my media from newspapers any more, and it revealed that the shifting of the journalism ecosystem is still underway. Thinking this through…

In the old model, a newspaper (or radio show, or TV show) had a limited budget to distribute information, and so competed for prestige with creativity and curational prowess. Naturally they targeted different audiences, but there was more at stake in deciding what to and what not to report. (The unintentional past tense here just goes to show where I am in time, I guess.)

With web publishing, everybody can blog or tweet. What’s newsworthy is what gets sifted through and picked up. Moreover, this can be done experimentally on a larger scale than…ah, interesting. Ok, so individual reporters wind up building a social media presence that is effectively a mini-newspaper and…oh dear.

One of the interesting phrases that came out of the discussion at the Dissection event was “self-commodification”–the tendency of journalists to need to brand themselves as products, artists, performers. Watching journalists on Twitter is striking partly because of how these constraints affect their behavior.

Putting it another way: what if newspapers had unlimited paper on which to print things? How would they decide to sort and distribute information? This is effectively what Gawker, Buzzfeed, Techcrunch, and the rest of the web press are up to. Hell, it’s what the Wall Street Journal is up to, as older, more prestigious brands are pressured to compete. This causes the much lamented decline in the quality of journalism.

Ok, ok, so what does any of this mean? For society, for business. What is the equilibrium state?

Sample UC Berkeley School of Information Preliminary Exam

I’m in the PhD program at UC Berkeley’s School of Information. Today I had to turn in my Preliminary Exam, a 24-hour open-book, open-note examination on the chosen subject areas of my coursework. I got to pick an exam committee of three faculty members, one for each area of specialty. My committee consisted of: Doug Tygar, examining me on Information System Design; John Chuang, the committee chair, examining me on Information Economics and Policy; and Coye Cheshire, examining me on Social Aspects of Information. Each asked me a question corresponding to their domain; generously, they targeted their questions at my interests.

In keeping with my personal policy of keeping my research open, because I learned of the unveiling of @horse_ebooks while taking the exam and couldn’t resist working it into my answers, and because somebody enrolled in or thinking about applying to our PhD program might find it interesting, I’m posting my examination here (with some webifying of links).

At the time of this posting, I don’t yet know if I have passed.

1. Some e-mail spam detectors use statistical machine learning methods to continuously retrain a classifier based on user input (marking messages as spam or ham). These systems have been criticized for being vulnerable to mistraining by a skilled adversary who sends “tricky spam” that causes the classifier to be poisoned. Exam question: Propose tests that can determine how vulnerable a spam detector is to such manipulation. (Please limit your answer to two pages.)

Tests for classifier poisoning vulnerability in statistical spam filtering systems can consist of simulating particular attacks that would exploit these vulnerabilities. Many of these tests are described in Graham-Cumming, “Does Bayesian Poisoning exist?”, 2006 [pdf.gz], including:

  • For classifiers trained on a “natural” training data set D and a modified training data set D’ that has been generated to include more common words in messages labeled as spam, compare specificity, sensitivity, or more generally the ROC plots of each for performance. This simulates an attack that aims to increase the false positive rate by making words common to hammy messages be evaluated as spammy.
  • Same as above, but construct D’ to include many spam messages with unique words. This exploits a tendency in some Bayesian spam filters to measure the spamminess of a word by the percentage of spam messages that contain it. If successful, the attack dilutes the classifier’s sensitivity to spam over a variety of nonsense features, allowing more mundane spam to get through the filter as false negatives.

These two tests depend on increasing the number of spam messages in the data set in a way that strategically biases the classifier. This is the most common form of mistraining attack. Interestingly, these attacks assume that users will correctly label the poisoning messages as spam. So these attacks depend on weaknesses in the filter’s feature model and improper calibration to feature frequency.

A more devious attack of this kind would depend on deceiving the users of the filtering system to mislabel spam as ham or, more dramatically, acknowledge true ham that drives up the hamminess of features normally found in spam.

An example of an attack of this kind (though perhaps not intended as an attack per se) is @Horse_ebooks, a Twitter account that gained popularity while posting randomly chosen bits of prose and, only occasionally, links to purchase low-quality self-help ebooks. Allegedly, it was originally a spam bot engaged in a poisoning/evasion attack, but it developed a cult following who appreciated its absurdist poetic style. Its success (which only grew after the account was purchased by New York-based performance artist Jacob Bakkila in 2011) inspired an imitative style of Twitter activity.

Assuming Twitter is retraining on this data, this behavior could be seen as a kind of poisoning attack, albeit one by the filter’s users against the system itself. Since it may benefit some Twitter users to have an inflated number of “followers” to project an exaggerated image of their own importance, it’s not clear whether it is in users’ interests to assist in spam detection or to sabotage it.

Whatever the interests involved, testing for vulnerability to this “tricky ham” attack can be conducted in a similar way to the other attacks: by padding the modified data set D’ with additional samples with abnormal statistical properties (e.g. noisy words and syntax), this time labeled as ham, and comparing the classifiers along the usual performance metrics.
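To make these test procedures concrete, here is a minimal sketch (the corpora are toy data, and the construction of D’ is illustrative rather than a reference implementation) comparing a naive Bayes filter trained on a natural set D against one trained on a padded set D’:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics import roc_auc_score
    from sklearn.naive_bayes import MultinomialNB

    # Toy corpora; a real test would draw on a realistic mail archive.
    ham = ["lunch tomorrow?", "meeting notes attached", "see you at the game"]
    spam = ["cheap pills online", "win money now", "click here for prizes"]

    # Poisoning messages: spam padded with hammy vocabulary, labeled as spam,
    # aiming to push common ham words toward spamminess (more false positives).
    poison = ["lunch meeting notes game cheap pills"] * 3

    def train(texts, labels):
        vectorizer = CountVectorizer()
        model = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)
        return vectorizer, model

    test_texts = ["notes for the lunch meeting", "win cheap prizes now"]
    test_labels = [0, 1]  # 0 = ham, 1 = spam

    # D is the natural training set; D' adds the poisoning messages.
    for name, extra in [("D", []), ("D'", poison)]:
        vec, clf = train(ham + spam + extra, [0] * 3 + [1] * 3 + [1] * len(extra))
        scores = clf.predict_proba(vec.transform(test_texts))[:, 1]
        print(name, roc_auc_score(test_labels, scores))

The same harness covers the “tricky ham” variant: pad D’ with abnormal messages labeled as ham instead, and compare the resulting false negative rates.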

2. Analytical models of cascading behavior in networks, e.g., threshold-based or contagion-based models, are well-suited for analyzing the social dynamics in open collaboration and peer production systems. Discuss.

Cascading behavior models are well-suited to modeling information and innovation diffusion over a network. They are well-suited to analyzing peer production systems to the extent that the dynamics of those systems consist of such diffusion over non-trivial networks. This is the case when production is highly decentralized. Whether we see peer production as centralized or not depends largely on the scale of analysis.

Narrowing in, consider the problem of recruiting new participants to an ongoing collaboration around a particular digital good, such as an open source software product or free encyclopedia. We should expect the usual cascading models to be informative about the awareness and adoption of the good. But in most cases awareness and adoption are necessary but not sufficient conditions for active participation in production. This is because, for example, contribution may involve incurring additional costs and so be subject to different constraints than merely consuming or spreading the word about a digital good.

Though threshold and contagion models could be adapted to capture some of this reluctance through higher thresholds or lower contagion rates, these models fail to closely capture the dynamics of complex collaboration because they represent the cascading behavior as homogeneous. In many open collaborative projects, contributions (and the individual costs of providing them) are specialized. Recruited participants come equipped with their unique backgrounds. (von Krogh, G., Spaeth, S. & Lakhani, K. R. “Community, joining, and specialization in open source software innovation: a case study.” (2003)) So adapting behavior cascade models to this environment would require, at minimum, parameterization of per-node capacities for project contribution. The participants in complex collaboration fulfill ecological niches more than they reflect the dynamics of large networked populations.

Furthermore, at the level of a closely collaborative on-line community, network structure is often trivial. Projects may be centralized around a mailing list, source code repository, or public forum that effectively makes the communication network a large clique of all participants. Cascading behavior models will not help with analysis of these cases.

On the other hand, if we zoom out to look at open collaboration as a decentralized process–say, all open source software developers, or distributed joke production on Weird Twitter–then network structure becomes important again, and the effects of diffusion may dominate the internal dynamics of innovation itself. Whether or not a software developer chooses to code in Python or Ruby, for example, may well depend on a threshold fraction of the developer’s neighbors in a communication network. These choices allow for contagious adoption of new libraries and code.

We could imagine a distributed innovation system in which every node maintained its own repository of changes, some of which it developed on its own and others it adapted from its neighbors. Maybe the network of human innovators, each drawing from their experiences and skills while developing new ones in the company of others, is like this. This view highlights the emergent social behavior of open innovation, putting the technical architecture (which may affect network structure but could otherwise be considered exogenous) in the background. (See next exam question).

My opinion is that while cascading behavior models may, in decentralized conditions, capture important aspects of the dynamics of peer production, the basic models will fall short because they don’t consider the interdependence of behaviors. Digital products are often designed for penetration in different networks. For example, the choice of programming language in which to implement one’s project influences its potential for early adoption and recruitment. Analytic modeling of these diffusion patterns with cascade models could gain from augmenting the model with representations of technical dependency; a minimal version of the per-node extension is sketched below.
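As a concrete illustration of the per-node parameterization suggested above (the graph, thresholds, and capacities are all hypothetical), here is a minimal sketch of a linear threshold cascade extended with heterogeneous contribution capacities, so that adoption and contribution come apart:

    import random

    import networkx as nx

    random.seed(42)
    G = nx.erdos_renyi_graph(n=100, p=0.05, seed=42)

    # Heterogeneous parameters: an adoption threshold (the classic linear
    # threshold model) plus a contribution capacity reflecting specialization.
    threshold = {v: random.uniform(0.1, 0.5) for v in G}
    capacity = {v: random.choice([0, 1, 2, 5]) for v in G}  # units of work

    adopters = set(random.sample(list(G), 5))  # seed nodes
    changed = True
    while changed:
        changed = False
        for v in G:
            if v in adopters or G.degree(v) == 0:
                continue
            neighbor_fraction = sum(u in adopters for u in G[v]) / G.degree(v)
            if neighbor_fraction >= threshold[v]:
                adopters.add(v)
                changed = True

    total_contribution = sum(capacity[v] for v in adopters)
    print(len(adopters), "adopters; total contribution:", total_contribution)

In the plain model the cascade size is the whole story; here two cascades of identical size can yield very different project output, which is closer to the specialization von Krogh et al. observe.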

3. Online communities present many challenges for governance and collective behavior, especially in common pool and peer-production contexts. Discuss the relative importance and role of both (1) site architectures and (2) emergent social behaviors in online common pool and/or peer-production contexts. Your answer should draw from more than one real-world example and make specific note of key theoretical perspectives to inform your response. Your response should take approximately 2 pages.

This question requires some unpacking. The sociotechnical systems we are discussing are composed of both technical architecture (often accessed as a web site, i.e. a “location” accessed through HTTP via a web browser) and human agents interacting socially with each other in ways mediated by the architecture (though not exclusively; cf. Coleman’s work on in-person meetings in hacker communities). If technology is “a man-made means to an end” (Heidegger, The Question Concerning Technology), then we can ask of the technical architecture: which man, whose end? So questioning the roles of on-line architecture and emergent behaviors brings us to look at how the technology itself was the result of the emergent social behavior of its architects. For we can consider “importance” from either the perspective of the users or that of the architects. These perspectives reflect different interests and so will have different standards for evaluating the importance of the system’s components. (cf. Habermas, Knowledge and Human Interests)

Let us consider socio-technical systems along a spectrum between two extremes. At one extreme are certain prominent systems–e.g. Yelp and Amazon cultivating common pools of reviews–for which the architects and the users are distinct. The site architecture is a means to the ends of the architects, effected through the stimulation of user activity.

[Diagram: architects acting on users through technology]

Drawing on Winner (“Do Artifacts Have Politics?”), we can see that this socio-technical arrangement establishes a particular pattern of power and authority. Architects have direct control over the technology, which enables user activity up to the limits of its affordances. Users can influence architects through the information their activity generates (often collected through the medium of the technical architecture itself), but they have no direct coercive control. Rather, architects design the technology to motivate certain desirable activity using inter-user feedback mechanisms, such as ways of expressing gratitude or comparing one’s performance with others’. (See Cheshire and Antin, “The Social Psychological Effects of Feedback on the Production of Internet Information Pools”, 2008.) In such a system, users can only gain control of their technical environment by exploiting vulnerabilities in the architecture, in adversarial moves that look a bit like security breaches. (See the first exam question for an example of user-driven information sabotage.) More likely, the vast majority of users will choose to free ride on any common pool resources made available and exit the system when inconvenienced, as the environment is ultimately a transactional one of service provider and consumer.

In these circumstances, it is only by design that social behaviors lead to peer production and common pools of resources. Technology, as an expression of the interests of the architects, plays a more important role than social emergence. To clarify the point, I’d argue that Facebook, despite hosting enormous amounts of social activity, does not enable significant peer production because its main design goals are to drive the creation of proprietary user data and ad clicks. Twitter, in contrast, has from the beginning been designed as a more open platform. The information shared on it is often less personal, so activity more easily crosses the boundary from private to public, enabling collective action. (See Bimber et al., “Reconceptualizing Collective Action in the Contemporary Media Environment”, 2005.) It has facilitated (with varying consistency) the creation of third party clients, as well as applications that interact with its data but can be hosted as separate sites.

This open architecture is necessary but not sufficient for emergent common pool behavior. But the design for open possibilities is significant. It enables the development of novel, intersecting architectures to support the creation of new common pools. Taking Weird Twitter, framed as a peer production community for high quality tweets, as an example, we can see how the service Favstar (which aggregates and ranks tweets that have been highly “starred” and retweeted, and awards congratulatory tweets as prizes) provides historical reminders and relative rankings of tweet quality, thereby facilitating a culture of production. Once formed, such a culture can spread and make use of other available architecture as well. Weird Twitter has inspired Twitter: The Comic, a Tumblr account illustrating “the greatest tweets of our generation.”

Consider another extreme case, the free software community that Kelty identifies as the recursive public. (Two Bits: The Cultural Significance of Free Software) In an idealized model, we could say that in this socio-technical system the architects and the users are the same.

[Diagram: the recursive public]

The artifacts of the recursive public have a different politics than those at the other end of our spectrum, because the coercive aspects of the architectural design are the consequences of the emergent social behavior of those affected by it. Consequently, technology created in this way is rarely restrictive of productive potential, but on the contrary is designed to further empower the collaborative communities that produced it. The history of Unix, Mozilla, Emacs, version control systems, issue tracking software, Wikimedia, and the rest can be read as the historical unfolding of the human interest in an alternative, emancipated form of production. Here, the emergent social behavior claims its importance over and above the particulars of the technology itself.

reflexive data science: technical, practical and emancipatory interests?

As Cathryn Carson is currently my boss at Berkeley’s D-Lab, it behooves me to read her papers. Thankfully, we share an interest in Habermasian epistemology. Today I read her “Science as instrumental reason: Heidegger, Habermas, Heisenberg.”

Though I can barely do justice to the paper, I’ll try to summarize: it grapples with the history of how science became constructed as a purely instrumental project (a mode of inquiry that perfects means without specifying particular ends) through the interactions between Heisenberg, the premier theoretical physicist in Germany in his time, and Heidegger, the great philosopher, and then later through Habermas’ response to Heidegger.

Heisenberg, most famous perhaps for the Heisenberg Uncertainty Principle, was himself reflective about the role of the scientist within science, and identified the limits of the subject and measurement within physics. But far from surpassing an older metaphysical idea of the subject-object divide, this only entrenched the scientist further, according to Heidegger. This is because the scientist qua scientist never encounters the world in a way that is not tied up in the scientific, technical mode, and so pure being eludes him. While that may simply mean that pure being is left to philosophers and scientists are allowed to go on with their instrumental project, this mode of inquiry became insufficient when scientists were called on to comment on nuclear proliferation policy.

Such policy decisions are questions of praxis, or practical action in the human (as opposed to natural) world. Habermas was concerned with the hermeneutic epistemology of praxis, as well as the critical epistemology of emancipation, which are more the purview of the social sciences. Habermas tends to segment these modes of inquiry from each other, without (as far as I’ve encountered so far) anticipating a synthesis.

In data science, we see a broadly positivist, statistical, analytic treatment of social data. In its commercial applications to sell ads or conduct high-speed trading, we could say on a first pass that the science serves the technical human interest: prediction and control for some unspecified end. But that would be misleading. The breadth of methodological options available to the data scientist means that the methods are often very closely tailored to the particular ends and conditions of the project. Data science as a method is an instrument. But the results of commercial data science are by and large not nomological (identifying laws of human behavior) but rather an immediately applied idiography. Or, more than an applied idiography, data science provides a probabilistic profile of its diverse subjects–an electron cloud of possibilities that the commercial data scientist uses to steer behavior en masse.

Of course, the uncertainty principle applies here as well: the human subject reacts to being measured, and has the potential to change direction upon seeing that they are being targeted with this ad or that.

Further complicating the picture is that the application of the ‘social technology’ of commercially driven data science is praxis, albeit in an apolitical sense. Enmeshed in a thick and complex technological web though it may be, showing an ad and having it be clicked on is nevertheless a move in the game of social relations. It is a handshake between cyborgs. And so even commercial data science must engage in hermeneutics, if Habermas is correct. Natural language processing provides the uncomfortable edge case here: can we have a technology that accomplishes hermeneutics for us? Apparently so, if a machine can identify somebody’s interest in a product or service from their linguistic output.

Though jarring, this is easier to cope with intellectually if we see the hermeneutic agent as a socio-technical system, as opposed to a purely technical system. Cyborg praxis will include statistical/technical systems made of wires and silicon, just as meatier praxis includes statistical/technical systems made of proteins and cartilage.

But what of emancipation? This is the least likely human interest to be advanced by commercial interests. If I’ve got my bearings right, the emancipatory interest in the (social) sciences comes from the critical theory tradition, perhaps best exemplified in German thought by the Frankfurt School. One is meant to be emancipated by such inquiry from the power of the capitalist state. What would it mean for there to be an emancipatory data science?

I was recently asked out of the blue in an email whether there were any organizations using machine learning and predictive analytics toward social justice ends. I was ashamed to say I didn’t know of any organizations doing that kind of work. It is hard to imagine what an emancipatory data science would look like. Education or communication about data scientific techniques might be emancipatory (I was trying to accomplish something like this with Why Weird Twitter, for what it’s worth), but that was a qualitative study, not a data scientific one.

Taking our cue from above, an emancipatory data science would have to use data science methods in the service of the human interest in emancipation. For this, we would need to use the methods to understand the conditions of power and dependency that bind us. That is difficult for an individual, but these techniques could perhaps be used to greater effect by an emancipatory sociotechnical organization. Such an organization would need to be concerned with its own autonomy as well as the autonomy of others.

The closest thing I can imagine to such a sociotechnical system is what Kelty describes as the recursive public: the loose coalition of open source developers, open access researchers, and others concerned with transforming their social, economic, and technical conditions for emancipatory ends. Happily, the D-Lab’s technical infrastructure team appears to be populated entirely by citizens of the recursive public. Though this is naturally a matter of minor controversy within the lab (it’s hard to convince folks who haven’t directly experienced the emancipatory potential of the movement of its value), I’m glad that it stands on more or less robust historical grounds. While the course I am co-teaching on Open Collaboration and Peer Production will likely not get into critical theory, I expect that exposure to more emancipated communities of praxis will make something click.

What I’m going for, personally, is a synthetic science that is at once technical and engaged in emancipatory praxis.

Virtual innovation clusters and tidal spillovers

I’ve recently begun a new project with Camilla Hrdy about government procurement as a local innovation incentive. Serendipitously, this has exposed me to literature around innovation clusters and spillover effects, as in this Fallah and Ibrahim literature review. This has been an “aha!” moment.

Innovation clusters are, in the literature, geographic places like Silicon Valley, Cambridge, Massachusetts, and other urban areas where there is a lot of R&D investment. Received wisdom is that these areas wind up driving their local economies through spillover effects, a market externality in which those beyond the intended beneficiaries of an innovation benefit from it, normally through informal knowledge transfer.

To economists in the ’90s, this was a significant and exceptional property of certain geographic places. To the digital native acclimated to Free Culture, it is a way of life. Spillovers are defined in terms of the intended boundaries of the recipients of innovative information; when there is no intended boundary, you still get the spillover effect on innovation itself (however incomprehensible the incentives may be to the ’90s economist). The Internet provides a virtual proximity that turns it into an innovation cluster. Advances in human-computer interaction further enable this virtual proximity. We might say that Github, for example, is an innovation cluster with a higher degree of virtual proximity between its innovators within the larger virtual innovation cluster that includes SourceForge, Bitbucket, and everything else. (I’m considering software engineering as the particular case here.)

By binding together other innovation clusters, this virtual proximity leads to the innovation explosion we’ve seen in the past 10 or so years. “Everything changes so fast.” It’s true!

Outside of the software environment, we can point to other virtual innovation clusters such as Weird Twitter, where virtual proximity and spillover effects are used to innovate rapidly in humor.

The drive to open access academic research is due in part to an understanding of these spillover effects. You increase impact by encouraging spillover. I.e., you try to make waves. Academic research becomes more like speciality journalism in the sense that you try to break a story globally, not just to a particular academic community. The speed of innovation in such a dynamic environment is bewildering and perhaps the university tenure-based incentive system is not well designed to handle it, but nevertheless these are the times.

Jack Burris at the Berkeley D-Lab likes to say that the D-Lab is designed to support ‘collisions’ between researchers in different fields. “Spillovers” might be a term with more literature behind it. Indeed, interdisciplinarity needs to start with collisions or spillovers, because that is what creates mixing between siloed innovations. I’ve heard that Soo and Carson’s paper about Clark Kerr as an industrial organizer explains some of the idiosyncrasies of Berkeley in particular as an accumulation of silos.

Which explains the D-Lab’s current agenda as a mix of open source evangelism, reproducible research technology adoption, technical skills training for social scientists, and the eschewal of disciplinary distinctions. If Berkeley’s success as a research institution depends on its being an effective innovation cluster, even within the larger innovation cluster that is the coast of Northern California, then it will need to increase the virtual proximity of its constituent innovators. Furthermore, this will expose non-local actors to spillovers from Berkeley, and perhaps Berkeley to spillovers from other institutions. This is of course a shift in degree, not kind, from the way the academic system already works in the economy. But what’s new is the use of disruptive infrastructure to accelerate the process.

This would all be wonderful if it were not also tilting towards a crisis, since it’s unclear how the human beings in the system are meant to adapt to these rapid changes. What is scholarship when the body of literature available on a particular topic is no longer strictly filtered by a hierarchical community of practice but rather is available to anybody with the (admittedly often specialized, but increasingly available) literacy? Is expertise just a matter of having the leisure and discretion to retweet the latest and greatest? Or do you make it by positioning yourself skillfully at the right point of the long tail? Or is this once-glorified social role of intellectual labor now just a perfunctory routine we can replace with ranks of amateurs?

To be a good expert is to be a good node. Not a central node, not a loud node, just a good node. This is humbling for experts, but these are the times.

How to tell the story about why stories don’t matter

I’m thinking of taking this seminar because I’m running into the problem it addresses: how do you pick a theoretical lens for academic writing?

This is related to a conversation I’ve found myself in repeatedly over the past weeks. A friend who studied Rhetoric insists that the narrative and framing of history is more important than the events and facts. A philosopher friend minimizes the historical impact of increased volumes of “raw footage”, because ultimately it’s the framing that will matter.

Yesterday I had the privilege of attending Techraking III, a conference put on by the Center for Investigative Reporting with the generous support and presence of Google. It was a conference about data journalism. The popular sentiment within the conference was that data doesn’t matter unless it’s told with a story, a framing.

I find this troubling because while I pay attention to this world and the way it frames itself, I also read the tech biz press carefully, and it tells a very different narrative. Data is worth billions of dollars. Even data exhaust, the data fumes that come from your information processing factory, can be recycled into valuable insights. Data is there to be mined for value. And if you are particularly genius at it, you can build an expert system that acts on the data without needing interpretation. You build an information processing machine that acts according to mechanical principles that approximate statistical laws, and these machines are powerful.

As social scientists realize they need to be data scientists, and journalists realize they need to be data journalists, there seems to be in practice a tacit admission of the data-driven counter-narrative. This tacit approval is contradicted by the explicit rhetoric that glorifies interpretation and narrative over data.

This is an interesting kind of contradiction, as it takes place as much in the psyche of the data scientist as anywhere else. It’s like the mouth doesn’t know what the hand is doing. This is entirely possible since our minds aren’t actually that coherent to start with. But it does make the process of collaboratively interacting with others in the data science field super complicated.

All this comes to a head when the data we are talking about isn’t something simple like sensor data about the weather, but rather something like text, which is both data and narrative simultaneously. We intuitively see the potential of treating narrative as something to be handled mechanically, statistically. We certainly see the effects of this in our daily lives. This is what the most powerful organizations in the world do all the time.

The irony is that the interpretivists, who are so quick to deny technological determinism, are the ones most vulnerable to being blindsided by “what technology wants.” Humanities departments are being slowly phased out, their funding cut. Why? Do they have an explanation for this? If interpretation/framing were as efficacious as they claim, they would be philosopher kings. So their sociopolitical situation contradicts their own rhetoric and ideology. Meanwhile, journalists who would like to believe that it’s the story that matters are, for the sake of job security, being corralled into classes to learn CSS, the stylesheet language that determines, mechanically, the logic of formatting and presentation.

Sadly, neither mechanists nor interpretivists have much of an interest in engaging this contradiction. This is because interpretivists chase funding by reinforcing the narrative that they are critically important, and the work of mechanists speaks for itself in corporate accounting (an uninterpretive field) without explanation. So this contradiction falls mainly into the laps of those coordinating interaction between tribes. Managers who need to communicate between engineering and marketing. University administrators who have to juggle the interests of humanities and sciences. The leadership of investigative reporting non-profits who need to justify themselves to savvy foundations and who are removed enough from particular skillsets to be flexible.

Mechanized information processing is becoming the new epistemic center. (Forgive me:) the Google supercomputer approximating statistics has replaced Kantian transcendental reason as the grounds for the bourgeois understanding of the world. This is threatening, of course, to the plurality of perspectives that do not themselves internalize the logic of machine learning. Where machine intelligence has succeeded, then, it has been by juggling this multitude of perspectives (and frames) through automated, data-driven processes. Machine intelligence is not comprehensible to lay interpretivism. Interestingly, lay interpretivism isn’t comprehensible to machine intelligence yet either; natural language processing has not advanced so far. It treats our communications like we treat ants in an ant farm: a blooming, buzzing confusion of arbitrary quanta, fascinatingly complex for patterns we cannot see. And when it makes mistakes, as it often does, we feel its effects as a structural force beyond our control: a change in the user interface of Facebook that suddenly exposes drunken college photos to employers and abusive ex-lovers.

What theoretical frame is adequate to tell this story, the story that’s determining the shape of knowledge today? For Lyotard, the postmodern condition is one in which metanarratives about the organization of knowledge collapse and leave only politics, power, and language games. The postmodern condition has gotten us into our present condition: industrial machine intelligence presiding over interpretivists battling in paralogical language games. When the interpretivists strike back, it looks like hipsters or Weird Twitter–paralogy as a subculture of resistance that can’t even acknowledge its own role as resistance for fear of recuperation.

We need a new metanarrative to get out of this mess. But what kind of theory could possibly satisfy all these constituents?

Reinventing wheels with Dissertron

I’ve found a vehicle for working on the Dissertron through the website of the course I’ll be co-teaching this Fall on Open Collaboration and Peer Production.

In the end, I went with Pelican, not Hyde. It was a difficult decision (I did like the idea of supporting an open source project coming out of Chennai, especially after reading Coding Places), but I had it on good authority that Pelican was more featureful, with cleaner code. So here I am.

The features I have in mind are crystallizing as I explore the landscape of existing tools more. This is my new list:

  • Automatically include academic metadata on each Dissertron page so it’s easy to slurp into Zotero (see the sketch just after this list).
  • Include the Hypothes.is widget for annotations. I think Hypothes.is will be better for commenting than Disqus because it does annotations in-line, as opposed to comments in the footer. It also uses the emerging W3C Open Annotation standard. I’d like this to be as standards-based as possible.
  • Use citeproc-js to render citations cleanly in the browser. I think this handles the issue of in-line linked academic citations without requiring a lot of manual work. citeproc-js looks like it has come out of the Zotero project as well. Since Elsevier bought Mendeley, Zotero seems like the more reliable ally to pick for independent scholarship.
  • Trickiest is going to be porting a lot of features from jekyll-scholar into a Pelican plug-in. I really want jekyll-scholar’s bibliographic management, but I’m a little worried that Pelican isn’t well-designed for that sort of flexibility in theming (see the plugin sketch at the end of this post). More soon.
  • I’m interested in getting the HTML output of Dissertron as close as possible to emerging de facto standards for on-line scholarship. I’ve asked about what PLOS ONE does here. The answer sounds complicated: a tool chain that goes from LaTeX to Word docs to NLM 3.0 XML (which I didn’t even know was a thing), and at last into HTML. I’m trying to start from Markdown because I think it’s a simple markup language for the future, but I’m not deep enough in that tool chain to understand how to replicate its idiosyncrasies.
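To make the metadata item concrete, here is a minimal sketch, not actual Dissertron code: it renders Highwire-style citation_* meta tags (the convention Google Scholar documents, and one that Zotero’s embedded-metadata translator can read) with Jinja2. All the bibliographic values are hypothetical placeholders.

```python
# Minimal sketch: emit Highwire-style citation_* meta tags so that
# reference managers like Zotero can scrape bibliographic metadata
# directly from a Dissertron page. All values are placeholders.
from jinja2 import Template

META_TEMPLATE = Template("""\
<meta name="citation_title" content="{{ title }}">
{% for author in authors -%}
<meta name="citation_author" content="{{ author }}">
{% endfor -%}
<meta name="citation_publication_date" content="{{ date }}">""")

print(META_TEMPLATE.render(
    title="Complications in Scholarly Hypertext",  # hypothetical
    authors=["A. Scholar"],                        # hypothetical
    date="2013/06/01",                             # hypothetical
))
```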

If I could have all these nice things, and maybe a pony, then I would be happy and have no more excuses for not actually doing research, as opposed to obsessing about the tooling around independent publishing.
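Since I keep mentioning the plug-in, here is a hedged sketch of the shape I have in mind, assuming Pelican’s standard signals API. It is emphatically not a working jekyll-scholar port; the bibliography handling is stubbed out.

```python
# Hedged sketch of a Pelican plugin skeleton (not a jekyll-scholar port):
# Pelican plugins expose a register() function that connects handlers
# to signals fired during site generation.
from pelican import signals

def add_bibliography(generator):
    # Stub: a real version would parse a BibTeX file and attach the
    # entries cited by each article, for templates to render.
    for article in generator.articles:
        article.bibliography = []  # placeholder for parsed entries

def register():
    signals.article_generator_finalized.connect(add_bibliography)
```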

Dissertron build notes

I’m going to start building the Dissertron now. These are my notes.

  • I’m going with Hyde as a static site generator on Nick Doty’s recommendation. It appears to be tracking Jekyll in terms of features, but squares better with my Python/Django background (it uses Jinja2 templates in its current, possibly-1.0-but-under-development version). Meanwhile, at Berkeley we seem to be investing a lot in Python as the language of scientific computing. If scientists’ skills should be transferable to their publication tools, this seems like the way to go.
  • Documentation for Hyde is a bit scattered. This first steps guide is sort of helpful, and then there are these docs hosted on Github. As mentioned, they’ve moved away from Django templates to Jinja2, which is similar but less idiosyncratic. They refer you to the Jinja2 docs here for templating.
  • Just trying to make a Hello World site, I ran into an issue with Markdown rendering. I’ve filed an issue with the project, and will use it as a test of the community’s responsiveness. Since Hyde is competing with a lot of other Python static site generators, it’s useful to bump into this sort of thing early.
  • Got this response from the creator of Hyde in less than 3 hours. The problem was with my Jinja2 fu (which is weak at the moment); it turns out I have a lot to learn about whitespace control (see the sketch just after this list). Super positive community experience. I’ll stick with Hyde.
  • “Hello World” intact and framework chosen, my next step is to convert part 2 of my Weird Twitter work to Markdown and use Hyde’s tools to give it some decent layout. If I can make some headway on citation formatting and management in the process, so much the better.
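Since whitespace control is what bit me, here is a tiny self-contained demonstration (assuming only the jinja2 library): a minus sign inside the tag delimiters strips the whitespace around the tag, which is the difference between sloppy and tidy generated markup.

```python
# Demonstration of Jinja2 whitespace control: '-' inside {% ... %}
# delimiters trims adjacent whitespace, including newlines.
from jinja2 import Template

sloppy = Template("{% for x in items %}\n{{ x }}\n{% endfor %}")
tidy = Template("{% for x in items -%}\n{{ x }}\n{%- endfor %}")

items = ["one", "two"]
print(repr(sloppy.render(items=items)))  # '\none\n\ntwo\n' -- stray newlines
print(repr(tidy.render(items=items)))    # 'onetwo' -- whitespace trimmed
```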

Complications in Scholarly Hypertext

I’ve got a lot of questions about on-line academic publishing. A lot of this comes from career anxiety: I am not a very good academic because I don’t know how to write for academic conferences and journals. But I’m also coming from an industry that is totally eating the academy’s lunch when it comes to innovating and disseminating information. People within academia are increasingly feeling the disruptive pressure of alternative publication venues and formats, and moreover seeing the need for alternatives for the sake of the intellectual integrity of the whole enterprise. Open science, open data, reproducible research: these are keywords for new practices that are meant to restore confidence in science itself, in part by making it more accessible.

One manifestation of this trend is the transition of academic group blogs into academic quasi-journals or on-line magazines. I don’t know how common this is, but I recently had a fantastic experience of it writing for Ethnography Matters. Instead of going through an opaque and problematic academic review process, I worked with editor Rachelle Annechino to craft a piece about Weird Twitter that was appropriate for the edition and audience.

During the editing process, I tried to unload everything I had to say about Weird Twitter so that I could at last get past it. I don’t consider myself an ethnographer and I don’t want to write my dissertation on Weird Twitter. But Rachelle encouraged me to split off the pseudo-ethnographic section into a separate post, since the first half was more consistent with the Virtual Identity edition. (Interesting how the word “edition”, which has come to mean “all the copies of a specific issue of a newspaper”, in the digital context returns to its etymological roots as simply something published or produced (past participle).)

Which means I’m still left with the (impossible) task of doing an ethnography (something I’m not very well trained for) about Weird Twitter (which might not exist). Since I don’t want to violate the contextual integrity of Weird Twitter more than I already have, I’m reluctant to write about it in a non-Web-based medium.

This carries with it a number of challenges, not least of which is the reception on Twitter itself.

What my thesaurus and I do in the privacy of our home is our business and anyway entirely legal in the state of California. But I’ve come to realize that forced disclosure is an occupational hazard I need to learn to accept. What these remarks point to, though, is the tension between access to documents as data and access to documents as sources of information. The latter, as we know from Claude Shannon, requires an interpreter who can decode the language in which the information is written.

Expert language is a prison for knowledge and understanding. A prison for intellectually significant relationships. It is time to move beyond the institutional practices of triviledge.

- Taylor and Saarinen, 1994, quoted in Kolb, 1997

Is it possible to get away from expert language in scholarly writing? Naively, one could ask experts to write everything “in plain English.” But that doesn’t do language justice: often (though certainly not always) new words express new concepts. Using a technical vocabulary fluently requires not just a thesaurus, but an actual understanding of the technical domain. I’ve been through the phase myself in which I thought I knew everything and so blamed anything written opaquely to me on obscurantism. Now I’m humbler and harder to understand.

What is so promising about hypertext as a scholarly medium is that it offers a solution to this problem. Wikipedia is successful because it directly links jargon to further content that explains it. Those with the necessary expertise to read something can get the intended meaning out of an article, and those who are confused by the terminology can romp around learning things. Maybe they will come back to the original article later with an expanded understanding.

xkcd: The Problem with Wikipedia

Hypertext and hypertext-based reading practices are valuable for making one’s work open and accessible. But it’s not clear how to combine them with scholarly conventions on referencing and citations. Just to take Ethnography Matters as an example: in my article I used in-line linking and, where I got around to it, parenthetical bibliographic information. Contrast with Heather Ford’s article in the same edition, which has no links and a section at the end for academic references. The APA has rules for citing web resources within an academic paper. What’s not clear is how directly linking citations within an academic hypertext document should work.

One reason for the lack of consensus around this issue is that citation formatting is a pain in the butt. For off-line documents, word processing software provides myriad tools for streamlining bibliographic work. But for publishing academic work on the web, we write in markup languages or WYSIWYG editors.
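As a thought experiment, here is a toy sketch of my own (not any existing standard or tool) that resolves pandoc-style [@key] citations in Markdown source into in-line links from a bibliography mapping. The citation key and URL are hypothetical placeholders.

```python
# Toy sketch (not a standard): resolve pandoc-style [@key] citations
# in Markdown into in-line hyperlinks using a key-to-URL mapping.
import re

BIBLIOGRAPHY = {"ford2013": "https://example.org/ford2013"}  # hypothetical

def link_citations(markdown_text):
    """Replace each [@key] with a Markdown link to the cited work."""
    def replace(match):
        key = match.group(1)
        return "[({})]({})".format(key, BIBLIOGRAPHY.get(key, "#"))
    return re.sub(r"\[@([\w:.-]+)\]", replace, markdown_text)

print(link_citations("Compare Heather Ford's approach [@ford2013]."))
```

A real tool would hand this job to something like citeproc, but even a hack like this shows how little running code a de facto standard might need.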

Since standards on the web tend to evolve through “rough consensus and running code”, I expect we’ll see a standard for this sort of thing emerge when somebody builds a tool that makes it easy for them to follow. This leads me back to fantasizing about the Dissertron. This is a bit disturbing. As much as I’d like to get away from studying Weird Twitter, I see now that a Weird Twitter ethnography is the perfect test-bed for such a tool precisely because of the hostile scrutiny it would attract.
