Digifesto

Tag: data science

Reflections on the Berkeley Institute for Data Science (BIDS) Launch

Last week was the launch of the Berkeley Institute for Data Science.

Whatever might actually happen as a result of the launch, what was said at the launch was epic.

Vice Chancellor for Research Graham Fleming introduced Chancellor Nicholas Dirks for the welcoming remarks. Dirks is UC Berkeley's 10th Chancellor. He succeeded Robert Birgeneau, who resigned gracefully shortly after coming under heavy criticism for his handling of Occupy Cal, the Berkeley campus's chapter of the Occupy movement. Birgeneau was distinctly unsympathetic to the protesters, and a widely circulated petition declared a lack of confidence in his leadership. Birgeneau is a physicist. Dirks is an anthropologist who has championed postcolonial approaches. Within the politics of the university, which are a microcosm of politics at large, this signalling is clear. Dirks' appointment was meant to satisfy the left-wing protesters, most of whom have been trained in the softer social sciences themselves. Critical reflection on power dynamics and engagement in activism–which is often associated with leftist politics–are, at least formally, accepted by the university administration as legitimate. Birgeneau would subsequently receive awards for his leadership in drawing more women into the sciences and aiding undocumented students.

Dirks’ welcoming remarks were about the great accomplishments of UC Berkeley as a research institution and the vague but extraordinary potential of BIDS. He is grateful, as we all are, for the funding from the Moore and Sloan foundations. I found his remarks unspecific, and I couldn’t help but wonder what his true thoughts were about data science in the university. Surely he must have an opinion. As an anthropologist, can he consistently believe that data science, especially in the social sciences, is the future?

Vicki Chandler, Chief Program Officer of the Moore Foundation, was more lively. Pulling no punches, she explained that the purpose of BIDS is to shake up scientific culture. Having hung out in Berkeley in the '60s and attended as an undergraduate in the '70s, she believes we are up for it. She spoke again and again of "revolution". There is ambiguity in this. In my experience, faculty are divided on whether they see the proposed "open science" changes as imminent or hype, as desirable or dangerous. More and more I see faculty acknowledge that we are witnessing the collapse of the ivory tower. It is possible that the BIDS launch is a tipping point. What next? "Let the fun begin!" concluded Chandler.

Saul Perlmutter, Nobel laureate physicist and front man of the BIDS co-PI super group, gave his now practiced and condensed pitch for the new Institute. He hit all the high points, pointing out not only the potential of data science but also the importance of changing the institutions themselves. We should rethink the peer-reviewed journal from scratch, he said, and focus more on code reuse; software can be a valid research output. As popular as open science is among the new generation of scientists, this is a bold statement for somebody with such credibility within the university. He even said that the success of open source software is what gives us hope for the revolutionary new kind of science BIDS is beginning. Two years ago, this was a fringe idea. Perlmutter may have just made it mainstream.

Notably, he also engaged with the touchy academic politics, saying that data science could bring diversity to the sciences (though he was unspecific about the mechanism for this). He expounded on the important role of ethnography in evaluating the Institute and identifying the bottlenecks that keep it from unlocking its potential.

The man has won at physics and is undoubtedly a scientist par excellence. Perhaps Perlmutter sees the next part of his legacy as bringing the university system into the 21st century.

David Culler, Chair of the Electrical Engineering and Computer Science department, then introduced a number of academic scientists, each with impressive demonstrations about how data science could be applied to important problems like climate change and disaster reduction. Much of this research depends on using the proliferation of hand-held mobile devices as sensors. University science, I realized while watching this, is at its best when doing basic research about how to save humanity from nature or ourselves.

But for me the most interesting speakers in the first half of the launch were luminaries Peter Norvig and Tim O'Reilly, both giants in their own right and welcome guests at the university.

Culler introduced Norvig, Director of Research at Google, by crediting him as one of the inventors of the MOOC. I know his name mainly as a co-author of “Artificial Intelligence: A Modern Approach,” which I learned and taught from as an undergraduate. Amazingly, Norvig's main message was about the economics of the digital economy. Marginal production is cheap, communication is cheap, and this leads to an accumulation of wealth. Fifty percent of jobs are predicted to be automated away in the coming decades. He is worried about the 99%–freely using Occupy rhetoric. What will become of them? Norvig's solution, perhaps stated tongue in cheek, is that everyone needs to become a data scientist. More concretely, he has high hopes for hybrid teams of people and machines, and he expects all professions to become like this. By defining what academic data science looks like and training the next generation of researchers, BIDS will have a role in steering the balance of power between humanity and the machines–and the elite few who own them.

His remarks hit home. He touched on anxieties that are as old as the Industrial Revolution: is somebody getting immensely rich off of these transformations, but not me? What will my role be in this transformed reality? Will I find work? These are real problems and Norvig was brave to bring them up. The academics in the room were not immune from these anxieties either, as they watch the ivory tower crumble around them. This would come up again later in the day.

I admire him for bringing up the point, and I believe he is sincere. I'd heard him make the same points when he was on a panel with Neal Stephenson and Jaron Lanier a month or so earlier. Still, I can't help but be critical of Norvig's remarks. Is he covering his back? Many university professors see MOOCs as threatening to their own careers. It is encouraging that he sees the importance of hybrid human/machine teams. But if the machines are built on Google infrastructure, doesn't this contribute to the same inequality he laments, shifting power away from teachers and toward the 1% at Google? Or does he foresee a MOOC-based educational boom?

He did not raise the possibility that human/machine hybridity is already the status quo–that, for example, all information workers tap away at these machines and communicate with each other through a vast technical network. If he had acknowledged that we are all cyborgs already, he would have had to admit that hybrid teams of humans and machines are as much the cause of as solution to economic inequality. Indeed, this relationship between human labor and mechanical capital is precisely the same as the one that created economic inequality in the Industrial Revolution. When the capital is privately owned, the systems of hybrid human/machine productivity favor the owner of the machines.

I have high hopes that BIDS will address Norvig's political concern through its research. It is certainly on the mind of some of its co-PIs, as later discussion would show. But to address the problem seriously, it will have to look at it in a rigorous way that doesn't shy away from criticism of the status quo.

The next speaker, Tim O'Reilly, is a figure who fascinates me. Culler introduced him as a “God of the Open Source Field,” which is poetically accurate. Before coming to academia, I worked on Web 2.0 open source software platforms for open government. My career was defined by a string of terms invented and popularized by O'Reilly, and to a large extent I'm still a devotee of his ideas. But as a practitioner and researcher, I've developed a more nuanced view of the field, one I've tried to convey in the course on Open Collaboration and Peer Production I've co-instructed with Thomas Maillart this semester.

O'Reilly came under criticism earlier this year from Evgeny Morozov, who attacked him for marketing politically unctuous ideas while claiming to be revolutionary. Morozov focuses on O'Reilly's promotion of 'open source' over and against Richard Stallman's explicitly ethical, and therefore contentious, term 'free software'. Morozov accuses O'Reilly of what Tom Scocca has recently defined as rhetorical smarm–dodging specific criticism by denying the appropriateness of criticism in general. O'Reilly has disputed the Morozov piece. Elsewhere he has presented his strategy as that of a 'marketer of big ideas' and described his deliberate promotion of the more business-friendly 'open source' rhetoric. This ideological debate is itself quite interesting. Geek anthropologist Chris Kelty observes that it is participation in this debate, more so than adherence to any particular view within it, that characterizes the larger “movement,” which he names the recursive public.

Despite his significance to me as someone with an open source software background, I was originally surprised when I heard Tim O'Reilly would be speaking at the BIDS launch. O'Reilly had promoted 'open source' and 'Web 2.0' and 'open government', but what did any of that have to do with 'data science'?

So I was amused when Norvig introduced O'Reilly by saying that he hadn't known he was a data scientist himself until O'Reilly wrote a Forbes article (in November 2011) naming him one of “The World's 7 Most Powerful Data Scientists.” Looking at the Google Trends data, we can see that November 2011 just about marks the rise of 'data science' from obscurity to popularity. Is Tim O'Reilly responsible for the rise of 'data science'?
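For what it's worth, here is a rough sketch of how one might check that claim today. It assumes the unofficial pytrends library as tooling; it's an illustration of the check, not what I actually did to produce the observation above.

```python
# Sketch: pull the Google Trends series for "data science" and see when
# interest first rises above a fraction of its eventual peak.
# Assumes the unofficial pytrends package (pip install pytrends).
from pytrends.request import TrendReq

pytrends = TrendReq(hl="en-US", tz=360)
pytrends.build_payload(["data science"], timeframe="2004-01-01 2013-12-31")
interest = pytrends.interest_over_time()  # pandas DataFrame indexed by date

# Show the first months where interest exceeds a quarter of its peak value.
threshold = interest["data science"].max() / 4
print(interest[interest["data science"] > threshold].head())
```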

Perhaps. O'Reilly explained that he got into data science by thinking about the end game for open source. As open source software becomes commodified (which for him I think means something like 'subject to competitive market pressure'), what becomes valuable is the data. And so he has been promoting data science in industry and government, and he believes the university can learn important lessons from those fields as well. He held up his Moto X phone and explained how it is 'always listening' and so can facilitate services like Google Now. All this would go towards a system with greater collective intelligence, a self-regulating system that would make regulators obsolete.

Looking at the progression of the use of maps, from paper to digital to being embedded in services and products like self-driving cars, O’Reilly agrees with Norvig about the importance of human-machine interaction. In particular, he believes that data scientists will need to know how to ask the right questions about data, and that this is the future of science. “Others will be left behind,” he said, not intending to sound foreboding.

I thought O'Reilly presented the combination of insight and boosterism I expected. To me, his presence at the BIDS launch meant that his significance as a public intellectual has progressed from business through governance and now to scientific thinking itself. This is wonderful for him, but it means that his writings and influence should be put under the scrutiny we would apply to an academic peer. It is appropriate to call him out for glossing over the privacy issues around a mobile phone that is “always listening,” or the moral implications for equality and justice of making regulators obsolete. Is his objectivity compromised by the fact that he runs a publishing company that sells complementary goods to the vast supply of publicly available software and data? Does his business agenda incentivize him to obscure the subtle differences between various segments of his market? Are we in the university victims of that obscurity as we grapple with multiple conflated meanings of “openness” in software and science (open to scrutiny and accountability, vs. open for appropriation by business, vs. open to meritocratic contribution)? As we ask these questions, we can be grateful to O'Reilly for getting us this far.

I’ve emphasized the talks given by Norvig and O’Reilly because they exposed what I think are some of the most interesting aspects of BIDS. One way or another, it will be revolutionary. Its funders will be very disappointed if it is not. But exactly how it is revolutionary is undetermined. The fact that BIDS is based in Berkeley, and not in Google or Microsoft or Stanford, guarantees that the revolution will not be an insipid or smarmy one which brushes aside political conflict or morality. Rather, it promises to be the site of fecund political conflict. “Let the fun begin!” said Chandler.

The opening remarks concluded and we broke for lunch and poster sessions–the Data Science Faire (named after O'Reilly's Maker Faire)…

What followed was a fascinating panel discussion led by astrophysicist Josh Bloom, with historian and university administrator Cathryn Carson, computer science professor and AMP Lab director Michael Franklin, and Deb Agrawal, a staff computer scientist at Lawrence Berkeley National Lab.

Bloom introduced the discussion jokingly as “just being among us scientists…and whoever is watching out there on the Internet,” perhaps nodding to the fact that the scientific community is not yet fully conscious that their expectations of privileged communication are being challenged by a world and culture of mobile devices that are “always listening.”

The conversation was about the role of people in data science.

Carson spoke as a domain scientist–a social scientist who studies scientists. Noting that social scientists tend to work in small teams led by graduate students motivated by their particular questions, she said her emphasis was on the people asking questions. Agrawal noted that the number of people needed to analyze a data set scales not with the size of the data but with its complexity–a practical point. (I'd argue that theoretically we might want to consider the “size” of data in terms of its compressibility–which would reflect its complexity–though this ignores a number of operational challenges.) For Franklin, people are a computational resource that can be part of a crowd-sourced process. In that context, the number of people needed does indeed scale with the use of people as data processors and sensors.
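To make that parenthetical aside concrete, here is a toy sketch of my own (not anything discussed on the panel) that uses compressed size as a crude proxy for complexity, so that “bigger” really means “harder to analyze”:

```python
import os
import zlib

def compressed_size(data: bytes) -> int:
    """Length of the zlib-compressed representation: a crude complexity proxy."""
    return len(zlib.compress(data, 9))

redundant = b"0123456789" * 100_000   # 1 MB of highly regular data
noisy = os.urandom(1_000_000)         # 1 MB of incompressible noise

print(compressed_size(redundant))  # small: low complexity despite the raw size
print(compressed_size(noisy))      # close to 1,000,000: here size tracks complexity
```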

Perhaps to follow through on Norvig’s line of reasoning, Bloom then asked pointedly if machines would ever be able to do the asking of questions better than human beings. In effect: Would data science make data scientists obsolete?

Nobody wanted to be the first to answer this question. Bloom had to repeat it.

Agrawal took a first stab at it. The science does not come from the data; the scientist chooses models and tests them. This is the work of people. Franklin agreed and elaborated–the wrong data too early can ruin the science. Agrawal noted that computers might find spurious signals in the noise.

Personally, I find these unconvincing answers to Bloom's question. Algorithms can generate, compare, and test alternative models against the evidence. Noise can, with enough data, be filtered away from the signal. Doing so pushes the theoretical limits of computing and information theory, but if Franklin is correct in his earlier point that people are part of the computational process, then there is no reason in principle why these tasks, too, might not be performed by, or at least assisted by, computers.
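As a toy illustration of what I mean (my own sketch, not anything presented at the launch), here is a program that generates alternative models, tests each against held-out evidence, and selects the one that captures the signal without chasing the noise:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 200)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)  # signal plus noise

# Hold out half the data as "evidence" the candidate models never see while fitting.
idx = rng.permutation(x.size)
train, test = idx[:100], idx[100:]

def validation_error(degree: int) -> float:
    coeffs = np.polyfit(x[train], y[train], degree)    # generate a candidate model
    predictions = np.polyval(coeffs, x[test])          # test it against held-out data
    return float(np.mean((predictions - y[test]) ** 2))

# Compare the candidates and pick the one the evidence favors.
best = min(range(10), key=validation_error)
print("selected polynomial degree:", best)
```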

Carson, who had been holding back her answer to listen to the others, had a bolder proposal: rather than try to predict the future of science, why not focus on the task of building that future?

In another universe, at that moment someone might have asked the one question no computer could have answered. “If we are building the new future of science, what should we build? What should it look like? And how do we get there?” But this is the sort of question disciplined scientists are trained not to ask.

Instead, Bloom brought things back to practicality: we need to predict where science will go in order to know how to educate the next generation of scientists. Should we be focusing on teaching them domain knowledge, or on techniques?

We have at the heart of BIDS the very fundamental problem of free will. Bloom suggests that if we can predict the future, then we can train students in anticipation of it. He is an astrophysicist and studies stars; he can be forgiven for the assumption that bodies travel in robust orbits. This environment is a more complex one. How we choose to train students now will undoubtedly affect how science evolves, since the process of science is at once the process of learning and of training new scientists. His descriptive question then falls back to the normative one: what science are we trying to build toward?

Carson was less heavy-handed than I would have been in her position. Instead, she asked Bloom how he got interested in data science. Bloom recalled his classical physics training, and the moment he discovered that to answer the kinds of questions he was asking, he would need new methods.

Franklin chimed in on the subject of education. He has heard it said that everyone in the next generation should learn to code. With marked humility for his discipline, he said he did not agree with this. But he said he did believe that everyone in the next generation should learn data literacy, echoing Norvig.

Bloom opened the discussion to questions from the audience.

The first was about the career paths for methodologists who write software instead of papers. How would BIDS serve them? It was a softball question which the panel hit out of the park. Bloom noted that the Moore and Sloan funders explicitly asked for the development of alternative metrics to measure the impact of methodologist contributions. Carson said that even with the development of metrics, as an administrator she knew it would be a long march through the institution to get those metrics recognized. There was much work to be done. “Universities got to change,” she rallied. “If we don’t change, Berkeley’s being great in the past won’t make it great in the future,” referring perhaps to the impressive history of research recounted by Chancellor Dirks. There was applause. Franklin pointed out that the open source community has its own metrics already. In some circles some of his students are more famous than he is for developing widely used software. Investors are often asking him when his students will graduate. The future, it seems, is bright for methodologists.

At this point I lost my Internet connection and had to stop livetweeting the panel; those tweets are the notes from which I am writing these reflections. Recalling from memory, there was one more question from Kristina Kangas, a PhD student in Integrative Biology. She cited research about how researchers interpreting data wind up reflecting back their own biases. What did this mean for data science?

Bloom gave Carson the last word. It is a social scientific fact, she said, that scientists interpret data in ways that fit their own views. So it’s possible that there is no such thing as “data literacy”. These are open questions that will need to be settled by debate. Indeed, what then is data science after all? Turning to Bloom, she said, “I told you I would be making trouble.”

Reflexive data science

In anticipation of my dissertation research and in an attempt to start a conversation within the emerging data science community at Berkeley, I’m working on a series of blog posts about reflexive data science. I will update this post with an index of them and related pieces as they are published over time.

“Reflexive data science: an overview”, UC Berkeley D-Lab Blog.
Explaining how the stated goals of the Berkeley Institute for Data Science–open source, open science, alt-metrics, and empirical evaluation–imply the possibility of an iterative, scientific approach to incentivizing scientists.

Reflexive data science: technical, practical, and emancipatory interests?

As Cathryn Carson is currently my boss at Berkeley’s D-Lab, it seems like it behooves me to read her papers. Thankfully, we share an interest in Habermasian epistemology. Today I read her “Science as instrumental reason: Heidegger, Habermas, Heisenberg.”

Though I can barely do justice to the paper, I'll try to summarize: it grapples with the history of how science became constructed as a purely instrumental project (a mode of inquiry that perfects means without specifying particular ends) through the interactions between Heisenberg, Germany's premier theoretical physicist of his time, and Heidegger, the great philosopher, and then through Habermas's later response to Heidegger.

Heisenberg, most famous perhaps for the Heisenberg Uncertainty Principle, was himself reflective about the role of the scientist within science, and identified the limits of the subject and of measurement within physics. But far from surpassing an older metaphysical idea of the subject-object divide, this only entrenched the scientist further, according to Heidegger. This is because the scientist qua scientist never encounters the world in a way that is not tied up in the scientific, technical mode, and so pure being eludes him. That might simply mean that pure being is left to philosophers while scientists go on with their instrumental project, but this mode of inquiry became insufficient when scientists were called on to comment on nuclear proliferation policy.

Such policy decisions are questions of praxis, or practical action in the human (as opposed to natural) world. Habermas was concerned with the hermeneutic epistemology of praxis, as well as the critical epistemology of emancipation, which are more the purview of the social sciences. Habermas tends to segment these modes of inquiry from each other, without (as far as I’ve encountered so far) anticipating a synthesis.

In data science, we see the broadly positivist, statistical, analytic treatment of social data. In its commercial applications to sell ads or conduct high-speed trading, we could say on a first pass that the science serves the technical human interest: prediction and control for some unspecified end. But that would be misleading. The breadth of methodological options available to the data scientist means that the methods are often very closely tailored to the particular ends and conditions of the project. Data science as a method is an instrument. But the results of commercial data science are by and large not nomological (identifying laws of human behavior), but rather an immediately applied idiography. Or, more than an applied idiography, data science provides a probabilistic profile of its diverse subjects–an electron cloud of possibilities that the commercial data scientist uses to steer behavior en masse.

Of course, the uncertainty principle applies here as well: the human subject reacts to being measured, and may change direction upon seeing that they are being targeted with this ad or that.

Further complicating the picture is that the application of the 'social technology' of commercially driven data science is praxis, albeit in an apolitical sense. Though enmeshed in a thick and complex technological web, showing an ad and having it clicked on is nevertheless a move in the game of social relations. It is a handshake between cyborgs. And so even commercial data science must engage in hermeneutics, if Habermas is correct. Natural language processing provides the uncomfortable edge case here: can we have a technology that accomplishes hermeneutics for us? Apparently so, if a machine can identify somebody's interest in a product or service from their linguistic output.
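To make the edge case concrete, here is a toy sketch of such a system. The corpus, labels, and query are entirely invented for illustration; this is not a description of any real deployment, just the minimal shape of a classifier that infers purchase intent from short texts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# A tiny invented corpus: short texts labeled by whether they express purchase intent.
texts = [
    "just ordered the new phone, can't wait for it to arrive",
    "anyone have recommendations for a good laptop under 800 dollars?",
    "the weather in berkeley is lovely today",
    "reading heidegger on the bus again",
]
purchase_intent = [1, 1, 0, 0]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, purchase_intent)

# Shared vocabulary ("phone") should plausibly tip this toward intent.
print(model.predict(["thinking about upgrading my phone soon"]))
```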

Though jarring, this is easier to cope with intellectually if we see the hermeneutic agent as a socio-technical system, as opposed to a purely technical system. Cyborg praxis will include statistical/technical systems made of wires and silicon, just as meatier praxis includes statistical/technical systems made of proteins and cartilage.

But what of emancipation? This is the least likely human interest to be advanced by commercial interests. If I’ve got my bearings right, the emancipatory interest in the (social) sciences comes from the critical theory tradition, perhaps best exemplified in German thought by the Frankfurt School. One is meant to be emancipated by such inquiry from the power of the capitalist state. What would it mean for there to be an emancipatory data science?

I was recently asked out of the blue in an email whether there were any organizations using machine learning and predictive analytics toward social justice interests. I was ashamed to say I didn't know of any organizations doing that kind of work. It is hard to imagine what an emancipatory data science would look like. An education or communication about data scientific techniques might be emancipatory (I was trying to accomplish something like this with Why Weird Twitter, for what it's worth), but that was a qualitative study, not a data scientific one.

Taking our cue from above, an emancipatory data science would have to use data science methods toward the human interest of emancipation. For this, we would need to use the methods to understand the conditions of power and dependency that bind us. This is difficult for an individual, but these techniques could perhaps be used to greater effect by an emancipatory sociotechnical organization. Such an organization would need to be concerned with its own autonomy as well as the autonomy of others.

The closest thing I can imagine to such a sociotechnical system is what Kelty describes as the recursive public: the loose coalition of open source developers, open access researchers, and others concerned with transforming their social, economic, and technical conditions for emancipatory ends. Happily, the D-Lab's technical infrastructure team appears to be populated entirely by citizens of the recursive public. Though this is naturally a matter of minor controversy within the lab (it's hard to convince folks who haven't directly experienced the emancipatory potential of the movement of its value), I'm glad that it stands on more or less robust historical grounds. While the course I am co-teaching on Open Collaboration and Peer Production will likely not get into critical theory, I expect that exposure to more emancipated communities of praxis will make something click.

What I’m going for, personally, is a synthetic science that is at once technical and engaged in emancipatory praxis.

Complications in Scholarly Hypertext

I’ve got a lot of questions about on-line academic publishing. A lot of this comes from career anxiety: I am not a very good academic because I don’t know how to write for academic conferences and journals. But I’m also coming from an industry that is totally eating the academy’s lunch when it comes to innovating and disseminating information. People within academia are increasingly feeling the disruptive pressure of alternative publication venues and formats, and moreover seeing the need for alternatives for the sake of the intellectual integrity of the whole enterprise. Open science, open data, reproducible research–these are keywords for new practices that are meant to restore confidence in science itself, in part by making it more accessible.

One manifestation of this trend is the transition of academic group blogs into academic quasi-journals or on-line magazines. I don’t know how common this is, but I recently had a fantastic experience of this writing for Ethnography Matters. Instead of going through an opaque and problematic academic review process, I worked with editor Rachelle Annechino to craft a piece about Weird Twitter that was appropriate for the edition and audience.

During the editing process, I tried to unload everything I had to say about Weird Twitter so that I could at last get past it. I don't consider myself an ethnographer and I don't want to write my dissertation on Weird Twitter. But Rachelle encouraged me to split off the pseudo-ethnographic section into a separate post, since the first half was more consistent with the Virtual Identity edition. (Interesting how the word “edition”, which has come to mean “all the copies of a specific issue of a newspaper”, in the digital context returns to its etymological roots as simply something published or produced (past participle).)

Which means I’m still left with the (impossible) task of doing an ethnography (something I’m not very well trained for) about Weird Twitter (which might not exist). Since I don’t want to violate the contextual integrity of Weird Twitter more than I already have, I’m reluctant to write about it in a non-Web-based medium.

This carries with it a number of challenges, not least of which is the reception on Twitter itself.

What my thesaurus and I do in the privacy of our home is our business and anyway entirely legal in the state of California. But I’ve come to realize that forced disclosure is an occupational hazard I need to learn to accept. What these remarks point to, though, is the tension between access to documents as data and access to documents as sources of information. The latter, as we know from Claude Shannon, requires an interpreter who can decode the language in which the information is written.

Expert language is a prison for knowledge and understanding. A prison for intellectually significant relationships. It is time to move beyond the institutional practices of triviledge

– Taylor and Saarinen, 1994, quoted in Kolb, 1997

Is it possible to get away from expert language in scholarly writing? Naively, one could ask experts to write everything “in plain English.” But that doesn’t do language justice: often (though certainly not always) new words express new concepts. Using a technical vocabulary fluently requires not just a thesaurus, but an actual understanding of the technical domain. I’ve been through the phase myself in which I thought I knew everything and so blamed anything written opaquely to me on obscurantism. Now I’m humbler and harder to understand.

What is so promising about hypertext as a scholarly medium is that it offers a solution to this problem. Wikipedia is successful because it directly links jargon to further content that explains it. Those with the necessary expertise to read something can get the intended meaning out of an article, and those that are confused by terminology can romp around learning things. Maybe they will come back to the original article later with an expanded understanding.

xkcd: The Problem with Wikipedia

Hypertext and hypertext-based reading practices are valuable for making one's work open and accessible. But it's not clear how to combine these with scholarly conventions on referencing and citations. Just to take Ethnography Matters as an example, for my article I used in-line linking and, where I got around to it, parenthetical bibliographic information. Contrast this with Heather Ford's article in the same edition, which has no links and a section at the end for academic references. The APA has rules for citing web resources within an academic paper. What's not clear is how directly linking citations within an academic hypertext document should work.

One reason for the lack of consensus around this issue is that citation formatting is a pain in the butt. For off-line documents, word processing software provides myriad tools for streamlining bibliographic work. But for publishing academic work on the web, we write in markup languages or WYSIWYG editors.
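To illustrate the kind of tooling I'm imagining–a purely hypothetical sketch, with a made-up record and format choices rather than any existing standard–here is one bibliographic record rendered both as a web-native inline link and as a conventional reference entry:

```python
# Hypothetical record and formats, for illustration only.
citation = {
    "author": "Doe, J.",
    "year": 2013,
    "title": "A hypothetical article",
    "url": "https://example.org/doe-2013",
}

def inline_link(c: dict) -> str:
    """Hypertext style: link the claim directly to its source."""
    return f"[{c['author']} {c['year']}]({c['url']})"

def reference_entry(c: dict) -> str:
    """APA-ish style: an entry for a reference list at the end of the piece."""
    return f"{c['author']} ({c['year']}). {c['title']}. Retrieved from {c['url']}"

print(inline_link(citation))
print(reference_entry(citation))
```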

Since standards on the web tend to evolve through “rough consensus and running code”, I expect we’ll see a standard for this sort of thing emerge when somebody builds a tool that makes it easy for them to follow. This leads me back to fantasizing about the Dissertron. This is a bit disturbing. As much as I’d like to get away from studying Weird Twitter, I see now that a Weird Twitter ethnography is the perfect test-bed for such a tool precisely because of the hostile scrutiny it would attract.