Digifesto

Planning the Dissertron

In my PhD program, I’ve recently finished my coursework and am meant to start focusing on research for my dissertation. Maybe because of the hubbub around open access research, maybe because I still see myself as a ‘hacker’, maybe because it’s somehow recursively tied into my research agenda, or because I’m an open source dogmatic, I’ve been fantasizing about the tools and technology of publication that I want to work on my dissertation with.

For this project, which I call the Dissertron, I’ve got a loose bundle of requirements feature creeping its way into outer space:

  1. Incremental publishing of research and scholarship results openly to the web.
  2. Version control.
  3. Mathematical rendering a la LaTeX.
  4. Code highlighting a la the hacker blogs.
  5. In browser rendering of data visualizations with d3, where appropriate.
  6. Site is statically generated from elements on the file system, wherever possible.
  7. Machine readable metadata on the logical structure of the dissertation argument, which gets translated into static site navigation elements.
  8. Easily generated glossary with links for looking up difficult terms in-line (or maybe in-margin)
  9. A citation system that takes advantage of hyperlinking between resources wherever possible.
  10. Somehow, enable commenting. But more along the lines of marginalia comments (comments on particular lines or fragments of text) rather than blog comments. “Blog” style comments should be facilitated as notes on separately hosted dissertrons, or maybe a dissertron hub that aggregates and coordinates pollination of content between dissertrons.

This is a lot, and arguably just a huge distraction from working on my dissertation. However, it seems like this or something like it is a necessary next step in the advance of science and I don’t see how I really have much choice in the matter.

Unfortunately, I’m traveling, so I’m going to miss the PLOS workshop on Markdown for Science tomorrow. That’s really too bad, because Scholarly Markdown would get me maybe 50% of the way to what I want.

Right now the best tool chain I can imagine for this involves Scholarly Markdown, run using Pandoc, which I just now figured out is developed by a philosophy professor at Berkeley. Backing it by a Git repository would allow for incremental changes and version control.

Static site generation and hosting is a bit trickier. I feel like GitHub’s support of Jekyll makes it a compelling choice, but hacking it to fit the academic frame I’m thinking in might be more trouble than it’s worth. While it’s a bit of an oversimplification to say this, my impression is that at my university at least there is a growing movement to adopt Python as the programming language of choice for scientific computing. The exceptions seem to be people in the Computer Science department who are backing Scala.

(I like both languages and so can’t complain, except that it makes it harder to do interdisciplinary research if there is a technical barrier in their toolsets. As more of scientific research becomes automated, it is bound to get more crucial that scientific processes (broadly speaking) inter-operate. I’m incidentally excited to be working on these problems this summer for Berkeley’s new Social Science Data Lab. A lot of interesting architectural design is being masterminded by Aaron Culich, who manages the EECS department’s computing infrastructure. I’ve been meaning to blog about our last meeting for a while…but I digress)

Problem is, neither Python nor Scala is Ruby, and Ruby is currently leading the game (in my estimate, somebody tell me if I’m wrong) in flexible and sexy smooth usable web design. And then there’s JavaScript, improbably leaking into the back end of the software stack after overflowing the client side.

So for the aspiring open access indie web hipster hacker science self-publisher, it’s hard to navigate the technical terrain. I’m tempted to string together my own rig depending mostly on Pandoc, but even that’s written in Haskell.

These implementation-level problems suggest that the problem needs to be pushed up a level of abstraction to the question of API and syntax standards around scientific web publishing. Scholarly Markdown can be a standard, hopefully with multiple implementations. Maybe there needs to be a standard around web citations as well (since in an open access world, we don’t need the same level of indirection between a document and the works it cites. Like blog posts, web publications can link directly to the content they derive from.)

POSSE homework: how to contribute to FOSS without coding

One of the assignments for the POSSE workshop is the question of how to contribute to FOSS when you aren’t a coder.

I find this an especially interesting topic because I think there’s a broader political significance to FOSS, but those who see FOSS as merely the domain of esoteric engineers can sometimes be a little freaked out by this idea. It also involves broader theoretical questions about whether or how open source jibes with participatory design.

In fact, they have compiled a list of lists of ways to contribute to FOSS without coding: this, this, this, and this are provided in the POSSE syllabus.

Turning our attention from the question in the abstract, we’re meant to think about it in the context of our particular practices.

For our humanitarian FOSS project of choice, how are we interested in contributing? I’m fairly focused in my interests on open source participation these days: I’m very interested in the problem of community metrics and especially how innovation happens and diffuses within these communities. I would like to be able to build a system for evaluating that kind of thing that can be applied broadly to many projects. Ideally, it could do things like identify talented participants across multiple projects, or suggest interventions for making projects work better.

It’s an ambitious research project, but one for which there is plenty of data to investigate from the open source communities themselves.

What about teaching a course on such a thing? I anticipate that my students are likely to be interested in design as well as positioning their own projects within the larger open source ecosystem. Some of the people who I hope will take the class have been working on FuturePress, an open source e-book reading platform. As they grow the project and build the organization around it, they will want to be working with constituent technologies and devising a business model around their work. How can a course on Open Collaboration and Peer Production support that?

These concerns touch on so many issues outside of the consideration of software engineering narrowly (including industrial organization, communication, social network theory…) that it’s daunting to try to fit it all into one syllabus. But we’ve been working on one that has a significant hands-on component as well. Really I think the most valuable skill in the FOSS world is having the chutzpah to approach a digital community, propose what you are thinking, and take the criticism or responsibility that comes with that.

What concrete contribution a student uses to channel that energy should…well, I feel like it should be up to them. But is that enough direction? Maybe I’m not thinking concretely enough for this assignment myself.

POSSE and the FOSS field trip

I’m excited to be participating in POSSE (Professor’s Open Source Software Experience) this coming weekend in Philadelphia. It’s designed to train computer science professors how to use participation in open source communities to teach computer science. Somehow, they let me in as a grad student.

My goals are somewhat unusual for the program, I imagine. I’m not even in a computer science department. But I do have background in open source development, and this summer I’ll be co-teaching a course on Open Collaboration and Peer Production at Berkeley’s I School. We aren’t expecting all the students to be coders, so though there is a hands on component, it may be with other open collaborative projects like Wikipedia or OpenStreetMap. Though we aren’t filling a technical requirement (the class is aimed at Masters students), it will fulfill a management requirement. So you might say it’s a class on theory and practice of open collaborative community management.

My other education interest besides my own teaching is in the role of open source and computer science education in the larger economy. Lately there have been a lot of startups around programmer education. That makes sense, because demand for programming talent exceeds supply now, and so there’s an opportunity to train new developers. I’m curious whether it would be possible to build and market an on-line, MOOC-style programming course based on apprenticeship within open source communities.

One of our assignments to complete before the workshop is an open source community field trip. We’re supposed to check out a number of projects and get a sense of how to distinguish between strong and weak projects at a glance.

The first thing I noticed was that SourceForge is not keeping up with web user experience standards. That’s not so surprising, since as a FOSS practitioner I’m more used to projects hosted on GitHub and Google Code. Obviously, that hasn’t always been the case. But I’m beginning to think SourceForge’s role may now be mainly historic. Either that, or I have a heavy bias in my experience because I’ve been working with a lot of newer, “webbier” code. Maybe desktop projects still have a strong SourceForge presence.

I was looking up mailing server software, because I’m curious about mining and modding mailing lists as a research area. Fundamentally, they seem like one of the lightest weight and most robust forms of on-line community out there, and the data is super rich.

Mailing list projects on SourceForge appear to be mainly in either Java, PHP, or some blend of Python and C. There are a couple of Enterprise solutions. Several of the projects have moved their source code, hosted version control, and issue tracking off of SourceForge. Though the projects were ranked from “Inactive” through “Beta” to “Mature” and “Production/Stable”, usage of the tags was a little inconsistent across projects. Projects with a lot of weekly downloads tended to be either Mature or Production/Stable or both.

I investigated Mailman in particular. It’s an impressive piece of software; who knows how many people use Mailman mailing lists? I’m probably on at least ten myself. But it’s a fairly humble project in terms of its self-presentation and what people have done with it.

Turns out it has a lead developer, Barry Warsaw, who works at Canonical, and a couple other core committers, in addition to other contributors. There appears to be a v3.0 in production, which suddenly I’m pretty excited about.

POSSE has a focus on humanitarian FOSS projects. I’m not sure exactly how they define humanitarian beyond the description they give: “that is, projects for which the primary purpose is to provide some social benefit such as economic development, disaster relief, health care, ecology. Examples include Mifos, Sahana and OpenMRS.”

For the purpose of this workshop I plan to look into Ushahidi. I’ve heard so many good things about it, but frustratingly even after working four years on open source geospatial software, including a couple crowdsourced mapping apps, I never took a solid look at Ushahidi. Maybe because it was in PHP. I’m proud to say the project I put the most effort into, GeoNode, also has a humanitarian purpose. (GeoNode also now has a very pretty new website, and totally revamped user interface for its 2.0 release, now in alpha.) And though not precisely a software project, I’ve spent a lot of time admiring the intrepid Humanitarian OpenStreetMap Team for their use of open data as a humanitarian means and end.

But Ushahidi–there’s something you don’t see everyday.

We’re asked, on the POSSE field trip prompt, how we would decide whether our selected project was worth contributing to as an IT professional. The answer is: it depends on whether I could do it for my job, but I’ve asked around about the project and community some and it seems like great people and usable software. I’d be proud to contribute to it, though at this point I expect my comparative advantage would be on the data analysis end (both of the community that builds it and data created by it) rather than in the core.

We were also asked to check out projects on Ohloh, which has also had a user interface revamp since I last looked carefully at it. Maybe significantly, we were asked to compare a number of different projects (two of them web browsers), but there was no feature on the website that provided a side-by-side comparison of the projects.

Also, one thing Ohloh doesn’t do yet is analytics on mailing lists. Which is odd, since that’s often where developers within a community get the most visceral sense of how large their community is (in my experience). Mailing lists wind up being the place where users can as participants affect software development, and where a lot of conflict resolution occurs. (Though there can be a lot of this on issue tracker discussions as well.)

This summer I hope to begin some more rigorous research into mailing list discussions and open source analytics. Seeing how Ohloh has moved forward reminds me I should be sharing my research with them. The focus of POSSE on project evaluation is encouraging–I’m curious to see where it goes next.

The recursive public as practice and imaginary

Chris Kelty’s Two Bits: The Cultural Significance of Free Software is one of the best synthetic histories of the Internet and Free Culture that I’ve encountered so far. Most exciting about it is his concept of the recursive public, the main insight of his extensive ethnographic work:

A recursive public is a public that is vitally concerned with the material and practical maintenance and modification of the technical, legal, practical, and conceptual means of its own existence as a public; it is a collective independent of other forms of constituted power and is capable of speaking to existing forms of power through the production of actually existing alternatives.

Speaking today about the book with Nick Doty and Ashwin Mathew, we found it somewhat difficult to tease out the boundaries of this concept. What publics aren’t recursive publics? And are the phenomena Kelty sometimes picks out by this concept (events in the history of Free Software) really examples of a public after all?

Just to jot down some thoughts:

  • If what makes the public is a social organization that contests other forms of institutional power (such as the state or the private sector), then there does seem to be an independence to the FOSS movement that makes the label appropriate. I believe this holds even when the organizations embodying this movement explicitly take part in state or commercial activities–as in resistance to SOPA, for example–though Ashwin seemed to think that was problematic.
  • I read recursion to refer to many aspects of this public. These include both the mutual reinforcement of its many components through time and the drive to extend its logic (e.g. the logic of open systems that originated in the IT sector in the 80s) beyond its limits. If standards are open, then the source code should be next. If the source code is open, then the hardware is next. If the companies aren’t open, then they’re next. Etc.

I find the idea of the recursive public compelling because it labels something aspirational: a functional unit of society that is cohesive despite its internal ideological diversity. However, it can be hard to tell whether Kelty is describing what he thinks is already the case or what he aspires for it to be.

The question is whether the recursive public is referring to the social imaginary of the FOSS movement or its concrete practices (which he lists: arguing about license, sharing source code, conceiving of the open, and coordinating collaboration). He does brilliant work in showing how the contemporary FOSS movement is a convergence of the latter. Misusing a term of Piaget’s, I’m tempted to call this an operational synthesis, analogous to how a child’s concept of time is synthesized through action from multiple phenomenological modalities. Perhaps it’s not irresponsible to refer to the social synthesis of a unified practice from varied origins with the same term.

Naming these practices, then, is a way of making them conscious and providing the imaginary with a new understanding of its situation.

Saskia Sassen in Territory, Authority, Rights notes that in global activism, action and community organization is highly local; what is global is the imagined movement in which one participates. Manuel Castells refers to this as the power of identity in social movements; the deliberate “reprogramming of networks” (of people) with new identities is a form of communication power that can exert political change.

It’s difficult for me to read Two Bits and not suspect Kelty of deliberately proposing the idea of a recursive public as an intellectual contribution to the self-understanding of the FOSS movement in a way that is inclusive of those that vehemently deny that FOSS is a movement. By identifying a certain set of shared practices as a powerful social force with its own logic in spite of and even because of its own internal ideological cacophony (libertarian or socialist? freedom or openness? fun, profit, or justice?), he is giving people engaged in those practices a kind of class consciousness–if they read his book.

That is good, because the recursive public is only one of many powers tussling over control of the Internet, and it’s a force for justice.

Ascendency and overhead in networked ecosystems

Ulanowicz (2000) proposes, in information-theoretic terms, several metrics for ecosystem health, where one models an ecosystem as, for example, a trophic network. Principal among them is ascendancy, which is a measure of the extent to which energy flows in the system are predictably structured, weighted by the total energy of the system. He believes that systems tend towards greater ascendancy in expectation, and that this is predictive of ecological ‘succession’ (and to some extent ecological fitness). On the other hand, overhead, which is unpredictability (perhaps, inefficiency) in energy flows (“free energy”?), is important for the system’s resiliency towards external shocks.

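For reference, these are the usual forms of the metrics as I understand them from Ulanowicz’s work. The notation is an assumption on my part: T_ij is the flow from compartment i to compartment j, a dot marks summation over that index, A is ascendancy, Φ is overhead, and C is the development capacity that bounds them both.

```latex
\begin{align}
A    &= \sum_{i,j} T_{ij} \log\frac{T_{ij}\,T_{..}}{T_{i.}\,T_{.j}} \\
\Phi &= -\sum_{i,j} T_{ij} \log\frac{T_{ij}^{2}}{T_{i.}\,T_{.j}} \\
C    &= -\sum_{i,j} T_{ij} \log\frac{T_{ij}}{T_{..}} \;=\; A + \Phi
\end{align}
```

Adding the first two lines term by term leaves only the log of T_.. over T_ij inside the sum, which is the third line, so ascendancy and overhead split the capacity between them.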
At least in the papers I’ve read so far, Ulanowicz is not mathematically specific about the mechanism that leads to greater ascendancy, though he sketches some explanations. Autocatalytic cycles within the network reinforce their own positive perturbations and mutations, drawing in resources from external sources, crowding out and competing with them. These cycles become agents in themselves, exerting what Ulanowicz suggests is Aristotelian final or formal causal power on the lower level components. In this way, freely floating energy is drawn into structures of increasing magnificence and complexity.

I’m reminded of Bataille’s The Accursed Share, in which he attempts to account for societal differences and the arc of human history through each society’s use of its excess energy. “The sexual act is in time what the tiger is in space,” he says, insightfully. The tiger, as an apex predator, is a flame that clings brilliantly to the less glamorous ecosystem that supports it. That is why we adore tigers. And yet their existence is fragile, as it depends on both the efficiency and stability of the rest of their network. When their environment is disturbed, they are the first to suffer.

Ulanowicz cites himself suggesting that a similar framework could be used to analyze computer networks. I have not read his account yet, though I anticipate several difficulties. He suggests that data flows in a computer network are analogous to energy flows within an ecosystem. That has intuitive appeal, but obscures the fact that some data is more valuable than others. A better analogy might be money as a substitute for energy. Or maybe there is a way to reduce both to a common currency, at least for modeling purposes.
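To make the analogy concrete, here is a minimal sketch that computes the three metrics from a matrix of flows, whatever the flows happen to measure (energy, bytes, or money). The object name `AscendencySketch`, the `Metrics` case class, and the choice of base-2 logarithms are all assumptions made for illustration; nothing here comes from Ulanowicz’s own code.

```scala
object AscendencySketch {
  // Hypothetical container for the three metrics (in bits times flow units).
  final case class Metrics(ascendency: Double, overhead: Double, capacity: Double)

  private def log2(x: Double): Double = math.log(x) / math.log(2.0)

  // flows(i)(j) is the flow from node i to node j; the matrix must be rectangular.
  def metrics(flows: Vector[Vector[Double]]): Metrics = {
    val rowSums = flows.map(_.sum)           // total outflow of each node
    val colSums = flows.transpose.map(_.sum) // total inflow of each node
    val total = rowSums.sum                  // total system throughput

    // One (ascendency term, capacity term) pair per non-zero flow.
    val terms = for {
      i <- flows.indices
      j <- flows(i).indices
      t = flows(i)(j)
      if t > 0.0
    } yield (
      t * log2(t * total / (rowSums(i) * colSums(j))),
      -t * log2(t / total)
    )

    val ascendency = terms.map(_._1).sum
    val capacity = terms.map(_._2).sum
    // Overhead is whatever capacity is not taken up by ascendency.
    Metrics(ascendency, capacity - ascendency, capacity)
  }
}
```

A two-node network in which each node sends everything to the other has no overhead at all, while a network that spreads the same throughput evenly over every possible link has no ascendancy. That is the trade-off between efficiency and resilience in miniature.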

Econophysics has been gaining steam, albeit controversially. Without knowing anything about it but based just on statistical hunches, I suspect that this comes down to using more complex models on the super duper complex phenomenon of the economy, and demonstrating their success there. In other words, I’m just guessing that the success of econophysics modeling is due to the greater degrees of freedom in the physics models compared to non-dynamic, structural equilibrium models. However, since ecology models the evolutionary dynamics of multiple competing agents (and systems of those agents), it’s possible that those models could capture quite a bit of what’s really going on and even be a source of strategic insight.

Indeed, economics already has a sense of stable versus unstable equilibria that resonates with the idea of stability of ecological succession. These ideas translate into game theoretic analysis as well. As we do more work with Strategic Bayesian Networks or other constructs to model equilibrium strategies in a networked, multi-agent system, I wonder if we can reproduce Ulanowicz’s results and use his ideas about ascendancy (which, I’ve got to say, are extraordinary and profound) to provide insight into the information economy.

I think that will require translating the ecosystem modeling into Judea Pearl’s framework for causal reasoning. Having been indoctrinated in Pearl’s framework in much of my training, I believe that it is general enough to subsume Ulanowicz’s results. But I have some doubt. In some of his later writings Ulanowicz refers explicitly to a “Hegelian dialectic” between order and disorder as a consequence of some of his theories, and between that and his insistence on his departure from mechanistic thinking over the course of his long career, I am worried that he may have transcended what it’s possible to do even with the modeling power of Bayesian networks. The question is: what then? It may be that once one’s work sublimates beyond our ability to model explicitly and intervene strategically, it becomes irrelevant. (I get the sense that in academia, Ulanowicz’s scientific philosophizing is a privilege reserved for someone tenured who, late in their career, is free to make their peace with the world in their own way.) But reading his papers is so exhilarating to me. I’ve had no prior exposure to ecology, so his papers are packed with fresh ideas. So while I don’t know how to justify it to any of my mentors or colleagues, I think I just have to keep diving into it when I can, on the side.

@#$%! : variance annotations in Scala’s unsound parameterized types

[error] /home/sb/ischool/cs294/hw3/src/main/scala/TestScript.scala:32: type mismatch;
[error] found : Array[wikilearn.TestScript.parser.Page]
[error] required: Array[wikilearn.WikiParser#Page]
[error] Note: wikilearn.TestScript.parser.Page <: wikilearn.WikiParser#Page, but class Array is invariant in type T.
[error] You may wish to investigate a wildcard type such as `_ <: wikilearn.WikiParser#Page`. (SLS 3.2.10)

wtf, Scala.  You know exactly what I’m trying to do here.

EDIT: I sent a link to the above post to David Winslow. He responded with a crystal clear explanation that was so great I asked him if I could include it here. This is it, below:

It’s a feature, not a bug :) This is actually the specific issue that Dart had in mind when they put this note in the language spec:

The type system is unsound, due to the covariance of generic types. This is a deliberate choice (and undoubtedly controversial). Experience has shown that sound type rules for generics fly in the face of programmer intuition. It is easy for tools to provide a sound type analysis if they choose, which may be useful for tasks like refactoring.

Which of course caused some hubbub among the static typing crowd.

The whole issue comes down to the variance annotations of type parameters. Variance influences how type parameters relate to the subtyping relationships of parameterized types:

Given types A and B, where A is a supertype of B:
trait Invariant[T] means there is no subtype relationship between Invariant[A] and Invariant[B]. (Either could be used as an Invariant[_] though.)
trait Covariant[+T] means Covariant[A] is a supertype of Covariant[B].
trait Contravariant[-T] means Contravariant[A] is a subtype of Contravariant[B].
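The covariant and contravariant cases can be sketched in a few lines. All the names here (`Animal`, `Cat`, `Producer`, `Consumer`) are hypothetical stand-ins, with `Animal` playing the role of A and `Cat` the role of B.

```scala
object VarianceSketch {
  class Animal { def name: String = "animal" }
  class Cat extends Animal { override def name: String = "cat" }

  // Covariant: T only appears as a result type.
  trait Producer[+T] { def get: T }
  // Contravariant: T only appears as a parameter type.
  trait Consumer[-T] { def describe(t: T): String }

  val catProducer: Producer[Cat] = new Producer[Cat] { def get: Cat = new Cat }
  // Allowed because Producer[Cat] is a subtype of Producer[Animal].
  val animalProducer: Producer[Animal] = catProducer

  val animalConsumer: Consumer[Animal] = new Consumer[Animal] {
    def describe(t: Animal): String = "saw " + t.name
  }
  // Allowed because Consumer[Animal] is a subtype of Consumer[Cat].
  val catConsumer: Consumer[Cat] = animalConsumer
}
```

Dropping the `+` or `-` annotation from either trait should turn the corresponding widening assignment into a type mismatch much like the one at the top of this post.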

The basic rule of thumb is that if you produce values of type T, you can be covariant in T, and if you consume values of type U, you can be contravariant in type U. For example, Function1 has two type parameters, the parameter type A and the result type T. It is contravariant in A and covariant in T. An (Any => String) can be used where a (String => Any) is expected, but not the other way around.
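The Function1 case is small enough to try directly. This is a sketch with made-up value names (`describe`, `widened`); only the function types themselves come from the explanation above.

```scala
object FunctionVarianceSketch {
  // Accepts anything, returns a String.
  val describe: Any => String = x => "value: " + x.toString

  // Compiles: the parameter type may narrow (Any to String) and the
  // result type may widen (String to Any).
  val widened: String => Any = describe

  // The reverse direction is rejected by the compiler:
  // val narrowed: Any => String = widened
}
```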

So, what about the type parameter for Array[T]? Among other operations, Arrays provide:

class Array[T] {
  def apply(i: Int): T // "producing" a T
  def update(i: Int, t: T): Unit // "consuming" a T
}

When the type parameter appears in both contravariant and covariant positions, the only option is to make it invariant.

Now, it’s interesting to note that in the Java language Arrays are treated as if they are covariant. This means that you can write a Java program that doesn’t use casts, passes the typechecker, and generates a type error at runtime; the body of main() would look like:

String[] strings = new String[1];
Object[] objects = strings;
objects[0] = Integer.valueOf(0); // the runtime error occurs at this step, but even if it didn't: 
System.out.println(strings[0]); // what happens here?

Anyway, the upshot is that immutable collections only use their types in covariant positions (you can get values out, but never insert) so they are much handier. Does your code work better if you replace your usage of Array with Vector? Alternatively, you can always provide the type parameter when you construct your array. Array("") is an Array[String], but Array[AnyRef]("") is an Array[AnyRef].
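Those options can be sketched side by side. `Page` and `WikiPage` below are simplified, hypothetical classes; the real error involves a path-dependent `parser.Page`, which this sketch does not try to reproduce. The bounded type parameter `P <: Page` is used here in place of the wildcard the compiler suggests, since it plays the same role.

```scala
object ArrayVarianceSketch {
  class Page { def title: String = "page" }
  class WikiPage extends Page { override def title: String = "wiki page" }

  // Vector is covariant, so a Vector[WikiPage] can be passed here as-is.
  def titlesOfVector(pages: Vector[Page]): Vector[String] = pages.map(_.title)

  // Array is invariant, so the method has to accept "an array of some subtype of Page".
  def titlesOfArray[P <: Page](pages: Array[P]): Vector[String] =
    pages.toVector.map(_.title)

  val wikiVector: Vector[WikiPage] = Vector(new WikiPage)
  val wikiArray: Array[WikiPage] = Array(new WikiPage)

  // Giving the type parameter at construction builds an Array[Page] directly.
  val widenedArray: Array[Page] = Array[Page](new WikiPage)
}
```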

Bash script for converting all .wav files in a directory to .mp3

I’ve been working with music files lately trying to get Steve Morrell‘s music online. In the process I’ve had to convert his albums, which I’ve ripped in .wav format, to .mp3.

To accomplish this, I’ve written a short bash script. It requires a number of tricks I wasn’t familiar with and had to look up.

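#!/bin/bash

# Save the field separator, then split only on newlines so that
# file names containing spaces survive the loop over the ls output.
SAVEIFS=$IFS
IFS=$(echo -en "\n\b")

for file in $(ls *wav)
do
  # Strip the .wav extension to get the base name
  name=${file%%.wav}
  lame -V0 -h -b 160 --vbr-new "$name.wav" "$name.mp3"
done

# Restore the original field separator
IFS=$SAVEIFS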

Though it isn’t recommended, I did the for loop on ls because I wanted to limit it to .wav files. But that means the script chokes on file names with spaces unless you swap out the IFS variable.

I used LAME for the conversion.

Hadoop with Scala: hacking notes

I am trying to learn how to use Hadoop. I am trying to learn to program in Scala. I mostly forget how to program in Java. In this post I will take notes on things that come up as I try to get my frickin’ code to compile so I can run a Hadoop job.

There was a brief window in my life when I was becoming a good programmer. It was around the end of my second year as a professional software engineer that I could write original code to accomplish novel tasks.

Since then, the tools and my tasks have changed. For the most part, my coding has been about projects for classes, really just trying to get a basic competence in commodity open tools. So, my “programming” consists largely of cargo-culting code snippets and trying to get them to run in a slightly modified environment.

Right now I’ve got an SBT project; I’m trying to write a MapReduce job in Scala that will compile as a .jar that I can run on Hadoop.

One problem I’m having is there are apparently several different coding patterns for doing this, and several frameworks that are supposed to make my life easier. These include SMR, Shadoop, and Scalding. But since I’m doing this for a class and I actually want to learn something about how Hadoop works, I’m worried about working at too high a level of abstraction.

So I’m somewhat perversely taking the Scala Wordcount example from jweslley’s Shadoop and making it dumber. I.e., not using Shadoop.

One thing that has been confusing as hell is that Hadoop has a Mapper interface and a Mapper class, both with map() functions (1,2), but those functions have different type signatures.

I started working with some other code that used the second map() function. One of the arguments to this function is of type Mapper.Context. I.e., the Context class is a nested member of the Mapper class. Unfortunately, referencing this class within Scala is super hairy. I saw a code snippet that did this:

override def map(key:Object, value:Text, context:Mapper[Object,Text,Text,IntWritable]#Context) = {
    for (t <-  value.toString().split("\\s")) {
      word.set(t)
      context.write(word, one)
    }
  }

But I couldn’t get this thing to compile. Kept getting this awesome error:

type Context is not a member of org.apache.hadoop.mapred.Mapper[java.lang.Object,org.apache.hadoop.io.Text,org.apache.hadoop.io.Text,org.apache.hadoop.io.IntWritable]

Note the gnarliness here. It’s not super clear whether or how Context is parameterized by the type parameters of Mapper. The docs for the Mapper class make it seem like you can refer to Context without type parameterization within the code of the class extending Mapper. But I didn’t see that until I had deleted everything and tried a different track, which was to use the Mapper interface in a class extending MapReduceBase.
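One thing worth noting about the error above: it names org.apache.hadoop.mapred.Mapper, which is the older interface, whereas the nested Context class belongs to the Mapper class in org.apache.hadoop.mapreduce, so the import may be what trips the compiler. The `Outer[...]#Inner` syntax itself is a type projection, meaning "the Inner of any Outer with these type arguments". Here is a minimal self-contained sketch of that idea; `Job` and its `Context` are hypothetical stand-ins, not Hadoop classes.

```scala
object ProjectionSketch {
  // Hypothetical outer class with a nested class that uses its type parameter.
  class Job[K] {
    class Context {
      val written = scala.collection.mutable.ListBuffer.empty[K]
      def write(key: K): Unit = written += key
    }
    def newContext: Context = new Context
  }

  // Job[String]#Context accepts the Context of any Job[String] instance.
  def emit(context: Job[String]#Context, word: String): Unit =
    context.write(word)
}
```

The nested class picks up the outer type parameter through the projection, which is why `write` accepts a String here without `Context` declaring any type parameters of its own.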

Oddly, this interface hides the Context mechanic and instead introduces the Reporter class as a final argument to map(). I find this less intimidating for some reason. Probably because after years of working in Python and JavaScript my savviness around the Java type hierarchy is both rusty and obsolete. With the added type magicalness of Scala to add complexity to the mix, I think I’ve got to steer towards the dumbest implementation possible. And at the level I’m at, it looks like I don’t ever have to touch or think about this Reporter.

So, now starting with the example from Shadoop, now I just need to decode the Scala syntactic sugar that Shadoop provides to figure out what the hell is actually going on.

Consider:

  class Map extends MapReduceBase with Mapper[LongWritable, Text, Text, IntWritable] {

    val one = 1

    def map(key: LongWritable, value: Text, output: OutputCollector[Text, IntWritable], reporter: Reporter) =
      (value split " ") foreach (output collect (_, one))
  }

This is beautiful, concise code. But since I want to know something about the underlying program I’m going to uglify it by removing the implicit conversions provided by Shadoop.

The Shadoop page provides a Java equivalent for this, but that’s not really what I want either. For some reason I demand the mildly more concise syntax of Scala over Java but not the kind of condensed, beautiful syntax Scala makes possible with additional libraries.

This compiles at least:

  class Map extends MapReduceBase with Mapper[LongWritable, Text, Text,
 IntWritable] {   

    val one = new IntWritable(1); 

    def map(key: LongWritable, value: Text, output: OutputCollector[Text,
     IntWritable], reporter: Reporter) = {
      var line = value.toString();

      for(word <- line.split(" ")){
        output.collect(new Text(word), one)
      }
    }
  }

What I find a little counterintuitive about this is that the OutputCollector doesn’t act like a dictionary, overwriting the key-value pair with each call to collect(). I guess since I’m making a new Text object with each new entry, that makes sense even if the collector is implemented as a hash map of some kind. (Shadoop hides this mechanism with implicit conversions, which is rad of course.)
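As far as I understand the model, collect() just emits another (key, value) pair each time; nothing is overwritten, and the framework groups the pairs by key before they reach the reducer. A toy in-memory sketch of that flow, with no Hadoop involved and with hypothetical function names (`mapPhase`, `shuffle`, `reducePhase`):

```scala
object WordCountSketch {
  // "Map": emit one (word, 1) pair per word; duplicates are kept, not overwritten.
  def mapPhase(line: String): Seq[(String, Int)] =
    line.split(" ").toSeq.filter(_.nonEmpty).map(word => (word, 1))

  // "Shuffle": group the emitted values by key, as the framework does between phases.
  def shuffle(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (key, kvs) => key -> kvs.map(_._2) }

  // "Reduce": fold each key's values into a single count.
  def reducePhase(grouped: Map[String, Seq[Int]]): Map[String, Int] =
    grouped.map { case (key, values) => key -> values.reduceLeft(_ + _) }
}
```

Feeding the line "a b a" through the three steps emits two separate ("a", 1) pairs, which only become a count of 2 in the reduce step.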

Next comes the reducer. The Shadoop code is this:

def reduce(key: Text, values: Iterator[IntWritable],
            output: OutputCollector[Text, IntWritable], reporter: Reporter) = {
  val sum = values reduceLeft ((a: Int, b: Int) => a + b)
  output collect (key, sum)
}

Ok, so there’s a problem here. The whole point of using Scala to code a MapReduce job is so that you can use Scala’s built in reduceLeft function inside the reduce method of the Reducer. Because functional programming is awesome. By which I mean using built-in functions for things like map and reduce operations are awesome. And Scala supports functional programming, in at the very least that sense. And MapReduce as a computing framework is at least analogous to that paradigm in functional programming, and even has the same name. So, OMG.
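For what the analogy looks like with no Hadoop at all, here is a minimal sketch of word count on plain Scala collections. LocalWordCount, mapStep and reduceStep are names I made up for illustration; groupBy stands in, loosely, for the grouping the framework does between the two phases.

```scala
object LocalWordCount {
  // "Map" step: one (word, 1) pair per occurrence.
  def mapStep(lines: Seq[String]): Seq[(String, Int)] =
    lines.flatMap(_.split(" ").toSeq).filter(_.nonEmpty).map(word => (word, 1))

  // Grouping + "reduce" step: collect the pairs for each word,
  // then combine that word's counts with reduceLeft.
  def reduceStep(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (word, group) =>
      word -> group.map(_._2).reduceLeft((a: Int, b: Int) => a + b)
    }

  def main(args: Array[String]): Unit = {
    val counts = reduceStep(mapStep(Seq("the cat", "the hat")))
    println(counts("the")) // prints 2
  }
}
```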

Point being, no way in hell am I going to budge on this minor aesthetic point in my toy code. Instead, I’m going to brazenly pillage jweslley’s source code for the necessary implicit type conversion.

  implicit def javaIterator2Iterator[A](value: java.util.Iterator[A]) = new Iterator[A] {
    def hasNext = value.hasNext
    def next = value.next
  }

But not the other implicit conversions that would make my life easier. That would be too much.

Unfortunately, I couldn’t get this conversion to work right. Attempting to run the code gives me the following error:

[error] /home/cc/cs294/sp13/class/cs294-bg/hw3/wikilearn/src/main/scala/testIt/WordCount.scala:33: type mismatch;
[error]  found   : java.util.Iterator[org.apache.hadoop.io.IntWritable]
[error]  required: Iterator[org.apache.hadoop.io.IntWritable]
[error]       val sum = (values : scala.collection.Iterator[IntWritable]).reduceLeft (
[error]                  ^

It beats me why this doesn’t work. In my mental model of how implicit conversion is supposed to work, the java.util.Iterator[IntWritable] should be caught by the parameterized implicit conversion (which I defined within the Object scope) and converted no problemo.

I can’t find any easy explanation of this on-line at the moment. I suspect it’s a scoping issue or a limit to the parameterization of implicit conversions. Or maybe because Iterator is a trait, not a class? Instead I’m going to do the conversion explicitly in the method code.
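For reference, here is a self-contained sketch of the same kind of conversion outside Hadoop. It does not diagnose the error above. It only shows one arrangement that I would expect to compile: the conversion is declared before its use, has an explicit result type, and spells out both Iterator types in full so neither can be confused with the other. IteratorConversionSketch and sum are hypothetical names, and the scala.language import assumes Scala 2.10 or later.

```scala
import scala.language.implicitConversions

object IteratorConversionSketch {
  // Wraps a Java iterator in a Scala one. Both types are fully qualified
  // and the result type is written out explicitly.
  implicit def javaIterator2Iterator[A](value: java.util.Iterator[A]): scala.collection.Iterator[A] =
    new scala.collection.Iterator[A] {
      def hasNext: Boolean = value.hasNext
      def next(): A = value.next()
    }

  def sum(values: java.util.Iterator[Int]): Int = {
    // The declared type asks the compiler to look for a conversion
    // from java.util.Iterator to scala.collection.Iterator.
    val svals: scala.collection.Iterator[Int] = values
    svals.reduceLeft((a: Int, b: Int) => a + b)
  }

  def main(args: Array[String]): Unit = {
    val javaList = new java.util.ArrayList[Int]()
    javaList.add(1); javaList.add(2); javaList.add(3)
    println(sum(javaList.iterator())) // prints 6
  }
}
```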

After fussing around for a bit, I got:

    def reduce(key: Text, values: java.util.Iterator[IntWritable],
      output: OutputCollector[Text, IntWritable], reporter: Reporter) = {
      val svals = new scala.collection.Iterator[IntWritable]{
        def hasNext = values.hasNext
        def next = values.next
      }
      val sum = (svals : scala.collection.Iterator[IntWritable]).reduceLeft (
        (a: IntWritable, b: IntWritable) => new IntWritable(a.get() + b.get())
      )
      output collect (key, sum)
    }

…or, equivalently and more cleanly:

    def reduce(key: Text, values: java.util.Iterator[IntWritable],
      output: OutputCollector[Text, IntWritable], reporter: Reporter) = {

      val svals = new scala.collection.Iterator[Int]{
        def hasNext = values.hasNext
        def next = values.next.get
      }

      val sum = (svals : scala.collection.Iterator[Int]).reduceLeft (
        (a: Int, b: Int) => a + b
      )
      output collect (key, new IntWritable(sum))
    }
  }

I find the Scala syntax for implementing abstract methods on the fly here pretty great (I hadn’t encountered it before). Iterator[A] is a trait whose next and hasNext methods are abstract, so you define them inside the curly braces after new. What an elegant way to let people implement a trait in an ad hoc way!
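A minimal sketch of the same pattern away from Hadoop: only hasNext and next are written by hand, and everything else on Iterator (toList, reduceLeft and so on) comes along for free. AnonymousIteratorSketch and countdown are names invented for this example.

```scala
object AnonymousIteratorSketch {
  // Counts down from n to 1 with an anonymous, on-the-spot Iterator.
  def countdown(n: Int): Iterator[Int] = new Iterator[Int] {
    private var remaining = n
    def hasNext: Boolean = remaining > 0
    def next(): Int = {
      val current = remaining
      remaining -= 1
      current
    }
  }

  def main(args: Array[String]): Unit = {
    println(countdown(3).toList)             // prints List(3, 2, 1)
    println(countdown(4).reduceLeft(_ + _))  // prints 10
  }
}
```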

There’s one more compile error I had to bust around. This line was giving me noise:

conf setOutputFormat classOf[TextOutputFormat[_ <: WritableComparable, _ <: Writable]]

It was complaining that WritableComparable needed a type parameter. Not confident I could figure out exactly which parameter to set, I just made the signature tighter.

conf setOutputFormat classOf[TextOutputFormat[Text, IntWritable]]

Only then did I learn that JobConf is a deprecated way of defining jobs. So I rewrote the WordCount object into a class implementing the Tool interface, using this Java snippet as an example to work from. To do that, I had to learn that to write a class that extends two interfaces in Scala, you need to use an “extends X with Y” syntax. Also, for trivial conditionals Scala dispenses with Java’s ternary X ? Y : Z operator in favor of a single line if (X) Y else Z. Though I will miss the evocative use of punctuation in the ternary construct, I’ve got to admit that Scala is keeping it real classy here.
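Both bits of syntax fit in a few lines. This is only a sketch: ToyConfigured and ToyTool are local stand-ins I made up so the example compiles on its own, not the Hadoop types, and the exit codes are arbitrary.

```scala
object ToolSketch {
  // Hypothetical stand-ins for the two types being combined.
  trait ToyConfigured { def jobName: String = "wordcount" }
  trait ToyTool { def run(args: Array[String]): Int }

  // "extends X with Y": the first parent follows extends, the rest follow with.
  class WordCountTool extends ToyConfigured with ToyTool {
    // if/else is an expression in Scala, so it does the job of Java's X ? Y : Z.
    def run(args: Array[String]): Int =
      if (args.length == 2) 0 else 2
  }

  def main(args: Array[String]): Unit =
    println(new WordCountTool().run(Array("in", "out"))) // prints 0
}
```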

Wait…ok, so I just learned that most of the code I was cargo culting was part of the deprecated coding pattern, which means I now have to switch it over to the new API. I learned this from somebody helpful in the #hadoop IRC channel.

[23:31]  what's the deal with org.apache.hadoop.mapreduce.Mapper and org.apache.hadoop.mapred.Mapper ??
[23:31]  is one of them deprecated?
[23:31]  which should I be using?
[23:32]  sbenthall: Use the new API
[23:32]  sbenthall: (i.e. mapreduce.*) both are currently supported but eventually the (mapred.*) may get deprecated
[23:32]  ok thanks QwertyM
[23:33]  sbenthall: as a reference, HBase uses mapreduce.* APIs completely for its provided MR jobs; and I believe Pig too uses the new APIs
[23:33]  is MapReduceBase part of the old API?
[23:33]  sbenthall: yes, its under the mapred.* package
[23:33]  ok, thanks.

Parachuting into the middle of a project has its drawbacks, but it’s always nice when a helpful community member can get you up to speed. Even if you’re asking near midnight on a Sunday.

Wait. I realize now that I’ve come full circle.

See, I’ve been writing these notes over the course of several days. Only just now am I realizing that I’m now going back to where I started, with the Mapper class that takes the Context parameter that was giving me noise.

Looking back at the original error, it looks like that too was a result of mixing two APIs. So maybe I can now safely shift everything BACK to the new API, drawing heavily on this code.

It occurs to me that this is one of those humbling programming experiences when you discover that the reason why your thing was broken was not the profound complexity of the tool you were working with, but your own stupidity over something trivial. This happens to me all the time.

Thankfully, I can’t ponder that much now, since it’s become clear that the instructional Hadoop cluster on which we’ve been encouraged to do our work is highly unstable. So I’m going to take the bet that it will be more productive for me to work locally, even if that means installing Hadoop locally on my Ubuntu machine.

I thought I was doing pretty good with the installation until I got to the point of hitting the “ON” switch on Hadoop. I got this:

sb@lebensvelt:~$ /usr/local/hadoop/bin/start-all.sh 
Warning: $HADOOP_HOME is deprecated.

starting namenode, logging to /usr/local/hadoop/libexec/../logs/hadoop-sb-namenode-lebensvelt.out
localhost: ssh: connect to host localhost port 22: Connection refused
localhost: ssh: connect to host localhost port 22: Connection refused
starting jobtracker, logging to /usr/local/hadoop/libexec/../logs/hadoop-sb-jobtracker-lebensvelt.out
localhost: ssh: connect to host localhost port 22: Connection refused

I googled around and it looks like this problem is due to not having an SSH server running locally. Since I’m running Ubuntu, I went ahead and followed these instructions. In the process I managed to convince my computer that I was undergoing a man-in-the-middle attack between myself and myself.

I fixed that with

$ ssh-keygen -R localhost

and successfully got Hadoop running with

$ /usr/local/hadoop/bin/start-all.sh 

only to be hung up on this error

$ hadoop fs -ls
Warning: $HADOOP_HOME is deprecated.

ls: Cannot access .: No such file or directory.

which somebody who runs an Indian matrimony search engine had run into and documented the fix for. (Right way to spell it is

hadoop fs -ls .

With an extra dot.)

There’s a point to me writing all this out, by the way. An important part of participation in open source software, or the hacker ethic in general, is documenting one’s steps so that others who follow the same paths can benefit from what you’ve gone through. I’m going into a bit more detail about this process than is really helpful because in my academic role I’m dealing with a lot of social scientist types who really don’t know what this kind of work entails. Let’s face it: programming is a widely misunderstood discipline which seems like an utter mystery to those that aren’t deeply involved in it. Much of this has to do with the technical opacity of the work. But another part of why it’s misunderstood is that problem solving in the course of development depends on a vast and counter-intuitive cyberspace of documentation (often generated from code comments, so written by some core developer), random blog posts, chat room conversations, and forum threads. Easily 80% of the work when starting out on a new project like this is wrestling with all the minutiae of configuration on a particular system (operating system and hardware contingent, in many cases) and the idioms of the programming language and environment.

The amount of time it takes to invest in any particular language or toolkit necessarily creates a tribalism among developers because their identities wind up being intertwined with the tools they use. As I hack on this thing, however incompetently, I’m becoming a Scala developer. That’s similar to saying that I’m becoming a German speaker. My conceptual vocabulary, once I’ve learned how to get things done in Scala, is going to be different than it was before. In fact, that’s one of the reasons why I’m insisting on teaching myself Scala in the first place–because I know that it is a conceptually deep and rigorous language which will have something to teach me about the Nature of Things.

Some folks in my department are puzzled at the idea that technical choices in software development might be construed as ethical choices by the developers themselves. Maybe it’s easier to understand that if you see that in choosing a programming language you are in many ways choosing an ontology or theoretical framework through which to conceive of problem-solving. Of course, choice of ontology will influence one’s ethical principles, right?

But I digress.

So I have Hadoop running on my laptop now, and a .jar file that compiles in SBT. So now all I need to do is run the .jar using the hadoop jar command, right?

Nope, not yet…

Exception in thread "main" java.lang.NoClassDefFoundError: scala/ScalaObject

OK, so the problem is that I haven’t included scala-library.jar on my Hadoop runtime classpath.

I solved this by making a symbolic link from the Hadoop /lib directory to the .jar in my Scala installation.

ln -s /usr/local/share/scala-2.9.2/lib/scala-library.jar /usr/local/hadoop/lib/scala-library.jar

That seemed to work, only now I have the most mundane and inscrutable of Java errors to deal with:

Exception in thread "main" java.lang.NullPointerException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:601)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

I had no idea how to proceed from here. This time, no helpful folks in the #hadoop channel helped me either.

So once again I switched to a new hunk of code to work from, this time the WordCount.scala file from Derrick Cheng’s ScalaOnHadoop project. Derrick posted a link to this project on our course’s Piazza earlier, which was awesome.

Another digression. There’s a lot of talk now about on-line education. People within the university context feel vaguely threatened by MOOCs, believing that there will be a superstar effect that advantages first movers, but many are uncomfortable making that first move. Taek Lim in my department is starting to study user interfaces to support collaboration in on-line learning.

My own two cents on this are that the open software model is about as dynamic a collaborative environment as you can get. At the point when people started to use our course discussion management system, Piazza, as a forum to discuss the assignment, almost as if it were an open source mailing list, we started to get a lot more out of it and learned a lot from each other. I’m not the first person to see this potential of the open source model for education, of course. I’m excited to be attending the POSSE workshop, which is about that intersection, this summer. At this rate, it looks like I will be co-teaching a course along these lines in the Fall targeted at the I School Masters students, which is exciting!

So, anyway, I’m patching Derrick Cheng’s code. I’m not working with BIDMat yet so I’m leaving that out of the build, so I remove all references to that and I get the thing to compile…and run! I got somebody else’s Scala WordCount to run!

This seems like such a triumph. It’s taken a lot of tinkering and beating my head against a wall to get this far. (Not all at once: I’ve written this post over the course of several days. I realize that it’s a little absurd.)

But wait! I’m not out of the woods yet. I check the output of my MapReduce job with

hadoop fs -cat test-output/part-r-00000

and the end of my output file looks like this:

§	1
§	1
§	1
§	1
§	1
§	1
§	1
§	1
§	1
Æschylus	1
Æsop	1
Æsop	1
Æsop	1
É	1
Élus,_	1
à	1
à	1
à	1
æons	1
æsthetic	1
è	1
è	1
è	1
état_.	1
The	1
The	1

What’s going on here? Well, it looks like I successfully mapped each occurrence of each word in the original text to a key-value pair. But something went wrong in the reduce step, which was supposed to combine all the occurrences into a single count for each word.

That is just fine for me, because I’d rather be using my own Reduce function. Because it uses Scala’s functional reduceLeft, which is sweet! Why even write a MapReduce job in a functional programming language if you can’t use a built-in language reduce in the Reduce step?

Ok, mine doesn’t work either.

Apparently, the reason for this is that the type signature I’ve been using for the Reducer’s reduce method has been wrong all along. And when that happens, the code compiles but Reducer runs its default reduce function, which is the identity function.
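Here is a toy reconstruction of that failure mode, with Hadoop left out entirely. MiniReducer is a hypothetical miniature of a framework base class (not Hadoop’s Reducer) whose default reduce passes values through untouched. A subclass method with different parameter types compiles as a brand new overload and is never called, while writing override makes the compiler check the signature against the base class.

```scala
object OverrideSketch {
  // Hypothetical miniature of a framework base class.
  class MiniReducer {
    // Default behaviour: pass every value through unchanged (identity).
    def reduce(key: String, values: Iterable[Int]): List[(String, Int)] =
      values.toList.map(v => (key, v))

    // The "framework" only ever calls the Iterable version.
    def run(key: String, values: List[Int]): List[(String, Int)] =
      reduce(key, values)
  }

  // Compiles, but takes an Iterator: a new overload that run() never reaches.
  class WrongSum extends MiniReducer {
    def reduce(key: String, values: Iterator[Int]): List[(String, Int)] =
      List((key, values.reduceLeft(_ + _)))
  }

  // Same parameter types as the base method, checked by the override keyword.
  class RightSum extends MiniReducer {
    override def reduce(key: String, values: Iterable[Int]): List[(String, Int)] =
      List((key, values.reduceLeft(_ + _)))
  }

  def main(args: Array[String]): Unit = {
    println(new WrongSum().run("the", List(1, 1, 1))) // prints List((the,1), (the,1), (the,1))
    println(new RightSum().run("the", List(1, 1, 1))) // prints List((the,3))
  }
}
```

Had WrongSum been written with override, the compiler would have rejected it for overriding nothing, which is a cheap way to catch this class of mistake early.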

It’s almost (almost!) as if it would have made more sense to start by just reading the docs and following the instructions.

Now I have edited the map and reduce functions so that they have the right type signatures. To get this right exactly, I looked at a different file. I also tinkered

At last, it works.

Now, at last, I understand how this Context member class works. The problem was that I was trying to use it with the mapred.Mapper class from the old API. So much of my goose chase was due to not reading things carefully enough.

On the other hand, I feel enriched by the whole meandering process. Realizing that my programming faults were mine and not due to the complexity of the tools I was using paradoxically gives me more confidence in my understanding of the tools moving forward. And engaging with the distributed expertise on the subject–through StackOverflow, documentation generated from the original coders, helpful folks on IRC, blog posts, and so on–is much more compelling when one is driven by concrete problem-solving goals, even when those goals are somewhat arbitrary. Had I learned to use Hadoop in a less circuitous way, my understanding would probably be much more brittle. I am integrating new knowledge of Hadoop, Scala, and Java (it’s been a long time) with existing background knowledge. After a good night’s sleep, with any luck it will be part of my lifeworld!

This is the code I wound up with, by the way. I don’t suggest you use it.

the technical political spectrum?

Since the French Revolution, we have had the Left/Right divide in politics.

Probably seven or so years ago, some people got excited about thinking about a two-dimensional political spectrum. There were Economic and Social dimensions. You could be in one of four quadrants: Libertarian, Social Democrat, Totalitarian, or Conservative.

Technology is getting more political and politicized. Have we figured out the spectrum yet?

Because there’s been a lot of noise about their beef, let’s assume as a first pass that O’Reilly and Morozov give us some sense of the space. The problem is that there’s a good chance the “debate” between them is giving off a lot more heat than light, so it’s not clear if there’s a substantive political difference.

Let me try to take a constructive crack at it. I don’t think I’m going to get it right, but I’m curious to know how much this resonates and if others would map things differently.

A two-dimensional representation of the continuum of technical politics, with unscientifically plotted representatives

Some people think that “technology”, by which most people mean technology companies, should be replacing more and more of the functions of government. I think the peer progressives are in this camp, as are the institutionalized nudgers in the UK Conservative party, who would prefer to shrink the state. There’s a fair argument that the “open government” people are trying to shrink government by giving non-state actors the ability to provide services that the state might otherwise provide. Through free flow of information and greater connectivity, we can spur vibrancy in civil society and perfect the market.

Others think that the state needs to have a strong role in regulating technology companies to make sure they don’t abuse their power. There’s a lot of that going around in my department at UC Berkeley. These people see the democratic state as the best representative of citizens’ interests. The FTC and Congress need to help ensure, e.g., people’s privacy. Maybe Morozov is in here somewhere. Monopoly concentrations of technical power are threatening to the public interest; technical platforms should be decentralized and controlled so that politics is not overwhelmed by an illegitimate technocracy.

Another powerful group, the Copyright lobby, is economically threatened by new technology and so wants to restrict its use. Telecom companies would like to effectively meter flow of information. Maybe it’s a stretch, but perhaps we could include the military-industrial complex and its desire to instrument the Web for surveillance purposes in this camp as well. These groups tend to not want technology to change, or to tightly control that technology.

Then there’s the Free Software movement. And Stanford’s Liberation Technology folks, if I understand them correctly. And maybe Anonymous is in here somewhere. Pro-technology, generally skeptical of both state and corporate interests.

So maybe what’s going on is that we have a two-dimensional political space.

In one dimension, we have Centralization versus Decentralization. Richly interconnected platforms managed by an elite with tight arrangements for data sharing, versus a much more loosely connected set of networks where the lines of power are less clear.

In the other dimension, we have Unrestricted versus Controlled. Either the technical organizations should be free to pursue their own interests, or they should be regulated by non- (or at least less) technical political forces, such as the state.

What do you think?

the social intelligence of spotted hyenas

The best thing I did today was stop by for the beginning of Kay Holekamp’s talk on “Social Complexity and the Evolution of Intelligence.”

Her work involves researching spotted hyenas.

Spotted hyenas live in clans of about a hundred hyenas, which contain several matrilineal kinship groups each. Female hyenas have an observable social hierarchy that is caused by and a cause of survival “fitness”. Male hyenas migrate to a different clan before reproducing.

This is very similar to the social structure of certain primates, like baboons.  It is nothing like the social structure of cats and dogs (hyenas are somewhere in between the two, closer to cats.)

What’s interesting about the research is that, without exception, results about the social cognitive capabilities of primates are reproducible in spotted hyenas.

That means that the same capacities for social intelligence have been achieved by multiple species through convergent evolution.
