Digifesto

Category: programming

programming and philosophy of science

Philosophy of science is a branch of philosophy largely devoted to the demarcation problem: what is science?

I’ve written elsewhere about why and how, in the social sciences, demarcation is highly politicized and often under attack. This is becoming especially pertinent now as computational methods become dominant across many fields and challenge the bases of disciplinary distinction. Today, a lot of energy (at UC Berkeley at least) goes into maintaining the disciplinary social sciences, even when preserving atavistic disciplinary traits creates social fields that are less scientific than they could be.

Other energy (also at UC Berkeley, and elsewhere) goes into using computer programs to explore data about the social world in an undisciplinary way. This isn’t to say that specific theoretical lenses don’t inform these studies. Rather, the lenses are used provisionally and not in an exclusive way. This lack of disciplinary attachment is an important aspect of data science as applied to the social world.

One reason why disciplinary lenses are not very useful for the practicing data scientist is that, much like natural scientists, data scientists are more often than not engaged in technical inquiry whose purpose is prediction and control. This is very different from, for example, engaging an academic community in a conversation in a language they understand or that pays appropriate homage to a particular scholarly canon–the sort of thing one needs to do to be successful in an academic context. For much academic work, especially in the social sciences, the process of research publication, citation, and promotion is inherently political.

More often than not, these politics are not an essential part of scientific inquiry itself; rather, they have to do with the allocation of what Bourdieu calls temporal capital: grant funding, access, appointments, and so on within the academic field. Scientific capital, the symbolic capital awarded to scientists based on their contributions to trans-historical knowledge, is awarded more for the success of an idea than for, say, brown-nosing one’s superiors. However, since temporal capital in the academy is organized by disciplines as a function of university bureaucratic organization, academic researchers are required to contort the presentation of their work to disciplinary requirements.

Contrast this with the work of analysing social data using computers. The tools used by computational social scientists tend to be products of the exact sciences (mathematics, statistics, computer science) with no further disciplinary baggage. The intellectual work of scientifically devising and testing theories against the data happens in a language most academic communities would not recognize as a language at all, and certainly not their language. While this work depends on the work of thousands of others who have built vast libraries of functional code, these ubiquitous contributors are not included in a social science discipline’s scholarly canon. They are uncited, taken for granted.

However, when those libraries are made openly available (and they often are), they participate in a larger open source ecosystem of tools whose merits are judged by their practical value. Returning to our theme of the demarcation problem, the question is: is this science?

I would answer: emphatically yes. Programming is science because, as Peter Naur has argued, programming is theory building (hat tip the inimitable Spiros Eliopoulos for the reference). The more deeply we look into the demarcation problem, the more clearly software engineering practice comes into focus as an extension of a scientific method of hypothesis generation and testing. Software is an articulation of ideas, and the combined works of software engineers are a cumulative science that has extended far beyond the bounds of the university.

The node.js fork — something new to think about

For Classics we are reading Albert Hirschman’s Exit, Voice, and Loyalty. Oddly, though normally I hear about ‘voice’ as an action from within an organization, the first few chapters of the book (including the introduction of the Voice concept itself) are preoccupied with elaborations on the neoclassical market mechanism. Not what I expected.

I’m looking for interesting research use cases for BigBang, which is about analyzing the sociotechnical dynamics of collaboration. I’m building it to better understand open source software development communities, primarily. This is because I want to create a harmonious sociotechnical superintelligence to take over the world.

For a while I’ve been interested in Hadoop as a case of one software project with two companies working together to build it. This is reminiscent (for me) of when we started GeoExt at OpenGeo and Camp2Camp. The economics of shared capital are fascinating, and there are interesting questions about how human resources get organized in that sort of situation. In my experience, a tension develops between the needs of firms to differentiate their products and make good on their contracts and the needs of the developer community, whose collective value is ultimately tied to the robustness of their technology.

Unfortunately, building out BigBang to integrate with various email, version control, and issue tracking backends is a lot of work, and there’s only one of me right now to build the infrastructure, do the research, and train new collaborators (who are starting to do some awesome work, so this is paying off). While integrating with Apache’s infrastructure would have been a smart first move, instead I chose to focus on Mailman archives and git repositories. Google Groups and whatever Apache is using for their email lists do not publish their archives in .mbox format, which is a pain for me. But luckily Google Takeout does export data from folks’ on-line inboxes in .mbox format. This is great for BigBang because it means we can investigate email data from any project for which we know an insider willing to share their records.

Does a research ethics issue arise when you start working with email that is openly archived in a difficult format, then exported from somebody’s private email? Technically you get header information that wasn’t open before–perhaps it was ‘private’. But arguably this header information isn’t personal information. I think I’m still in the clear. Plus, IRB will be irrelevant when the robots take over.

All of this is a long way of getting around to talking about a new thing I’m wondering about, the Node.js fork. It’s interesting to think about open source software forks in light of Hirschman’s concepts of Exit and Voice, since so much of the activity of open source development is open, virtual communication. While you might at first think a software fork is definitely a kind of Exit, it sounds like IO.js was perhaps a friendly fork by somebody who just wanted to hack around. In theory, code can be shared between forks–in fact this was the principle that GitHub’s forking system was founded on. So there are open questions (for me, at least, as somebody who isn’t involved in the Node.js community at all and is just now beginning to wonder about it): to what extent is a fork a real event in the history of the project, to what extent is it mythological, and to what extent is it a reification of something that was already implicit in the project’s sociotechnical structure? There are probably other great questions here as well.

A friend on the inside tells me all the action on this happened (is happening?) on the GitHub issue tracker, which is definitely data we want to get BigBang connected with. Blissfully, there appear to be well supported Python libraries for working with the GitHub API. I expect the first big hurdle we hit here will be rate limiting.

Though we haven’t been able to make integration work yet, I’m still hoping there’s some way we can work with MetricsGrimoire. They’ve been a super inviting community so far. But our software stacks and architecture are just different enough, and the layers we’ve built so far thin enough, that it’s hard to see how to do the merge. A major difference is that MetricsGrimoire tools are built to provide application interfaces around a MySQL data backend, whereas BigBang is foremost about scientific analysis, so our whole data pipeline is built to get things into Pandas dataframes. Both projects are in Python. This too is a weird microcosm of the larger sociotechnical ecosystem of software production, of which the “open” side is only one (important) part.

Imre Lakatos and programming as dialectic

My dissertation is about the role of software in scholarly communication. Specifically, I’m interested in the way software code is itself a kind of scholarly communication, and how the informal communications around software production represent and constitute communities of scientists. I see science as a cognitive task accomplished by the sociotechnical system of science, including both scientists and their infrastructure. Looking particularly at scientists’ use of communications infrastructure such as email, issue trackers, and version control, I hope to study the mechanisms of the scientific process much like a neuroscientist studies the mechanisms of the mind by studying neural architecture and brainwave activity.

To get a grip on this problem I’ve been building BigBang, a tool for collecting data from open source projects and readying it for scientific analysis.

I have also been reading background literature to give my dissertation work theoretical heft and to procrastinate from coding. This is why I have been reading Imre Lakatos’ Proofs and Refutations (1976).

Proofs and Refutations is a brilliantly written book about the history of mathematical proof. In particular, it is an analysis of informal mathematics through an investigation of the letters written by mathematicians working on proofs about the Euler characteristic of polyhedra in the 18th and 19th centuries.

In the early 20th century, formal logic was axiomatized, building on the work of Russell, Whitehead, and others; prior to this, mathematical argumentation had less formal grounding. As a result, mathematicians would argue not just substantively about the theorem they were trying to prove or disprove, but also about what constitutes a proof, a conjecture, or a theorem in the first place. Lakatos demonstrates this by condensing 200+ years of scholarly communication into a fictional, impassioned classroom dialog where characters representing mathematicians throughout history banter about polyhedra and proof techniques.

What’s fascinating is how convincingly Lakatos presents the progress of mathematical understanding as an example of dialectical logic. Though he doesn’t use the word “dialectical” as far as I’m aware, he tells the story of the informal logic of pre-Russellian mathematics through dialog. But this dialog is designed to capture the timeless logic behind what’s been said before. It takes the reader through the thought process of mathematical discovery in abbreviated form.

I’ve had conversations with serious historians and ethnographers of science who would object strongly to the idea of a history of a scientific discipline reflecting a “timeless logic”. Historians are apt to think that nothing is timeless. I’m inclined to think that the objectivity of logic persists over time much the same way that it persists over space and between subjects, even illogical ones, hence its power. These are perhaps theological questions.

What I’d like to argue (but am not sure how) is that the process of informal mathematics presented by Lakatos is strikingly similar to that used by software engineers. The process of selecting a conjecture, then of writing a proof (which for Lakatos is a logical argument whether or not it is sound or valid), then having it critiqued with counterexamples, which may either be global (counter to the original conjecture) or local (counter to a lemma), then modifying the proof, then perhaps starting from scratch based on a new insight… all this reads uncannily like the process of debugging source code.
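To make the analogy concrete, here is a toy illustration of my own (not one of Lakatos’s examples), written in Scala: a conjecture, a counterexample that refutes it, and a patched “proof” that incorporates the lesson of the counterexample. It has the same rhythm as a debugging session.

object ProofsAndRefutations {
  // Conjecture: mid(a, b) always lies between a and b.
  def midNaive(a: Int, b: Int): Int = (a + b) / 2

  // A counterexample (a "global" one, in Lakatos's terms): for large inputs
  // the sum overflows, and the "midpoint" of two positive numbers comes out negative.
  val refutation = midNaive(Int.MaxValue, Int.MaxValue)  // evaluates to -1

  // The patched proof: reformulate the construction so the hidden lemma
  // ("a + b does not overflow") is no longer needed. Holds for 0 <= a <= b.
  def mid(a: Int, b: Int): Int = a + (b - a) / 2
}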

The argument for this correspondence is strengthened by later work in theory of computation and complexity theory. I learned this theory so long ago I forget who to attribute it to, but much of the foundational work in computer science was the establishment of a correspondence between classes of formal logic and classes of programming languages. So in a sense it’s uncontroversial within computer science to consider programs to be proofs.

As I write I am unsure whether I’m simply restating what’s obvious to computer scientists in an antiquated philosophical language (a danger I feel every time I read a book, lately) or if I’m capturing something that could be an interesting synthesis. But my point is this: if programming language design and the construction of progressively more powerful software libraries are akin to the expansion of formal mathematical knowledge from axiomatic grounds, then the act of programming itself is much more like the informal mathematics of pre-Russellian mathematics. Specifically, it is unaxiomatic, and proofs are in play without necessarily being sound. When we use a software system, we are depending necessarily on a system of imperfected proofs that we fix iteratively through discovered counterexamples (bugs).

Is it fair to say, then, that whereas the logic of software is formal, deductive logic, the logic of programming is dialectical logic?

Bear with me; let’s presume it is. That’s a foundational idea of my dissertation work. Proving or disproving it may or may not be out of scope of the dissertation itself, but it’s where it’s ultimately headed.

The question is whether it is possible to develop a formal understanding of dialectical logic through a scientific analysis of software collaboration (see a mathematical model of collective creativity). If this could be done, then we could build better software or protocols to assist this dialectical process.

Dissertron build notes

I’m going to start building the Dissertron now. These are my notes.

  • I’m going with Hyde as a static site generator on Nick Doty’s recommendation. It appears to be tracking Jekyll in terms of features, but squares better with my Python/Django background (it uses Jinja2 templates in its current, possibly-1.0-but-under-development version). Meanwhile, at Berkeley we seem to be investing a lot in Python as the language of scientific computing. If scientists’ skills should be transferable to their publication tools, this seems like the way to go.
  • Documentation for Hyde is a bit scattered. This first steps guide is sort of helpful, and then there are these docs hosted on Github. As mentioned, they’ve moved away from Django templates to Jinja2, which is similar but less idiosyncratic. They refer you to the Jinja2 docs here for templating.
  • Just trying to make a Hello World type site, I ran into an issue with Markdown rendering. I’ve filed an issue with the project, and will use it as a test of the community’s responsiveness. Since Hyde is competing with a lot of other Python static site generators, it’s kind of nice to bump into this kind of thing early.
  • Got this response from the creator of Hyde in less than 3 hours. Problem was with my Jinja2 fu (which is weak at the moment)–turns out I have a lot to learn about Whitespace Control. Super positive community experience. I’ll stick with Hyde.
  • “Hello World” intact and framework chosen, my next step is to convert part 2 of my Weird Twitter work to Markdown and use Hyde’s tools to give it some decent layout. If I can make some headway on the citation formatting and management in the process, so much the better.

@#$%! : variance annotations in Scala’s unsound parameterized types

[error] /home/sb/ischool/cs294/hw3/src/main/scala/TestScript.scala:32: type mismatch;
[error] found : Array[wikilearn.TestScript.parser.Page]
[error] required: Array[wikilearn.WikiParser#Page]
[error] Note: wikilearn.TestScript.parser.Page <: wikilearn.WikiParser#Page, but class Array is invariant in type T.
[error] You may wish to investigate a wildcard type such as `_ <: wikilearn.WikiParser#Page`. (SLS 3.2.10)

wtf, Scala.  You know exactly what I’m trying to do here.

EDIT: I sent a link to the above post to David Winslow. He responded with a crystal clear explanation that was so great I asked him if I could include it here. This is it, below:

It’s a feature, not a bug :) This is actually the specific issue that Dart had in mind when they put this note in the language spec:

The type system is unsound, due to the covariance of generic types. This is a deliberate choice (and undoubtedly controversial). Experience has shown that sound type rules for generics fly in the face of programmer intuition. It is easy for tools to provide a sound type analysis if they choose, which may be useful for tasks like refactoring.

Which of course caused some hubbub among the static typing crowd.

The whole issue comes down to the variance annotations of type parameters. Variance influences how type parameters relate to the subtyping relationships of parameterized types:

Given types A and B, where A is a supertype of B:

  • trait Invariant[T] means there is no subtype relationship between Invariant[A] and Invariant[B] (either could be used as an Invariant[_], though).
  • trait Covariant[+T] means Covariant[A] is a supertype of Covariant[B].
  • trait Contravariant[-T] means Contravariant[A] is a subtype of Contravariant[B].

The basic rule of thumb is that if you produce values of type T, you can be covariant in T, and if you consume values of type U, you can be contravariant in type U. For example, Function1 has two type parameters, the parameter type A and the result type T. It is contravariant in A and covariant in T. An (Any => String) can be used where a (String => Any) is expected, but not the other way around.

So, what about the type parameter for Array[T]? Among other operations, Arrays provide:

class Array[T] {
  def apply(i: Int): T // "producing" a T
  def update(i: Int, t: T): Unit // "consuming" a T
}

When the type parameter appears in both covariant and contravariant positions, the only option is to make it invariant.

Now, it’s interesting to note that in the Java language Arrays are treated as if they are covariant. This means that you can write a Java program that doesn’t use casts, passes the typechecker, and generates a type error at runtime; the body of main() would look like:

String[] strings = new String[1];
Object[] objects = strings;
objects[0] = Integer.valueOf(0); // the runtime error occurs at this step, but even if it didn't: 
System.out.println(strings[0]); // what happens here?

Anyway, the upshot is that immutable collections only use their type parameters in covariant positions (you can get values out, but never insert), so they are much handier. Does your code work better if you replace your usage of Array with Vector? Alternatively, you can always provide the type parameter when you construct your array. Array("") is an Array[String], but Array[AnyRef]("") is an Array[AnyRef].
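To see David’s points in running code, here is a quick sketch of my own (the names are mine, not his):

object VarianceSketch {
  // Function1 is contravariant in its argument and covariant in its result,
  // so an (Any => String) can stand in for a (String => Any):
  val anyToString: Any => String = _.toString
  val stringToAny: String => Any = anyToString      // compiles

  // Vector[+A] is covariant, so a Vector[String] can be used as a Vector[Any]:
  val strings: Vector[String] = Vector("a", "b")
  val anys: Vector[Any] = strings                   // compiles

  // Array[T] is invariant, so the analogous assignment is rejected:
  val strArray: Array[String] = Array("a", "b")
  // val anyArray: Array[Any] = strArray            // does not compile

  // Supplying the type parameter up front sidesteps the issue:
  val anyRefs: Array[AnyRef] = Array[AnyRef]("a", "b")
}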

Bash script for converting all .wav files in a directory to .mp3

I’ve been working with music files lately trying to get Steve Morrell‘s music online. In the process I’ve had to convert his albums, which I’ve ripped in .wav format, to .mp3.

To accomplish this, I’ve written a short bash script. It requires a number of tricks I wasn’t familiar with and had to look up.

#!/bin/bash

# Swap out IFS so that filenames containing spaces survive word splitting
SAVEIFS=$IFS
IFS=$(echo -en "\n\b")

for file in $(ls *.wav)
do
  # strip the .wav extension to build the output filename
  name=${file%%.wav}
  # encode with LAME using high-quality VBR settings
  lame -V0 -h -b 160 --vbr-new "$name.wav" "$name.mp3"
done

IFS=$SAVEIFS

Though it isn’t recommended, I did the for loop on ls because I wanted to limit it to .wav files. But that means the script chokes on file names with spaces unless you swap out the IFS variable.

I used LAME for the conversion.

Hadoop with Scala: hacking notes

I am trying to learn how to use Hadoop. I am trying to learn to program in Scala. I mostly forget how to program in Java. In this post I will take notes on things that come up as I try to get my frickin’ code to compile so I can run a Hadoop job.

There was a brief window in my life when I was becoming a good programmer. It was around the end of my second year as a professional software engineer that I could write original code to accomplish novel tasks.

Since then, the tools and my tasks have changed. For the most part, my coding has been about projects for classes, really just trying to get a basic competence in commodity open tools. So, my “programming” consists largely of cargo-culting code snippets and trying to get them to run in a slightly modified environment.

Right now I’ve got an SBT project; I’m trying to write a MapReduce job in Scala that will compile as a .jar that I can run on Hadoop.

One problem I’m having is there are apparently several different coding patterns for doing this, and several frameworks that are supposed to make my life easier. These include SMR, Shadoop, and Scalding. But since I’m doing this for a class and I actually want to learn something about how Hadoop works, I’m worried about working at too high a level of abstraction.

So I’m somewhat perversely taking the Scala Wordcount example from jweslley’s Shadoop and making it dumber. I.e., not using Shadoop.

One thing that has been confusing as hell is that Hadoop has a Mapper interface and a Mapper class, both with map() functions (1, 2), but those functions have different type signatures.
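To keep them straight, here is a side-by-side paraphrase of the two signatures as I understand them (not copied verbatim from the Javadocs):

// 1. org.apache.hadoop.mapred.Mapper is an interface; its map() takes an OutputCollector and a Reporter:
//    def map(key: K1, value: V1, output: OutputCollector[K2, V2], reporter: Reporter): Unit

// 2. org.apache.hadoop.mapreduce.Mapper is a class with a nested Context type; its map() takes a single Context argument:
//    def map(key: KEYIN, value: VALUEIN, context: Mapper[KEYIN, VALUEIN, KEYOUT, VALUEOUT]#Context): Unit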

I started working with some other code that used the second map() function. One of the arguments to this function is of type Mapper.Context. I.e., the Context class is a nested member of the Mapper class. Unfortunately, referencing this class within Scala is super hairy. I saw a code snippet that did this:

override def map(key:Object, value:Text, context:Mapper[Object,Text,Text,IntWritable]#Context) = {
    for (t <-  value.toString().split("\\s")) {
      word.set(t)
      context.write(word, one)
    }
  }

But I couldn’t get this thing to compile. Kept getting this awesome error:

type Context is not a member of org.apache.hadoop.mapred.Mapper[java.lang.Object,org.apache.hadoop.io.Text,org.apache.hadoop.io.Text,org.apache.hadoop.io.IntWritable]

Note the gnarliness here. It’s not super clear whether or how Context is parameterized by the type parameters of Mapper. The docs for the Mapper class make it seem like you can refer to Context without type parameterization within the code of the class extending Mapper. But I didn’t see that until I had deleted everything and tried a different track, which was to use the Mapper interface in a class extending MapReduceBase.

Oddly, this interface hides the Context mechanic and instead introduces the Reporter class as a final argument to map(). I find this less intimidating for some reason. Probably because after years of working in Python and JavaScript my savviness around the Java type hierarchy is both rusty and obsolete. With the added type magicalness of Scala in the mix, I think I’ve got to steer towards the dumbest implementation possible. And at the level I’m at, it looks like I don’t ever have to touch or think about this Reporter.

So, starting with the example from Shadoop, now I just need to decode the Scala syntactic sugar that Shadoop provides to figure out what the hell is actually going on.

Consider:

  class Map extends MapReduceBase with Mapper[LongWritable, Text, Text, IntWritable] {

    val one = 1

    def map(key: LongWritable, value: Text, output: OutputCollector[Text, IntWritable], reporter: Reporter) =
      (value split " ") foreach (output collect (_, one))
  }

This is beautiful, concise code. But since I want to know something about the underlying program, I’m going to uglify it by removing the implicit conversions provided by Shadoop.

The Shadoop page provides a Java equivalent for this, but that’s not really what I want either. For some reason I demand the mildly more concise syntax of Scala over Java, but not the kind of condensed, beautiful syntax Scala makes possible with additional libraries.

This compiles at least:

  class Map extends MapReduceBase with Mapper[LongWritable, Text, Text,
 IntWritable] {   

    val one = new IntWritable(1); 

    def map(key: LongWritable, value: Text, output: OutputCollector[Text,
     IntWritable], reporter: Reporter) = {
      var line = value.toString();

      for(word <- line.split(" ")){
        output.collect(new Text(word), one)
      }
    }
  }

What I find a little counterintuitive about this is that the OutputCollector doesn’t act like a dictionary, overwriting the key-value pair with each call to collect(). I guess since I’m making a new Text object with each new entry, that makes sense even if the collector is implemented as a hash map of some kind. (Shadoop hides this mechanism with implicit conversions, which is rad of course.)

Next comes the reducer. The Shadoop code is this:

def reduce(key: Text, values: Iterator[IntWritable],
            output: OutputCollector[Text, IntWritable], reporter: Reporter) = {
  val sum = values reduceLeft ((a: Int, b: Int) => a + b)
  output collect (key, sum)
}

Ok, so there’s a problem here. The whole point of using Scala to code a MapReduce job is so that you can use Scala’s built-in reduceLeft function inside the reduce method of the Reducer. Because functional programming is awesome. By which I mean using built-in functions for things like map and reduce operations is awesome. And Scala supports functional programming, at the very least in that sense. And MapReduce as a computing framework is at least analogous to that paradigm in functional programming, and even has the same name. So, OMG.

Point being, no way in hell am I going to budge on this minor aesthetic point in my toy code. Instead, I’m going to brazenly pillage jweslley’s source code for the necessary implicit type conversion.

  implicit def javaIterator2Iterator[A](value: java.util.Iterator[A]) = new Iterator[A] {
    def hasNext = value.hasNext
    def next = value.next
  }

But not the other implicit conversions that would make my life easier. That would be too much.

Unfortunately, I couldn’t get this conversion to work right. Attempting to run the code gives me the following error:

[error] /home/cc/cs294/sp13/class/cs294-bg/hw3/wikilearn/src/main/scala/testIt/WordCount.scala:33: type mismatch;
[error]  found   : java.util.Iterator[org.apache.hadoop.io.IntWritable]
[error]  required: Iterator[org.apache.hadoop.io.IntWritable]
[error]       val sum = (values : scala.collection.Iterator[IntWritable]).reduceLeft (
[error]                  ^

It beats me why this doesn’t work. In my mental model of how implicit conversion is supposed to work, the java.util.Iterator[IntWritable] should be caught by the parameterized implicit conversion (which I defined within the Object scope) and converted no problemo.

I can’t find any easy explanation of this on-line at the moment. I suspect it’s a scoping issue or a limit to the parameterization of implicit conversions. Or maybe because Iterator is a trait, not a class? Instead I’m going to do the conversion explicitly in the method code.

After fussing around for a bit, I got:

    def reduce(key: Text, values: java.util.Iterator[IntWritable],
      output: OutputCollector[Text, IntWritable], reporter: Reporter) = {
      val svals = new scala.collection.Iterator[IntWritable]{
        def hasNext = values.hasNext
        def next = values.next
      }
      val sum = (svals : scala.collection.Iterator[IntWritable]).reduceLeft (
        (a: IntWritable, b: IntWritable) => new IntWritable(a.get() + b.get())
      )
      output collect (key, sum)
    }

…or, equivalently and more cleanly:

    def reduce(key: Text, values: java.util.Iterator[IntWritable],
      output: OutputCollector[Text, IntWritable], reporter: Reporter) = {

      val svals = new scala.collection.Iterator[Int]{
        def hasNext = values.hasNext
        def next = values.next.get
      }

      val sum = (svals : scala.collection.Iterator[Int]).reduceLeft (
        (a: Int, b: Int) => a + b
      )
      output collect (key, new IntWritable(sum))
    }
  }

I find the Scala syntax for implementing the methods of an abstract type inline here pretty great (I hadn’t encountered it before). Since Iterator[A] is abstract (a trait, in fact), you define the methods next and hasNext inside the curly braces. What an elegant way to let people subclass abstract types in an ad hoc way!
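The general pattern, with toy names of my own, is nothing Hadoop-specific:

trait Greeter {
  def greet(name: String): String
}

// an anonymous class implementing the trait on the spot
val g = new Greeter {
  def greet(name: String) = "hello, " + name
}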

There’s one more compile error I had to bust around. This line was giving me noise:

conf setOutputFormat classOf[TextOutputFormat[_ <: WritableComparable, _ <: Writable]]

It was complaining that WritableComparable needed a type parameter. Not confident I could figure out exactly which parameter to set, I just made the signature tighter.

conf setOutputFormat classOf[TextOutputFormat[Text, IntWritable]]

Only then did I learn that JobConf is a deprecated way of defining jobs. So I rewrote the WordCount object into a class implementing the Tool interface, using this Java snippet as an example to work from. To do that, I had to learn that to write a class that extends two interfaces in Scala, you need to use an “extends X with Y” syntax. Also, for trivial conditionals, Scala dispenses with Java’s ternary X ? Y : Z operator in favor of a single-line if (X) Y else Z. Though I will miss the evocative use of punctuation in the ternary construct, I’ve got to admit that Scala is keeping it real classy here.
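A rough sketch of that pattern (the class name and the body of run() here are placeholders, not my actual job setup):

import org.apache.hadoop.conf.{Configuration, Configured}
import org.apache.hadoop.util.{Tool, ToolRunner}

// the "extends X with Y" shape for a Tool-based driver
class WordCountDriver extends Configured with Tool {
  override def run(args: Array[String]): Int = {
    val conf = getConf()  // the Configuration injected by ToolRunner
    // ... job setup and submission would go here ...
    if (args.length == 2) 0 else 1   // Scala's if/else where Java would use ?:
  }
}

object WordCountDriver {
  def main(args: Array[String]) {
    System.exit(ToolRunner.run(new Configuration(), new WordCountDriver(), args))
  }
}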

Wait…ok, so I just learned that most of the code I was cargo culting was part of the deprecated coding pattern, which means I now have to switch it over to the new API. I learned this from somebody helpful in the #hadoop IRC channel.

[23:31] <sbenthall> what's the deal with org.apache.hadoop.mapreduce.Mapper and org.apache.hadoop.mapred.Mapper ??
[23:31] <sbenthall> is one of them deprecated?
[23:31] <sbenthall> which should I be using?
[23:32] <QwertyM> sbenthall: Use the new API
[23:32] <QwertyM> sbenthall: (i.e. mapreduce.*) both are currently supported but eventually the (mapred.*) may get deprecated
[23:32] <sbenthall> ok thanks QwertyM
[23:33] <QwertyM> sbenthall: as a reference, HBase uses mapreduce.* APIs completely for its provided MR jobs; and I believe Pig too uses the new APIs
[23:33] <sbenthall> is MapReduceBase part of the old API?
[23:33] <QwertyM> sbenthall: yes, its under the mapred.* package
[23:33] <sbenthall> ok, thanks.

Parachuting into the middle of a project has its drawbacks, but it’s always nice when a helpful community member can get you up to speed. Even if you’re asking near midnight on a Sunday.

Wait. I realize now that I’ve come full circle.

See, I’ve been writing these notes over the course of several days. Only just now am I realizing that I’m headed back to where I started: the Mapper class that takes the Context parameter that was giving me noise.

Looking back at the original error, it looks like that too was a result of mixing the two APIs. So maybe I can now safely shift everything BACK to the new API, drawing heavily on this code.

It occurs to me that this is one of those humbling programming experiences when you discover that the reason why your thing was broken was not the profound complexity of the tool you were working with, but your own stupidity over something trivial. This happens to me all the time.

Thankfully, I can’t ponder that much now, since it’s become clear that the instructional Hadoop cluster on which we’ve been encouraged to do our work is highly unstable. So I’m going to take the bet that it will be more productive for me to work locally, even if that means installing Hadoop locally on my Ubuntu machine.

I thought I was doing pretty good with the installation until I got to the point of hitting the “ON” switch on Hadoop. I got this:

sb@lebensvelt:~$ /usr/local/hadoop/bin/start-all.sh 
Warning: $HADOOP_HOME is deprecated.

starting namenode, logging to /usr/local/hadoop/libexec/../logs/hadoop-sb-namenode-lebensvelt.out
localhost: ssh: connect to host localhost port 22: Connection refused
localhost: ssh: connect to host localhost port 22: Connection refused
starting jobtracker, logging to /usr/local/hadoop/libexec/../logs/hadoop-sb-jobtracker-lebensvelt.out
localhost: ssh: connect to host localhost port 22: Connection refused

I googled around and it looks like this problem is due to not having an SSH server running locally. Since I’m running Ubuntu, I went ahead and followed these instructions. In the process I managed to convince my computer that I was undergoing a man-in-the-middle attack between myself and myself.

I fixed that with

$ ssh-keygen -R localhost

and successfully got Hadoop running with

$ /usr/local/hadoop/bin/start-all.sh 

only to be hung up on this error

$ hadoop fs -ls
Warning: $HADOOP_HOME is deprecated.

ls: Cannot access .: No such file or directory.

which somebody who runs an Indian matrimony search engine had run into and documented the fix for. (Right way to spell it is

hadoop fs -ls .

With an extra dot.)

There’s a point to me writing all this out, by the way. An important part of participation in open source software, or the hacker ethic in general, is documenting one’s steps so that others who follow the same paths can benefit from what you’ve gone through. I’m going into a bit more detail about this process than is really helpful because in my academic role I’m dealing with a lot of social scientist types who really don’t know what this kind of work entails. Let’s face it: programming is a widely misunderstood discipline which seems like an utter mystery to those that aren’t deeply involved in it. Much of this has to do with the technical opacity of the work. But another part of why it’s misunderstood is that problem solving in the course of development depends on a vast and counter-intuitive cyberspace of documentation (often generated from code comments, so written by some core developer), random blog posts, chat room conversations, and forum threads. Easily 80% of the work when starting out on a new project like this is wrestling with all the minutiae of configuration on a particular system (contingent on operating system and hardware, in many cases) and the idioms of the programming language and environment.

The amount of time one has to invest in any particular language or toolkit necessarily creates a tribalism among developers, because their identities wind up intertwined with the tools they use. As I hack on this thing, however incompetently, I’m becoming a Scala developer. That’s similar to saying that I’m becoming a German speaker. My conceptual vocabulary, once I’ve learned how to get things done in Scala, is going to be different than it was before. In fact, that’s one of the reasons why I’m insisting on teaching myself Scala in the first place–because I know that it is a conceptually deep and rigorous language which will have something to teach me about the Nature of Things.

Some folks in my department are puzzled at the idea that technical choices in software development might be construed as ethical choices by the developers themselves. Maybe it’s easier to understand that if you see that in choosing a programming language you are in many ways choosing an ontology or theoretical framework through which to conceive of problem-solving. Of course, choice of ontology will influence one’s ethical principles, right?

But I digress.

So I have Hadoop running on my laptop now, and a .jar file that compiles in SBT. So now all I need to do is run the .jar using the hadoop jar command, right?

Nope, not yet…

Exception in thread "main" java.lang.NoClassDefFoundError: scala/ScalaObject

OK, so the problem is that I haven’t included scala-library.jar on my Hadoop runtime classpath.

I solved this by making a symbolic link from the Hadoop /lib directory to the .jar in my Scala installation.

ln -s /usr/local/share/scala-2.9.2/lib/scala-library.jar /usr/local/hadoop/lib/scala-library.jar

That seemed to work, only now I have the most mundane and inscrutable of Java errors to deal with:

Exception in thread "main" java.lang.NullPointerException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:601)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

I had no idea how to proceed from here. This time, nobody in the #hadoop channel could help me either.

So once again I switched to a new hunk of code to work from, this time the WordCount.scala file from Derrick Cheng’s ScalaOnHadoop project. Derrick posted a link to this project on our course’s Piazza earlier, which was awesome.

Another digression. There’s a lot of talk now about on-line education. People within the university context feel vaguely threatened by MOOCs, believing that there will be a superstar effect that advantages first movers, but many are uncomfortable making that first move. Taek Lim in my department is starting to study user interfaces to support collaboration in on-line learning.

My own two cents on this are that the open software model is about as dynamic a collaborative environment as you can get, and once people started to use our course discussion management system, Piazza, to discuss the assignment almost as if it were an open source mailing list, we started to get a lot more out of it and learned a lot from each other. I’m not the first person to see this potential of the open source model for education, of course. I’m excited to be attending the POSSE workshop, which is about that intersection, this summer. At this rate, it looks like I will be co-teaching a course along these lines in the Fall targeted at the I School Masters students, which is exciting!

So, anyway, I’m patching Derrick Cheng’s code. I’m not working with BIDMat yet, so I leave that out of the build, remove all references to it, and get the thing to compile…and run! I got somebody else’s Scala WordCount to run!

This seems like such a triumph. It’s taken a lot of tinkering and beating my head against a wall to get this far. (Not all at once–I’ve written this post over the course of several days. I realize that it’s a little absurd.)

But wait! I’m not out of the woods yet. I check the output of my MapReduce job with

hadoop fs -cat test-output/part-r-00000

and the end of my output file looks like this:

§	1
§	1
§	1
§	1
§	1
§	1
§	1
§	1
§	1
Æschylus	1
Æsop	1
Æsop	1
Æsop	1
É	1
Élus,_	1
à	1
à	1
à	1
æons	1
æsthetic	1
è	1
è	1
è	1
état_.	1
The	1
The	1

What’s going on here? Well, it looks like I successfully mapped each occurrence of each word in the original text to a key-value pair. But something went wrong in the reduce step, which was supposed to combine all the occurrences into a single count for each word.

That is just fine for me, because I’d rather be using my own Reduce function. Because it uses Scala’s functional reduceLeft, which is sweet! Why even write a MapReduce job in a functional programming language if you can’t use a built-in language reduce in the Reduce step?

Ok, mine doesn’t work either.

Apparently, the reason for this is that the type signature I’ve been using for the Reducer’s reduce method has been wrong all along. And when that happens, the code compiles but Reducer runs its default reduce function, which is the identity function.

It’s almost (almost!) as if it would have made more sense to start by just reading the docs and following the instructions.

Now I have edited the map and reduce functions so that they have the right type signatures. To get this exactly right, I looked at a different example file, and did a bit more tinkering.

At last, it works.

Now, at last, I understand how this Context member class works. The problem was that I was trying to use it with the mapred.Mapper class from the old API. So much of my goose chase was due to not reading things carefully enough.
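For anyone following along, the shape of the new-API version is roughly this (a from-memory sketch with illustrative class names, not the exact code I ended up with):

import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.{Mapper, Reducer}
import scala.collection.JavaConversions._

// New-API mapper: a single Context argument instead of OutputCollector + Reporter.
class TokenMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  val one  = new IntWritable(1)
  val word = new Text()

  override def map(key: LongWritable, value: Text,
      context: Mapper[LongWritable, Text, Text, IntWritable]#Context) = {
    for (t <- value.toString.split("\\s+")) {
      word.set(t)
      context.write(word, one)
    }
  }
}

// New-API reducer: values arrive as a java.lang.Iterable, not a java.util.Iterator.
// Getting this signature wrong is exactly what silently leaves the default
// identity reduce in place.
class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
      context: Reducer[Text, IntWritable, Text, IntWritable]#Context) = {
    val sum = values.map(_.get).reduceLeft(_ + _)  // reduceLeft at last
    context.write(key, new IntWritable(sum))
  }
}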

On the other hand, I feel enriched by the whole meandering process. Realizing that my programming faults were mine and not due to the complexity of the tools I was using paradoxically gives me more confidence in my understanding of the tools moving forward. And engaging with the distributed expertise on the subject–through StackOverflow, documentation generated from the original coders, helpful folks on IRC, blog posts, and so on–is much more compelling when one is driven by concrete problem-solving goals, even when those goals are somewhat arbitrary. Had I learned to use Hadoop in a less circuitous way, my understanding would probably be much more brittle. I am integrating new knowledge of Hadoop, Scala, and Java (it’s been a long time) with existing background knowledge. After a good night’s sleep, with any luck it will be part of my lifeworld!

This is the code I wound up with, by the way. I don’t suggest you use it.