Digifesto

Tag: scala

@#$%! : variance annotations in Scala’s unsound parameterized types

[error] /home/sb/ischool/cs294/hw3/src/main/scala/TestScript.scala:32: type mismatch;
[error] found : Array[wikilearn.TestScript.parser.Page]
[error] required: Array[wikilearn.WikiParser#Page]
[error] Note: wikilearn.TestScript.parser.Page <: wikilearn.WikiParser#Page, but class Array is invariant in type T.
[error] You may wish to investigate a wildcard type such as `_ <: wikilearn.WikiParser#Page`. (SLS 3.2.10)

wtf, Scala.  You know exactly what I’m trying to do here.

EDIT: I sent a link to the above post to David Winslow. He responded with a crystal clear explanation that was so great I asked him if I could include it here. This is it, below:

It’s a feature, not a bug :) This is actually the specific issue that Dart had in mind when they put this note in the language spec:

The type system is unsound, due to the covariance of generic types. This is a deliberate choice (and undoubtedly controversial). Experience has shown that sound type rules for generics fly in the face of programmer intuition. It is easy for tools to provide a sound type analysis if they choose, which may be useful for tasks like refactoring.

Which of course caused some hubbub among the static typing crowd.

The whole issue comes down to the variance annotations of type parameters. Variance influences how type parameters relate to the subtyping relationships of parameterized types:

Given types A and B, where A is a supertype of B:

  • trait Invariant[T] means there is no subtype relationship between Invariant[A] and Invariant[B]. (Either could be used as an Invariant[_], though.)
  • trait Covariant[+T] means Covariant[A] is a supertype of Covariant[B].
  • trait Contravariant[-T] means Contravariant[A] is a subtype of Contravariant[B].

The basic rule of thumb is that if you produce values of type T, you can be covariant in T, and if you consume values of type U, you can be contravariant in type U. For example, Function1 has two type parameters, the parameter type A and the result type T. It is contravariant in A and covariant in T. An (Any => String) can be used where a (String => Any) is expected, but not the other way around.
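A quick sketch of my own to make that concrete (this bit isn’t part of David’s note, and the object name is made up):

object Function1VarianceSketch {
  val f: Any => String = (x: Any) => x.toString

  // Compiles: where a (String => Any) is expected, an (Any => String) will do,
  // since Function1 is contravariant in its argument and covariant in its result.
  val g: String => Any = f

  // The reverse direction is rejected by the compiler:
  // val h: Any => String = g
}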

So, what about the type parameter for Array[T]? Among other operations, Arrays provide:

class Array[T] {
  def apply(i: Int): T // "producing" a T
  def update(i: Int, t: T): Unit // "consuming" a T
}

When the type parameter appears in both covariant and contravariant positions, the only option is to make it invariant.

Now, it’s interesting to note that in the Java language Arrays are treated as if they are covariant. This means that you can write a Java program that doesn’t use casts, passes the typechecker, and generates a type error at runtime; the body of main() would look like:

String[] strings = new String[1];
Object[] objects = strings;
objects[0] = Integer.valueOf(0); // the runtime error occurs at this step, but even if it didn't: 
System.out.println(strings[0]); // what happens here?

Anyway, the upshot is that immutable collections only use their types in covariant positions (you can get values out, but never insert) so they are much handier. Does your code work better if you replace your usage of Array with Vector? Alternatively, you can always provide the type parameter when you construct your array. Array("") is an Array[String], but Array[AnyRef]("") is an Array[AnyRef].
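Again, a toy example of my own rather than David’s, showing the difference:

object ImmutableVsArray {
  val strings: Vector[String] = Vector("a", "b")
  val anys: Vector[Any] = strings            // fine: Vector is declared Vector[+A]

  val stringArray: Array[String] = Array("a", "b")
  // val anyArray: Array[Any] = stringArray  // rejected: Array[T] is invariant in T

  val ok: Array[AnyRef] = Array[AnyRef]("")  // fine once the type parameter is explicit
}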

Hadoop with Scala: hacking notes

I am trying to learn how to use Hadoop. I am trying to learn to program in Scala. I mostly forget how to program in Java. In this post I will take notes on things that come up as I try to get my frickin’ code to compile so I can run a Hadoop job.

There was a brief window in my life when I was becoming a good programmer. It was around the end of my second year as a professional software engineer that I could write original code to accomplish novel tasks.

Since then, the tools and my tasks have changed. For the most part, my coding has been about projects for classes, really just trying to get a basic competence in commodity open tools. So, my “programming” consists largely of cargo-culting code snippets and trying to get them to run in a slightly modified environment.

Right now I’ve got an SBT project; I’m trying to write a MapReduce job in Scala that will compile as a .jar that I can run on Hadoop.

One problem I’m having is that there are apparently several different coding patterns for doing this, and several frameworks that are supposed to make my life easier. These include SMR, Shadoop, and Scalding. But since I’m doing this for a class and I actually want to learn something about how Hadoop works, I’m worried about working at too high a level of abstraction.

So I’m somewhat perversely taking the Scala Wordcount example from jweslley’s Shadoop and making it dumber. I.e., not using Shadoop.

One thing that has been confusing as hell is that Hadoop has a Mapper interface and a Mapper class, both with map() functions (1, 2), but those functions have different type signatures.

I started working with some other code that used the second map() function. One of the arguments to this function is of type Mapper.Context. I.e., the Context class is a nested member of the Mapper class. Unfortunately, referencing this class within Scala is super hairy. I saw a code snippet that did this:

override def map(key:Object, value:Text, context:Mapper[Object,Text,Text,IntWritable]#Context) = {
    for (t <-  value.toString().split("\\s")) {
      word.set(t)
      context.write(word, one)
    }
  }

But I couldn’t get this thing to compile. Kept getting this awesome error:

type Context is not a member of org.apache.hadoop.mapred.Mapper[java.lang.Object,org.apache.hadoop.io.Text,org.apache.hadoop.io.Text,org.apache.hadoop.io.IntWritable]

Note the gnarliness here. It’s not super clear whether or how Context is parameterized by the type parameters of Mapper. The docs for the Mapper class make it seem like you can refer to Context without type parameterization within the code of the class extending Mapper. But I didn’t see that until I had deleted everything and tried a different track, which was to use the Mapper interface in a class extending MapReduceBase.

Oddly, this interface hides the Context mechanic and instead introduces the Reporter class as a final argument to map(). I find this less intimidating for some reason. Probably because after years of working in Python and JavaScript my savviness around the Java type hierarchy is both rusty and obsolete. With the type magicalness of Scala adding complexity to the mix, I think I’ve got to steer towards the dumbest implementation possible. And at the level I’m at, it looks like I don’t ever have to touch or think about this Reporter.

So, starting with the example from Shadoop, now I just need to decode the Scala syntactic sugar that Shadoop provides to figure out what the hell is actually going on.

Consider:

  class Map extends MapReduceBase with Mapper[LongWritable, Text, Text, IntWritable] {

    val one = 1

    def map(key: LongWritable, value: Text, output: OutputCollector[Text, IntWritable], reporter: Reporter) =
      (value split " ") foreach (output collect (_, one))
  }

This is beautiful, concise code. But since I want to know something about the underlying program, I’m going to uglify it by removing the implicit conversions provided by Shadoop.

The Shadoop page provides a Java equivalent for this, but that’s not really what I want either. For some reason I demand the mildly more concise syntax of Scala over Java, but not the kind of condensed, beautiful syntax Scala makes possible with additional libraries.

This compiles at least:

  class Map extends MapReduceBase with Mapper[LongWritable, Text, Text, IntWritable] {

    val one = new IntWritable(1)

    def map(key: LongWritable, value: Text,
            output: OutputCollector[Text, IntWritable], reporter: Reporter) = {
      var line = value.toString()

      for (word <- line.split(" ")) {
        output.collect(new Text(word), one)
      }
    }
  }

What I find a little counterintuitive about this is that the OutputCollector doesn’t act like a dictionary, overwriting the key-value pair with each call to collect(). I guess since I’m making a new Text object with each new entry, that makes sense even if the collector is implemented as a hash map of some kind. (Shadoop hides this mechanism with implicit conversions, which is rad of course.)

Next comes the reducer. The Shadoop code is this:

def reduce(key: Text, values: Iterator[IntWritable],
            output: OutputCollector[Text, IntWritable], reporter: Reporter) = {
  val sum = values reduceLeft ((a: Int, b: Int) => a + b)
  output collect (key, sum)
}

Ok, so there’s a problem here. The whole point of using Scala to code a MapReduce job is so that you can use Scala’s built-in reduceLeft function inside the reduce method of the Reducer. Because functional programming is awesome. By which I mean using built-in functions for things like map and reduce operations is awesome. And Scala supports functional programming, at the very least in that sense. And MapReduce as a computing framework is at least analogous to that paradigm in functional programming, and even has the same name. So, OMG.

Point being, no way in hell am I going to budge on this minor aesthetic point in my toy code. Instead, I’m going to brazenly pillage jweslley’s source code for the necessary implicit type conversion.

  implicit def javaIterator2Iterator[A](value: java.util.Iterator[A]) = new Iterator[A] {
    def hasNext = value.hasNext
    def next = value.next
  }

But not the other implicit conversions that would make my life easier. That would be too much.

Unfortunately, I couldn’t get this conversion to work right. Attempting to run the code gives me the following error:

[error] /home/cc/cs294/sp13/class/cs294-bg/hw3/wikilearn/src/main/scala/testIt/WordCount.scala:33: type mismatch;
[error]  found   : java.util.Iterator[org.apache.hadoop.io.IntWritable]
[error]  required: Iterator[org.apache.hadoop.io.IntWritable]
[error]       val sum = (values : scala.collection.Iterator[IntWritable]).reduceLeft (
[error]                  ^

It beats me why this doesn’t work. In my mental model of how implicit conversion is supposed to work, the java.util.Iterator[IntWritable] should be caught by the parameterized implicit conversion (which I defined within the Object scope) and converted no problemo.

I can’t find any easy explanation of this on-line at the moment. I suspect it’s a scoping issue or a limit to the parameterization of implicit conversions. Or maybe because Iterator is a trait, not a class? Instead I’m going to do the conversion explicitly in the method code.
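For reference, here is a stripped-down, self-contained version of the same conversion using plain Ints, which is how I understand it is supposed to behave when the implicit is visible at the call site (the object and the sum method are made up for illustration):

object ImplicitIteratorSketch {
  implicit def javaIterator2Iterator[A](value: java.util.Iterator[A]): Iterator[A] =
    new Iterator[A] {
      def hasNext = value.hasNext
      def next = value.next
    }

  // reduceLeft is not a member of java.util.Iterator, so the compiler looks for
  // an implicit view in scope and (here, at least) finds the one above.
  def sum(values: java.util.Iterator[Int]): Int =
    values.reduceLeft(_ + _)
}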

After fussing around for a bit, I got:

    def reduce(key: Text, values: java.util.Iterator[IntWritable],
      output: OutputCollector[Text, IntWritable], reporter: Reporter) = {
      val svals = new scala.collection.Iterator[IntWritable] {
        def hasNext = values.hasNext
        def next = values.next
      }
      val sum = (svals: scala.collection.Iterator[IntWritable]).reduceLeft(
        (a: IntWritable, b: IntWritable) => new IntWritable(a.get() + b.get())
      )
      output collect (key, sum)
    }

…or, equivalently and more cleanly:

    def reduce(key: Text, values: java.util.Iterator[IntWritable],
      output: OutputCollector[Text, IntWritable], reporter: Reporter) = {

      val svals = new scala.collection.Iterator[Int]{
        def hasNext = values.hasNext
        def next = values.next.get
      }

      val sum = (svals : scala.collection.Iterator[Int]).reduceLeft (
        (a: Int, b: Int) => a + b
      )
      output collect (key, new IntWritable(sum))
    }
  }

I find the Scala syntax for filling in the missing methods here pretty great (I hadn’t encountered it before). Since Iterator[A] is a trait with abstract methods, you define next and hasNext inside the curly braces of an anonymous class. What an elegant way to let people implement traits and abstract classes in an ad hoc way!

There’s one more compile error I had to bust around. This line was giving me noise:

conf setOutputFormat classOf[TextOutputFormat[_ <: WritableComparable, _ <: Writable]]

It was complaining that WritableComparable needed a type parameter. Not confident I could figure out exactly which parameter to set, I just made the signature tighter.

conf setOutputFormat classOf[TextOutputFormat[Text, IntWritable]]

Only then did I learn that JobConf is a deprecated way of defining jobs. So I rewrote the WordCount object into a class implementing the Tool interface, using this Java snippet as an example to work from. To do that, I had to learn that to write a class that extends two interfaces in Scala, you use the “extends X with Y” syntax. Also, for trivial conditionals Scala dispenses with Java’s ternary X ? Y : Z operator in favor of a single-line if (X) Y else Z. Though I will miss the evocative use of punctuation in the ternary construct, I’ve got to admit that Scala is keeping it real classy here.
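Here is a stripped-down sketch of those two points together (the class name and the argument check are made up, not my real job code): “extends X with Y” to mix in the Tool interface, and if/else as an expression.

import org.apache.hadoop.conf.Configured
import org.apache.hadoop.util.Tool

class WordCountTool extends Configured with Tool {
  def run(args: Array[String]): Int = {
    // Scala's single-line conditional in place of Java's ternary operator
    if (args.length >= 2) 0 else 1
  }
}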

Wait…ok, so I just learned that most of the code I was cargo culting was part of the deprecated coding pattern, which means I now have to switch it over to the new API. I learned this from somebody helpful in the #hadoop IRC channel.

[23:31] <sbenthall> what's the deal with org.apache.hadoop.mapreduce.Mapper and org.apache.hadoop.mapred.Mapper ??
[23:31] <sbenthall> is one of them deprecated?
[23:31] <sbenthall> which should I be using?
[23:32] <QwertyM> sbenthall: Use the new API
[23:32] <QwertyM> sbenthall: (i.e. mapreduce.*) both are currently supported but eventually the (mapred.*) may get deprecated
[23:32] <sbenthall> ok thanks QwertyM
[23:33] <QwertyM> sbenthall: as a reference, HBase uses mapreduce.* APIs completely for its provided MR jobs; and I believe Pig too uses the new APIs
[23:33] <sbenthall> is MapReduceBase part of the old API?
[23:33] <QwertyM> sbenthall: yes, its under the mapred.* package
[23:33] <sbenthall> ok, thanks.

Parachuting into the middle of a project has its drawbacks, but it’s always nice when a helpful community member can get you up to speed. Even if you’re asking near midnight on a Sunday.

Wait. I realize now that I’ve come full circle.

See, I’ve been writing these notes over the course of several days. Only just now am I realizing that I’m now going back to where I started, with the Mapper class that takes the Context parameter that was giving me noise.

Looking back at the original error, it looks like that too was a result of mixing two APIs. So maybe I can now safely shift everything BACK to the new API, drawing heavily on this code.

It occurs to me that this is one of those humbling programming experiences when you discover that the reason why your thing was broken was not the profound complexity of the tool you were working with, but your own stupidity over something trivial. This happens to me all the time.

Thankfully, I can’t ponder that much now, since it’s become clear that the instructional Hadoop cluster on which we’ve been encouraged to do our work is highly unstable. So I’m going to take the bet that it will be more productive for me to work locally, even if that means installing Hadoop locally on my Ubuntu machine.

I thought I was doing pretty good with the installation until I got to the point of hitting the “ON” switch on Hadoop. I got this:

sb@lebensvelt:~$ /usr/local/hadoop/bin/start-all.sh 
Warning: $HADOOP_HOME is deprecated.

starting namenode, logging to /usr/local/hadoop/libexec/../logs/hadoop-sb-namenode-lebensvelt.out
localhost: ssh: connect to host localhost port 22: Connection refused
localhost: ssh: connect to host localhost port 22: Connection refused
starting jobtracker, logging to /usr/local/hadoop/libexec/../logs/hadoop-sb-jobtracker-lebensvelt.out
localhost: ssh: connect to host localhost port 22: Connection refused

I googled around and it looks like this problem is due to not having an SSH server running locally. Since I’m running Ubuntu, I went ahead and followed these instructions. In the process I managed to convince my computer that I was undergoing a man-in-the-middle attack between myself and myself.

I fixed that with

$ ssh-keygen -R localhost

and successfully got Hadoop running with

$ /usr/local/hadoop/bin/start-all.sh 

only to be hung up on this error

$ hadoop fs -ls
Warning: $HADOOP_HOME is deprecated.

ls: Cannot access .: No such file or directory.

which somebody who runs an Indian matrimony search engine had run into and documented the fix for. (The right way to spell it is

hadoop fs -ls .

With an extra dot.)

There’s a point to me writing all this out, by the way. An important part of participation in open source software, or the hacker ethic in general, is documenting one’s steps so that others who follow the same paths can benefit from what you’ve gone through. I’m going into a bit more detail about this process than is really helpful because in my academic role I’m dealing with a lot of social scientist types who really don’t know what this kind of work entails. Let’s face it: programming is a widely misunderstood discipline which seems like an utter mystery to those that aren’t deeply involved in it. Much of this has to do with the technical opacity of the work. But another part of why it’s misunderstood is that problem solving in the course of development depends on a vast and counter-intuitive cyberspace of documentation (often generated from code comments, so written by some core developer), random blog posts, chat room conversations, and forum threads. Easily 80% of the work when starting out on a new project like this is wrestling with all the minutiae of configuration on a particular system (operating system and hardware contingent, in many cases) and the idioms of the programming language and environment.

The amount of time one has to invest in any particular language or toolkit necessarily creates a tribalism among developers, because their identities wind up being intertwined with the tools they use. As I hack on this thing, however incompetently, I’m becoming a Scala developer. That’s similar to saying that I’m becoming a German speaker. My conceptual vocabulary, once I’ve learned how to get things done in Scala, is going to be different than it was before. In fact, that’s one of the reasons why I’m insisting on teaching myself Scala in the first place–because I know that it is a conceptually deep and rigorous language which will have something to teach me about the Nature of Things.

Some folks in my department are puzzled at the idea that technical choices in software development might be construed as ethical choices by the developers themselves. Maybe it’s easier to understand that if you see that in choosing a programming language you are in many ways choosing an ontology or theoretical framework through which to conceive of problem-solving. Of course, choice of ontology will influence one’s ethical principles, right?

But I digress.

So I have Hadoop running on my laptop now, and a .jar file that compiles in SBT. So now all I need to do is run the .jar using the hadoop jar command, right?

Nope, not yet…

Exception in thread "main" java.lang.NoClassDefFoundError: scala/ScalaObject

OK, so the problem is that I haven’t included scala-library.jar on my Hadoop runtime classpath.

I solved this by making a symbolic link from the Hadoop /lib directory to the .jar in my Scala installation.

ln -s /usr/local/share/scala-2.9.2/lib/scala-library.jar /usr/local/hadoop/lib/scala-library.jar

That seemed to work, only now I have the most mundane and inscrutable of Java errors to deal with:

Exception in thread "main" java.lang.NullPointerException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:601)
	at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

I had no idea how to proceed from here. This time, nobody in the #hadoop channel could help me either.

So once again I switched to a new hunk of code to work from, this time the WordCount.scala file from Derrick Cheng’s ScalaOnHadoop project. Derrick posted a link to this project on our course’s Piazza earlier, which was awesome.

Another digression. There’s a lot of talk now about on-line education. People within the university context feel vaguely threatened by MOOCs, believing that there will be a superstar effect that advantages first movers, but many are uncomfortable making that first move. Taek Lim in my department is starting to study user interfaces to support collaboration in on-line learning.

My own two cents on this are that the open software model is about as dynamic a collaborative environment as you can get, and at the point when people started to use our course discussion management system, Piazza, to discuss the assignment almost as if it were an open source mailing list, we started to get a lot more out of it and learned a lot from each other. I’m not the first person to see this potential of the open source model for education, of course. I’m excited to be attending the POSSE workshop, which is about that intersection, this summer. At this rate, it looks like I will be co-teaching a course along these lines in the Fall targeted at the I School Masters students, which is exciting!

So, anyway, I’m patching Derrick Cheng’s code. I’m not working with BIDMat yet, so I leave that out of the build, remove all references to it, and get the thing to compile…and run! I got somebody else’s Scala WordCount to run!

This seems like such a triumph. It’s taken a lot of tinkering and beating my head against a wall to get this far. (Not all at once–I’ve written this post over the course of several days. I realize that it’s a little absurd.)

But wait! I’m not out of the woods yet. I check the output of my MapReduce job with

hadoop fs -cat test-output/part-r-00000

and the end of my output file looks like this:

§	1
§	1
§	1
§	1
§	1
§	1
§	1
§	1
§	1
Æschylus	1
Æsop	1
Æsop	1
Æsop	1
É	1
Élus,_	1
à	1
à	1
à	1
æons	1
æsthetic	1
è	1
è	1
è	1
état_.	1
The	1
The	1

What’s going on here? Well, it looks like I successfully mapped each occurrence of each word in the original text to a key-value pair. But something went wrong in the reduce step, that was supposed to combine all the occurrences into a single count for each word.

That is just fine for me, because I’d rather be using my own Reduce function. Because it uses Scala’s functional reduceLeft, which is sweet! Why even write a MapReduce job in a functional programming language if you can’t use a built-in language reduce in the Reduce step?

Ok, mine doesn’t work either.

Apparently, the reason for this is that the type signature I’ve been using for the Reducer’s reduce method has been wrong all along. And when that happens, the code compiles but Reducer runs its default reduce function, which is the identity function.

It’s almost (almost!) as if it would have made more sense to start by just reading the docs and following the instructions.

Now I have edited the map and reduce functions so that they have the right type signatures. To get this exactly right, I looked at a different file. I also tinkered.
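For the record, the reduce side ends up shaped roughly like this against the new mapreduce.* API (a sketch with a made-up class name, not my exact file):

import org.apache.hadoop.io.{IntWritable, Text}
import org.apache.hadoop.mapreduce.Reducer
import scala.collection.JavaConversions._

class SumReducer extends Reducer[Text, IntWritable, Text, IntWritable] {
  override def reduce(key: Text, values: java.lang.Iterable[IntWritable],
      context: Reducer[Text, IntWritable, Text, IntWritable]#Context) {
    // JavaConversions lets the java.lang.Iterable act like a Scala collection,
    // so the built-in reduceLeft finally gets to do the counting.
    val sum = values.map(_.get).reduceLeft(_ + _)
    context.write(key, new IntWritable(sum))
  }
}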

At last, it works.

Now, at last, I understand how this Context member class works. The problem was that I was trying to use it with the mapred.Mapper class from the old API. So much of my goose chase was due to not reading things carefully enough.
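In other words, against the new API the map side looks something like this (again a sketch with a made-up class name, following the shape of the snippet I started from):

import org.apache.hadoop.io.{IntWritable, LongWritable, Text}
import org.apache.hadoop.mapreduce.Mapper

class TokenMapper extends Mapper[LongWritable, Text, Text, IntWritable] {
  val one = new IntWritable(1)
  val word = new Text()

  override def map(key: LongWritable, value: Text,
      context: Mapper[LongWritable, Text, Text, IntWritable]#Context) {
    // Context is a member of the (fully parameterized) mapreduce.Mapper class
    for (t <- value.toString.split("\\s+")) {
      word.set(t)
      context.write(word, one)
    }
  }
}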

On the other hand, I feel enriched by the whole meandering process. Realizing that my programming faults were mine and not due to the complexity of the tools I was using paradoxically gives me more confidence in my understanding of the tools moving forward. And engaging with the distributed expertise on the subject–through StackOverflow, documentation generated from the original coders, helpful folks on IRC, blog posts, and so on–is much more compelling when one is driven by concrete problem-solving goals, even when those goals are somewhat arbitrary. Had I learned to use Hadoop in a less circuitous way, my understanding would probably be much more brittle. I am integrating new knowledge of Hadoop, Scala, and Java (it’s been a long time) with existing background knowledge. After a good night’s sleep, with any luck it will be part of my lifeworld!

This is the code I wound up with, by the way. I don’t suggest you use it.

Notes on using the neo4j-scala package, Part 1

Encouraged by the reception of last week’s hacking notes, I’ve decided to keep experimenting with Neo4j and Scala. Taking Michael Hunger’s advice, I’m looking into the neo4j-scala package. My goal is to port my earlier toy program to this library to take advantage of more Scala language features.

These are my notes from stumbling through it. I’m halfway through.

To start with, I had trouble wrangling the dependencies. Spoiled by scripting languages, I’ve been half-assing my way around Maven for years, so I got burned a bit.

What happened was that in earlier messing around in my project, I had installed an earlier version of neo4j-scala from a different github repository. Don’t use that one. At the time of this writing, FaKoD‘s version is much more up to date and featureful.

I was getting errors that looked like this:

> [error] error while loading Neo4jWrapper, Scala signature Neo4jWrapper has
> wrong version
> [error]  expected: 5.0
> [error]  found: 4.1 in
> /home/sb/.ivy2/cache/org.neo4j/neo4j-scala/bundles/neo4j-scala-0.9.9-SNAPSHOT.jar(org/neo4j/scala/Neo4jWrapper.class)

The only relevant web pages I could find on this suggested that the problem had to do with having compiled the dependency in a different version of Scala. Since I had the Ubuntu package installed, which is pegged at 2.7.7, this seemed plausible. I went through a lot of flailing to reinstall Scala and rebuild the package, but to no avail.

That wasn’t the problem. Rather, when I asked him about it, FaKoD patiently pointed out that the older library has version 0.9.9-SNAPSHOT, whereas the newer one is version 0.1.0-SNAPSHOT. So, my sbt build configuration file has this line now:

libraryDependencies += "org.neo4j" % "neo4j-scala" % "0.1.0-SNAPSHOT"

Thanks to FaKoD’s walking me through these problems, I stopped getting cryptic errors and could start hacking.

Here’s what I had to start with, copying out of one of neo4j-scala’s tests:

import org.neo4j.kernel.EmbeddedGraphDatabase
import org.neo4j.graphdb._
import collection.JavaConversions._
import org.neo4j.scala.{EmbeddedGraphDatabaseServiceProvider, Neo4jWrapper}

class Krow extends Neo4jWrapper with EmbeddedGraphDatabaseServiceProvider {
}

Running this in sbt, I get this error:

[error] /home/sb/dev/krow/src/main/scala/Krow.scala:6: class Krow needs to be abstract, /
since method neo4jStoreDir in trait EmbeddedGraphDatabaseServiceProvider /
of type => String is not defined

That’s because EmbeddedGraphDatabaseServiceProvider (this code is written by a German, I gather) has an abstract method that I haven’t defined.

What I find neat is that this is an abstract method–it’s a function that takes no arguments and returns a String. But Scala is smart enough to allow it to be implemented either by a method or, more naturally, by a val. So, this compiles:

class Krow extends Neo4jWrapper with EmbeddedGraphDatabaseServiceProvider {
  val neo4jStoreDir = "var/graphdb"
}

but so does this:

class Krow extends Neo4jWrapper with EmbeddedGraphDatabaseServiceProvider {
  def neo4jStoreDir = {
    var a = "var/"
    var b = "graphdb"

    a + b
  }
}

(Functions in Scala can be defined by a block of code in curly braces, with the value of the last expression returned.)

Next, I worked on rewriting my toy app, using this unittest as a guide.

Here was the code from my original experiment:

    var first : Node = null
    var second : Node = null

    val neo: GraphDatabaseService = new EmbeddedGraphDatabase("var/graphdb")
    var tx: Transaction = neo.beginTx()

    implicit def string2relationshipType(x: String) = DynamicRelationshipType.withName(x)

    try {
      first = neo.createNode()
      first.setProperty("name","first")

      second = neo.createNode()
      second.setProperty("name","second")

      first.createRelationshipTo(second, "isRelatedTo" : RelationshipType)

      tx.success()
      println("added nodes")
    } catch {
      case e: Exception => println(e)
    } finally {
      tx.finish() // wrap in try, finally   

      println("finished transaction 1")
    }

You could see why I would like it to be more concise. Here’s a first pass on what neo4j-scala let me whittle it down to:

    var first : Node = null
    var second : Node = null

    withTx {
      neo =>
          first = neo.gds.createNode()  
          first.setProperty("name","first")

          second = neo.gds.createNode()
          second.setProperty("name","second")

          first --> "isRelatedTo" --> second
    }

There is a lot of magic going on and it took me a while to get my head around it.

The point of withTx is to wrap around the try/success/finally pattern needed for most Neo4j transactions. Here’s the code for it:

  def withTx[T <: Any](operation: DatabaseService => T): T = {
    val tx = synchronized {
      ds.gds.beginTx
    }
    try {
      val ret = operation(ds)
      tx.success
      return ret
    } finally {
      tx.finish
    }
  }

Coming from years of JavaScript and Python, it was tough getting my head around this type signature. The syntax alone is daunting. But what I think it comes down to is this:

  • withTx takes a type parameter, T, which can be any subtype (<:) of Any.
  • It takes an argument, operation, which must be a function from something of type DatabaseService to something of type T.
  • It returns type T.

In practice, this means that the function can be called in a way that’s agnostic to the return type of its argument. But what is this DatabaseService argument?

In neo4j-scala, DatabaseService is a trait that wraps a Neo4j GraphDatabaseService. Then a GraphDatabaseServiceProvider wraps the DatabaseService. Application code is, as far as I can tell, expected to inherit from both Neo4jWrapper, which handles the syntactic sugar, and a GraphDatabaseServiceProvider, which provides the context for the sugar.

Which means that somewhere deep in the structure of our main object there is a DatabaseService that has real ultimate power over the database. withTx will find it for us, but we need to send it an operation that binds to it.
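To make that concrete, here’s a toy class of my own (the method names are made up): the operation binds to whatever DatabaseService withTx digs up, and the same withTx happily returns a Node in one place and a String in another.

import org.neo4j.graphdb.Node
import org.neo4j.scala.{EmbeddedGraphDatabaseServiceProvider, Neo4jWrapper}

class WithTxSketch extends Neo4jWrapper with EmbeddedGraphDatabaseServiceProvider {
  val neo4jStoreDir = "var/graphdb"

  // T is inferred as Node here...
  def makeNode(): Node = withTx { neo => neo.gds.createNode() }

  // ...and as String here, with the same withTx.
  def greeting(): String = withTx { neo => "no database work at all" }
}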

neo4j-scala also provides this helpful method, which operates in the context where that DatabaseService is available:

  def createNode(implicit ds: DatabaseService): Node = ds.gds.createNode

createNode’s argument is implicit and so is plucked, otherwise unbidden, from its environment. And since Scala lets you leave off an argument list that is entirely implicit, we can shorten the code further.

    withTx {
      implicit neo =>
          first = createNode
          first.setProperty("name","first")

          second = createNode
          second.setProperty("name","second")

          println("added nodes")

          // uses neo4j-scala special syntax
          first --> "isRelatedTo" --> second
    }

Notice that I had to put an implicit before neo in this code. When I didn’t, I got this error:

[error] /home/sb/dev/krow/src/main/scala/Krow.scala:23: /
 could not find implicit value for parameter ds: org.neo4j.scala.DatabaseService
[error]           first = createNode

What I think is happening is that in order to make the DatabaseService, neo, available as an implicit argument of the createNode method, we have to mark it as available with the implicit keyword.

See this page for reference:

The actual arguments that are eligible to be passed to an implicit parameter fall into two categories:

* First, eligible are all identifiers x that can be accessed at the point of the method call without a prefix and that denote an implicit definition or an implicit parameter.
* Second, eligible are also all members of companion modules of the implicit parameter’s type that are labeled implicit.
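To see that first rule in isolation, here’s a toy with no Neo4j in it at all (all the names are made up):

object ImplicitParamSketch {
  def createThing(implicit label: String): String = "thing: " + label

  def withLabel[T](operation: String => T): T = operation("generated-label")

  // Marking the lambda's parameter implicit makes it eligible to fill
  // createThing's implicit parameter.
  val works = withLabel { implicit label => createThing }

  // val fails = withLabel { label => createThing }
  // error: could not find implicit value for parameter label: String
}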

The other interesting thing going on here is this line:

          first --> "isRelatedTo" --> second

This makes a Neo4j relationship between first and second of type “isRelatedTo.”

I have no idea how the code that makes this happen works. Looking at it hurts my head. I think there may be black magic involved.
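My best guess at the general shape of the trick is an implicit conversion to a little wrapper class that defines a --> method. This is not neo4j-scala’s actual code, just a sketch with made-up names:

import org.neo4j.graphdb.{DynamicRelationshipType, Node}

object ArrowSyntaxSketch {
  class NodeArrow(start: Node) {
    // first --> "isRelatedTo" returns a value that remembers the start node...
    def -->(relName: String) = new HalfArrow(start, relName)
  }

  class HalfArrow(start: Node, relName: String) {
    // ...and the second --> creates the actual relationship.
    def -->(end: Node): Unit =
      start.createRelationshipTo(end, DynamicRelationshipType.withName(relName))
  }

  // The implicit conversion is what lets a plain Node grow a --> method.
  implicit def node2arrow(n: Node): NodeArrow = new NodeArrow(n)
}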

This has been slow going, since I’m learning as I’m going. I’m not done yet, though. The code I’m converting had some code to do a short traversal between my two nodes, printing their names. I’m going to leave that to Part 2.

Neo4j and Scala hacking notes

This week is FOSS4G. Though it has nothing in particular to do with geospatial (…yet), I’ve started hacking around with the graph database Neo4j in Scala, because I’m convinced both are the future. I’ve had almost no experience with either.

Dwins kindly held my hand through this process. He knows a hell of a lot about Scala and guided me through how some of the language features could help me work with the Neo4j API. In this post, I will try to describe the process and problems we ran into and parrot his explanations.

I wrote some graphical hello world code to test things out in a file called Krow.scala (don’t ask). I’ll walk you through it:

import org.neo4j.kernel.EmbeddedGraphDatabase
import org.neo4j.graphdb._
import collection.JavaConversions._

I wanted to code against an embedded database, rather than code against the Neo4j server, because I have big dreams of combining Neo4j with some other web framework and don’t like having to start and stop databases. So I needed EmbeddedGraphDatabase, which implements the GraphDatabaseService interface, and persists its data to a directory of files.

I’ll talk about the JavaConversions bit later.

object Krow extends Application {

I am a lazy programmer who only bothers to encapsulate things into software architecture at the last minute. I’m also spoiled by Python and JavaScript and intimidated by the idea of code compilation. So initially I wanted to write this as an interpreted script so I wouldn’t have to think about it. But I’ve heard great things about sbt (simple-build-tool) so I figured I’d try it out.

Using sbt was definitely worth it, if only because it is really well documented and starting up my project with it got me back into the mindset of Java development enough to get Dwins to explain Maven repositories to me again. Adding dependencies to an sbt project involves writing Scala itself, which is a nice way to ease into the language.

But running my project in sbt meant I needed a main method on my lame-o script. Ugh. That sounded like too much work for me, and args: Array[String] looks ugly and typing it ruins my day.

Dwins recommended I try using Scala’s Application trait. He explained that this would take code from an object’s body and do some magic to turn it into a main method. Rock on!

Of course, I didn’t bother to check the documentation or anything. Otherwise, I would have seen this:

The Application trait can be used to quickly turn objects into executable programs, but is not recommended.

For a language that is so emphatically Correct in its design, I give a lot of credit to whoever it was that had the balls to include this language feature so that newbs could hang themselves on it. If they hadn’t, I wouldn’t have had to confront hard truths about threading. (That’s foreshadowing)

  println("start")

  val neo: GraphDatabaseService = new EmbeddedGraphDatabase("var/graphdb")

Sweet, a database I don’t have to start and stop on the command line! This var/graphdb directory is made in the directory in which I run the program (for me, using sbt run).

Next:

  var tx: Transaction = neo.beginTx()

  var first : Node = null
  var second : Node = null

  try {
    first = neo.createNode()
    first.setProperty("name","first")

    second = neo.createNode()
    second.setProperty("name","second")

    first.createRelationshipTo(second, "isRelatedTo")

    tx.success()
  } finally {
    tx.finish()
    println("finished transaction 1")
  }

What I’m trying to do with this code is make two nodes and a relationship between them. Should be simple.

But it turns out that with Neo4j, all modifications to the database have to be done in a transaction context, and for that you have to do this business of creating a new Transaction:

A programmatically handled transaction. All modifying operations that work with the node space must be wrapped in a transaction. Transactions are thread confined. Transactions can either be handled programmatically, through this interface, or by a container through the Java Transaction API (JTA). The Transaction interface makes handling programmatic transactions easier than using JTA programmatically. Here’s the idiomatic use of programmatic transactions in Neo4j:

 Transaction tx = graphDb.beginTx();
 try
 {
        ... // any operation that works with the node space
     tx.success();
 }
 finally
 {
     tx.finish();
 }
 

No big deal.

This bit of code was a chance for Dwins to show me a Scala feature that makes the language shine. Check out this line:

first.createRelationshipTo(second, "isRelatedTo")

If you check the documentation for this method, you can see that I’m not using it as the signature expects. The Java type signature is:

Relationship createRelationshipTo(Node otherNode, RelationshipType type)

where RelationshipType is a Neo4j concept that’s what it sounds like. I suppose it is important to set it apart from mere Properties for performance on traversals or something. RelationshipTypes can be created dynamically and seem to more or less exist in the ether, but you need to provide them when you create a relationship. All relationships are of a type.

In terms of their data content, though, RelationshipTypes are just wrappers around strings. Rather than doing this wrapping on the same line where I create the relationship, Scala lets me establish a conversion from strings to RelationshipTypes in an elegant way.

You see, I lied. The above code would not have compiled had I not also included this earlier in the object’s definition:

  implicit def string2relationshipType(x: String) = DynamicRelationshipType.withName(x)

This code uses Scala’s implicit conversions to define a conversion between Strings and RelationshipTypes.

DynamicRelationshipType.withName(x) is one of Neo4j’s ways of making a new RelationshipType. Scala’s type inference means that the compiler knows that string2relationshipType returns a RelationshipType.

Since I used the implicit keyword, Scala knows that when a String is used in a method that expects a RelationshipType, it can use this function to convert it on the fly.

Check out all that majesty. Thanks, Scala!

Ok, so now I want to show that I was actually able to get something into the database. So here’s my node traversal and printing code.

  tx = neo.beginTx()

  try{
    
    val trav : Traverser = first.traverse(Traverser.Order.BREADTH_FIRST,
                                          StopEvaluator.END_OF_GRAPH,
                                          ReturnableEvaluator.ALL,
                                          "isRelatedTo",
                                          Direction.BOTH)

    for(node <- trav){
      println(node.getProperty("name"))
    }
    tx.success()
  } finally {
    tx.finish()
    println("finished transaction 2")
  }

  neo.shutdown()

  println("done")

}

Two observations:

  • traverse takes a lot of arguments, most of which seem to be these awkwardly specified static variables. I bet there’s a way to use Scala features to wrap that and make it more elegant (a hypothetical sketch of what I mean follows this list).
  • Check out that for loop. Concise syntax that takes an iterator. There’s one catch: Traverser is a java.lang.Iterable, whereas the loop syntax requires a scala.collection.Iterable. Remember that import scala.collection.JavaConversions._ line? That imported an implicit conversion from Java to Scala iterables.
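On that first point, here is a guess at the kind of wrapper I have in mind, entirely hypothetical, using default arguments to hide the awkward flags:

import org.neo4j.graphdb._

object TraversalSugar {
  def related(start: Node,
              relType: RelationshipType = DynamicRelationshipType.withName("isRelatedTo"),
              order: Traverser.Order = Traverser.Order.BREADTH_FIRST): Traverser =
    start.traverse(order, StopEvaluator.END_OF_GRAPH, ReturnableEvaluator.ALL,
                   relType, Direction.BOTH)
}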

All in all, pretty straightforward stuff, I thought. Here’s what I got when I used sbt to run this project:

> run
[info] Compiling 1 Scala source to /home/sb/dev/krow/target/scala-2.9.1.final/classes...
[warn] there were 1 deprecation warnings; re-run with -deprecation for details
[warn] one warning found
[info] Running Krow 
start
finished transaction 1
finished transaction 2

That’s not what I wanted! Not only did I not get any printed acknowledgement of the nodes that I had made in the database, but the program hangs and doesn’t finish.

What the hell?!

When I ask Dwins about it, he tells me sagely about threads. Transactions need to be run in a single thread. The Application trait does a lot of bad stuff with threads. To be technically specific about it, it does…some really bad stuff with threads. I thought I had a handle on it when I started writing this blog post, but instead I’m just going to copy/paste from the Application trait docs, which I should have read in the first place.

In practice the Application trait has a number of serious pitfalls:

* Threaded code that references the object will block until static initialization is complete. However, because the entire execution of an object extending Application takes place during static initialization, concurrent code will always deadlock if it must synchronize with the enclosing object.

Oh. Huh. That’s interesting.

It is recommended to use the App trait instead.

Now you’re talking. Let me just change that line to object Krow extends App { and I’ll be cooking in no…

> run
[info] Compiling 1 Scala source to /home/sb/dev/krow/target/scala-2.9.1.final/classes...
[warn] there were 1 deprecation warnings; re-run with -deprecation for details
[warn] one warning found
[info] Running Krow 
start
finished transaction 1
finished transaction 2

…time.

God dammit. There’s something else about App, which also runs all the object code at initialization, that is causing a problem, I guess. I asked Dwins what he thought.

Too much magic.

I guess I’m going to have to write a main method after all.


After some further messing around with the code, I have something that runs and prints the desired lines.

While the code would compile, I wound up having to explicitly name the RelationshipType type in the calls where I was trying to implicitly convert the strings; otherwise I got exceptions like this:

java.lang.IllegalArgumentException: Expected RelationshipType at var args pos 0, found isRelatedTo

Does that make it an explicit conversion?

Overall, hacking around with this makes me excited about both Scala and Neo4j despite the setbacks and wrangling.

Complete working code appended below.


import org.neo4j.kernel.EmbeddedGraphDatabase
import org.neo4j.graphdb._
import collection.JavaConversions._

object Krow {

  println("start")

  def main(args: Array[String]){

    var first : Node = null
    var second : Node = null

    val neo: GraphDatabaseService = new EmbeddedGraphDatabase("var/graphdb")
    var tx: Transaction = neo.beginTx()
      
    implicit def string2relationshipType(x: String) = DynamicRelationshipType.withName(x)

    try {
      first = neo.createNode()
      first.setProperty("name","first")
    
      second = neo.createNode()
      second.setProperty("name","second")

      first.createRelationshipTo(second, "isRelatedTo" : RelationshipType)
      
      tx.success()
      println("added nodes")
    } catch {
      case e: Exception => println(e)
    } finally {
      tx.finish() // wrap in try, finally   
      
      println("finished transaction 1")
    }

    tx = neo.beginTx()

    try{
    
      val trav : Traverser = first.traverse(Traverser.Order.BREADTH_FIRST,
                                            StopEvaluator.END_OF_GRAPH,
                                            ReturnableEvaluator.ALL,
                                            "isRelatedTo" : RelationshipType,
                                            Direction.BOTH)

      for(node <- trav){
        println(node.getProperty("name"))
      }
      tx.success()
    } finally {
      tx.finish()
      println("finished transaction 2")
    }

    neo.shutdown()

    println("done")
  }
}