Digifesto

Tag: programming

“To be great is to be misunderstood.”

A foolish consistency is the hobgoblin of little minds, adored by little statesmen and philosophers and divines. With consistency a great soul has simply nothing to do. He may as well concern himself with his shadow on the wall. Speak what you think now in hard words, and to-morrow speak what to-morrow thinks in hard words again, though it contradict every thing you said to-day. — ‘Ah, so you shall be sure to be misunderstood.’ — Is it so bad, then, to be misunderstood? Pythagoras was misunderstood, and Socrates, and Jesus, and Luther, and Copernicus, and Galileo, and Newton, and every pure and wise spirit that ever took flesh. To be great is to be misunderstood.
– Emerson, Self-Reliance

Lately, in my serious scientific work, I’ve again found myself bumping up against the limits of intelligibility. This time it is intelligibility from within a technical community: one group of scientists who are, I’ve been advised, unfamiliar with another, different technical formalism. As a new entrant, I believe the latter formalism would be useful for understanding the domain of the former group. But to pursue this, especially in the context of funders (who need to explain things to their own bosses in very concrete terms), would be unproductive, a waste of precious time.

Reminded by recent traffic of some notes I wrote long ago in frustration at Hannah Arendt, I found something apt about her comments. Science in the mode of what Kuhn calls “normal science” must be intelligible to itself and its benefactors. But that is all. It need not be generally intelligible to other scientists; it need not understand other scientists. It need only be a specialized and self-sustaining practice, a discipline.

Programming (which I still study) is actually quite different from science in this respect. Because software code is a medium programmers use to communicate with each other, and because code is foremost interpreted by a compiler, one relates as a programmer to other programmers differently than scientists relate to other scientists. To some extent the productive formal work has moved over into software, leaving science to be less formal and more empirical. In my anecdotal experience, this is now true even in computer science, once one of the bastions of formalism.

Arendt’s criticism of scientists, that they should be politically distrusted because “they move in a world where speech has lost its power”, is therefore not precisely true, because scientific operations are certainly mediated by language.

But this is normal science. Perhaps the scientists whom Arendt distrusted politically were not normal scientists, but rather the sorts of scientists responsible for scientific revolutions. These scientists must not have used language that was readily understood by their peers, at least initially, because they were creating new concepts, new ideas.

Perhaps these kinds of scientists are better served by existentialism, as in Nietzsche’s brand, as an alternative to politics. Or by Emerson’s transcendentalism, which Sloterdijk sees as very spiritually kindred to Nietzsche but more balanced.

picking a data backend for representing email in #python

I’m at a difficult crossroads with BigBang where I need to pick an appropriate data storage backend for my preprocessed mailing list data.

There are a lot of different aspects to this problem.

The first and most important consideration is speed. If you know anything about computer science, you know that it exists to quickly execute complex tasks that would take too long to do by hand. It’s odd writing that sentence since computational complexity considerations are so fundamental to algorithm design that this can go unspoken in most technical contexts. But since coming to grad school I’ve found myself writing for a more diverse audience, so…

The problem I’m facing is that in doing exploratory data analysis, I do not know all the questions I am going to ask yet. But any particular question will be impractical to ask unless I tune the underlying infrastructure to answer it. This chicken-and-egg problem means that the process of inquiry is necessarily constrained by the engineering options that are available.

This is not new in scientific practice. Notoriously, the field of economics in the 20th century was shaped by what was analytically tractable as formal, mathematical results. The nuance of contemporary modeling of complex systems is due largely to the fact that we now have computers to do this work for us. That means we can still have the intersubjectively verified rigor that comes with mathematization without trying to fit square pegs into round holes. (Side note: something mathematicians acknowledge that others tend to miss is that mathematics is based on dialectic proof and intersubjective agreement. This makes it much closer epistemologically to something like history as a discipline than to technical fields dedicated to prediction and control, like chemistry or structural engineering. Computer science is in many ways an extension of mathematics. Obviously, these formalizations are then applied to great effect. Their power comes from their deep intersubjective validity–in other words, their truth. Disciplines that have dispensed with intersubjective validity as a ground for truth claims, in favor of a more nebulous sense of diverse truths in a manifold of interpretation, have difficulty understanding this and so are likely to see the institutional gains of computer scientists as a result of political manipulation, as opposed to something more basic: mastery of nature, or more provocatively, use of force. This disciplinary dysfunction is one reason why these groups see their influence erode.)

For example, I have determined that in order to implement a certain query on the data efficiently, it would be best if another query were constant time. One way to do this is to use a database with an index.
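
The idea is language-agnostic; here is a minimal sketch in Scala, with a hypothetical Message type (the field names are illustrative, not BigBang’s actual schema), using an in-memory map as the index:

case class Message(id: String, sender: String, body: String)

object MessageIndex {
  val messages: Seq[Message] = Seq(Message("m1", "a@example.com", "hi"))

  // Building the index is a one-time O(n) pass...
  val byId: Map[String, Message] = messages.map(m => m.id -> m).toMap

  // ...after which each lookup is expected constant time,
  // instead of a linear scan over all messages.
  def lookup(id: String): Option[Message] = byId.get(id)
}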

However, setting up a database is something that requires extra work on the part of the programmer and so makes it harder to reproduce results. So far I have been keeping my processed email data “in memory” after it is pulled from files on the file system. This means that I have access to the data within the programming environment I’m most comfortable with, without depending on an external or parallel process. Fewer moving parts means that it is simpler to do my work.

So there is a tradeoff between the computational time of the software as it executes and the time and attention it takes me (and others who want to reproduce my results) to set up the environment in which the software runs. Since I am running this as an open source project and hope others will build on my work, I have every reason to be lazy, in a certain sense. Every inconvenience I suffer is one that will be suffered by everyone who follows me. There is a Kantian categorical imperative to keep things as simple as possible for people, to take any complex procedure and replace it with a script, so that others can do original creative thinking, solve the next problem. This is the imperative that those of us embedded in this culture have internalized. (G. Coleman notes that there are many cultures of hacking; I don’t know how prevalent these norms are, to be honest; I’m speaking from my experience.) It is what makes this process of developing our software infrastructure a social one, with a modernist sense of progress. We are part of something that is being built out.

There are also social and political considerations. I am building this project intentionally in a way that is embedded within the Scientific Python ecosystem, as they are also my object of study. Certain projects are trendy right now, and for good reason. At the Python Worker’s Party at Berkeley last Friday, I saw a great presentation of Blaze. Blaze is a project that allows programmers experienced with older idioms of scientific Python programming to transfer their skills to systems that can handle more data, like Spark. This is exciting for the Python community. In such a fast-moving field with multiple interoperating ecosystems, there is always the anxiety that one’s skills are no longer the best skills to have. Has your expertise been made obsolete? So there is a huge demand for tools that adapt one way of thinking to a new system. As more data has become available, people have engineered new, sophisticated processing backends. Often these are not written in Python, which has a reputation for being very usable and accessible but slow in operation. Getting the usable programming interface to interoperate with the carefully engineered data backends is hard work, work that Matt Rocklin is doing while being paid by Continuum Analytics. That is sweet.

I’m eager to try out Blaze. But as I think through the questions I am trying to ask about open source projects, I’m realizing that they don’t fit easily into the kind of data processing that Blaze currently supports. Perhaps I’m being dense. If I knew better what I was asking, I could maybe figure out how to make it fit. But probably, what I’m looking at is data that is not “big”, data that does not need the kind of power these new tools provide. Currently my data fits on my laptop. It even fits in memory! Shouldn’t I build something that works well for what I need, and not worry about scaling at this point?

But I’m also trying to think long-term. What happens if and when it does scale up? What if I want to analyze ALL the mailing list data? Is that “big” data?

“Premature optimization is the root of all evil.” – Donald Knuth

@#$%! : variance annotations in Scala’s unsound parameterized types

[error] /home/sb/ischool/cs294/hw3/src/main/scala/TestScript.scala:32: type mismatch;
[error] found : Array[wikilearn.TestScript.parser.Page]
[error] required: Array[wikilearn.WikiParser#Page]
[error] Note: wikilearn.TestScript.parser.Page <: wikilearn.WikiParser#Page, but class Array is invariant in type T.
[error] You may wish to investigate a wildcard type such as `_ <: wikilearn.WikiParser#Page`. (SLS 3.2.10)

wtf, Scala.  You know exactly what I’m trying to do here.

EDIT: I sent a link to the above post to David Winslow. He responded with a crystal-clear explanation that was so great I asked him if I could include it here. This is it, below:

It’s a feature, not a bug :) This is actually the specific issue that Dart had in mind when they put this note in the language spec:

The type system is unsound, due to the covariance of generic types. This is a deliberate choice (and undoubtedly controversial). Experience has shown that sound type rules for generics fly in the face of programmer intuition. It is easy for tools to provide a sound type analysis if they choose, which may be useful for tasks like refactoring.

Which of course caused some hubbub among the static typing crowd.

The whole issue comes down to the variance annotations of type parameters. Variance influences how the subtyping of type parameters relates to the subtyping of the parameterized types (a short sketch in code follows the list):

Given types A and B, where A is a supertype of B:
trait Invariant[T] means there is no subtype relationship between Invariant[A] and Invariant[B]. (Either could be used as an Invariant[_], though.)
trait Covariant[+T] means Covariant[A] is a supertype of Covariant[B].
trait Contravariant[-T] means Contravariant[A] is a subtype of Contravariant[B].
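
A minimal sketch of all three cases, using the traits above (the object name is just for illustration; the commented-out line is the one scalac rejects):

trait Invariant[T]
trait Covariant[+T]
trait Contravariant[-T]

object VarianceSketch {
  class A
  class B extends A // A is a supertype of B

  // Covariant: a Covariant[B] may be used as a Covariant[A].
  val co: Covariant[A] = new Covariant[B] {}

  // Contravariant: a Contravariant[A] may be used as a Contravariant[B].
  val contra: Contravariant[B] = new Contravariant[A] {}

  // Invariant: neither direction compiles...
  // val inv: Invariant[A] = new Invariant[B] {}
  // ...but the wildcard accepts either:
  val wild: Invariant[_] = new Invariant[B] {}
}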

The basic rule of thumb is that if you produce values of type T, you can be covariant in T, and if you consume values of type U, you can be contravariant in U. For example, Function1 has two type parameters, the parameter type A and the result type T. It is contravariant in A and covariant in T. An (Any => String) can be used where a (String => Any) is expected, but not the other way around.
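
In code (a quick sketch, standard library types only):

object FunctionVariance {
  val anyToString: Any => String = _.toString

  // Function1 is contravariant in its parameter and covariant in its result,
  // so an (Any => String) can stand in wherever a (String => Any) is expected:
  val stringToAny: String => Any = anyToString

  // The reverse direction is rejected by the compiler:
  // val bad: Any => String = (s: String) => s
}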

So, what about the type parameter for Array[T]? Among other operations, Arrays provide:

class Array[T] {
  def apply(i: Int): T // "producing" a T
  def update(i: Int, t: T): Unit // "consuming" a T
}

When the type parameter appears in both covariant and contravariant positions, the only option is to make it invariant.

Now, it’s interesting to note that in the Java language Arrays are treated as if they are covariant. This means that you can write a Java program that doesn’t use casts, passes the typechecker, and generates a type error at runtime; the body of main() would look like:

String[] strings = new String[1];
Object[] objects = strings;
objects[0] = Integer.valueOf(0); // the runtime error (an ArrayStoreException) occurs at this step, but even if it didn't:
System.out.println(strings[0]); // what happens here?
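
For contrast, a minimal sketch of the same assignment in Scala, where the invariance of Array stops the program at compile time; this is exactly the rule behind the error at the top of this post:

object SoundArrays {
  val strings: Array[String] = new Array[String](1)

  // Does not compile: Array is invariant in T,
  // so an Array[String] is not an Array[AnyRef].
  // val objects: Array[AnyRef] = strings
}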

Anyway, the upshot is that immutable collections only use their type parameters in covariant positions (you can get values out, but never insert), so they are much handier. Does your code work better if you replace your usage of Array with Vector? Alternatively, you can always provide the type parameter when you construct your array. Array("") is an Array[String], but Array[AnyRef]("") is an Array[AnyRef].
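
A minimal sketch of both suggestions (standard library behavior only):

object CovariantCollections {
  // Vector[+A] is covariant, so a Vector[String] can be used as a Vector[Any]:
  val strings: Vector[String] = Vector("a", "b")
  val anys: Vector[Any] = strings

  // Array stays invariant, but the element type can be chosen up front:
  val inferred = Array("")                    // inferred as Array[String]
  val refs: Array[AnyRef] = Array[AnyRef]("") // explicit type parameter
}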