picking a data backend for representing email in #python
by Sebastian Benthall
I’m at a difficult crossroads with BigBang where I need to pick an appropriate data storage backend for my preprocessed mailing list data.
There are a lot of different aspects to this problem.
The first and most important consideration is speed. If you know anything about computer science, you know that it exists to quickly execute complex tasks that would take too long to do by hand. It’s odd writing that sentence since computational complexity considerations are so fundamental to algorithm design that this can go unspoken in most technical contexts. But since coming to grad school I’ve found myself writing for a more diverse audience, so…
The problem I’m facing is that in doing exploratory data analysis, I do not know all the questions I am going to ask yet. But any particular question will be impractical to ask unless I tune the underlying infrastructure to answer it. This chicken-and-egg problem means that the process of inquiry is necessarily constrained by the engineering options that are available.
This is not new in scientific practice. Notoriously, the field of economics in the 20th century was shaped by what was analytically tractable as formal, mathematical results. The nuance of contemporary modeling of complex systems is due largely to the fact that we now have computers to do this work for us. That means we can still have the intersubjectively verified rigor that comes with mathematization without trying to fit square pegs into round holes. (Side note: something mathematicians acknowledge that others tend to miss is that mathematics is based on dialectic proof and intersubjective agreement. This makes it much closer epistemologically to something like history as a discipline than it is to technical fields dedicated to prediction and control, like chemistry or structural engineering. Computer science is in many ways an extension of mathematics. Obviously, these formalizations are then applied to great effect. Their power comes from their deep intersubjective validity–in other words, their truth. Disciplines that have dispensed with intersubjective validity as a grounds for truth claims in favor of a more nebulous sense of diverse truths in a manifold of interpretation have difficulty understanding this and so are likely to see the institutional gains of computer scientists to be a result of political manipulation, as opposed to something more basic: mastery of nature, or more provacatively, use of force. This disciplinary disfunction is one reason why these groups see their influence erode.)
For example, I have determined that in order to implement a certain query on the data efficiently, it would be best if another query were constant time. One way to do this is to use a database with an index.
However, setting up a database is something that requires extra work on the part of the programmer and so makes it harder to reproduce results. So far I have been keeping my processed email data “in memory” after it is pulled from files on the file system. This means that I have access to the data within the programming environment I’m most comfortable with, without depending on an external or parallel process. Fewer moving parts means that it is simpler to do my work.
So there is a tradeoff between the computational time of the software as it executes and the time and attention is takes me (and others that want to reproduce my results) to set up the environment in which the software runs. Since I am running this as an open source project and hope others will build on my work, I have every reason to be lazy, in a certain sense. Every inconvenience I suffer is one that will be suffered by everyone that follows me. There is a Kantian categorical imperative to keep things as simple as possible for people, to take any complex procedure and replace it with a script, so that others can do original creative thinking, solve the next problem. This is the imperative that those of us embedded in this culture have internalized. (G. Coleman notes that there are many cultures of hacking; I don’t know how prevalent these norms are, to be honest; I’m speaking from my experience) It is what makes this social process of developing our software infrastructure a social one with a modernist sense of progress. We are part of something that is being built out.
There are also social and political considerations. I am building this project intentionally in a way that is embedded within the Scientific Python ecosystem, as they are also my object of study. Certain projects are trendy right now, and for good reason. At the Python Worker’s Party at Berkeley last Friday, I saw a great presentation of Blaze. Blaze is a project that allows programmers experienced with older idioms of scientific Python programming to transfer their skills to systems that can handle more data, like Spark. This is exciting for the Python community. In such a fast moving field with multiple interoperating ecosystems, there is always the anxiety that ones skills are no longer the best skills to have. Has your expertise been made obsolete? So there is a huge demand for tools that adapt one way of thinking to a new system. As more data has become available, people have engineered new sophisticated processing backends. Often these are not done in Python, which has a reputation for being very usable and accessible but slow to run in operation. Getting the usable programming interface to interoperate with the carefully engineered data backends is hard work, work that Matt Rocklin is doing while being paid by Continuum Analytics. That is sweet.
I’m eager to try out Blaze. But as I think through the questions I am trying to ask about open source projects, I’m realizing that they don’t fit easily into the kind of data processing that Blaze currently supports. Perhaps this is dense on my part. If I knew better what I was asking, I could maybe figure out how to make it fit. But probably, what I’m looking at is data that is not “big”, that does not need the kind of power that these new tools provide. Currently my data fits on my laptop. It even fits in memory! Shouldn’t I build something that works well for what I need it for, and not worry about scaling at this point?
But I’m also trying to think long-term. What happens if an when it does scale up? What if I want to analyze ALL the mailing list data? Is that “big” data?
“Premature optimization is the root of all evil.” – Donald Knuth