I’m studying software development and not social media for my dissertation.
That’s a bit of a false dichotomy. Much software development happens through social media.
Which is really the point–that software development is a computer-mediated social process.
What’s neat is that it’s a computer-mediated social process that, at its best, creates the conditions for its own continuation as a social process. Cf. Kelty’s “recursive public.”
What’s also neat is that this is a significant kind of labor that is not easy to think about given the tools of neoclassical economics or anything else really.
In particular I’m focusing on the development of scientific software, i.e. software that’s made and used to improve our scientific understanding of the natural world and each other.
The data I’m looking at is communications data between developers and their users. I’m including the code itself, under version control, as part of this. In addition to being communication between developers, you might think of source code as communication between developers and machines, and the process of writing code as a collaboration or conversation between people and machines.
There is a lot of this data so I get to use computational techniques to examine it. “Data science,” if you like.
But it’s also legible, readable data with readily accessible human narrative behind it. As I debug my code, I am reading the messages sent ten years ago on a mailing list. Characters begin to emerge serendipitously because their email signatures break my archive parser. I find myself Googling them. “Who is that person?”
One email I found while debugging stood out because it was written, evidently, by a woman. Given the current press on diversity in tech, I thought it was an interesting example from 2001:
From sag at hydrosphere.com Thu Nov 29 15:21:04 2001
From: sag at hydrosphere.com (Sue Giller)
Date: Thu Nov 29 15:21:04 2001
Subject: [Numpy-discussion] Re: Using Reduce with Multi-dimensional Masked array
In-Reply-To: <000201c17917$ac5efec0$3d01a8c0@plstn1.sfba.home.com>
References: <20011129174809062.AAA210@mail.climatedata.com@SUEW2000>
Message-ID: <20011129232011546.AAA269@mail.climatedata.com@SUEW2000>

Paul,

Well, you’re right. I did misunderstand your reply, as well as what
the various functions were supposed to do. I was mis-using the
sum, minimum, maximum as tho they were MA..reduce, and
my test case didn’t point out the difference. I should always have
been doing the .reduce version.

I apologize for this!

I found a section on page 45 of the Numerical Python text (PDF
form, July 13, 2001) that defines sum as

‘The sum function is a synonym for the reduce method of the add
ufunc. It returns the sum of all the elements in the sequence given
along the specified axis (first axis by default).’

This is where I would expect to see a caveat about it not retaining
any mask-edness.

I was misussing the MA.minimum and MA.maximum as tho they
were .reduce version. My bad.

The MA.average does produce a masked array, but it has changed
the ‘missing value’ to fill_value=[ 1.00000002e+020,]). I do find this
a bit odd, since the other reductions didn’t change the fill value.

Anyway, I can now get the stats I want in a format I want, and I
understand better the various functions for array/masked array.

Thanks for the comments/input.

sue
I am trying to approach this project as a quantitative scientist. But the process of developing the software for analysis is putting me in conversation not just with the laptop I run the software on, but also with the data. The data is a quantified representation–I count the number of lines, even the number of characters in a line as I construct the regular expression needed to parse the headers properly–but it represents a conversation in the past. As I write the software, I consult documentation written through a process not unlike the one I am examining, as well as Stack Overflow posts written by others who have tried to perform similar tasks. And now I am writing a blog post about this work. I will tweet a link to this post out to my followers; I know some people from the Scientific Python community that I am studying follow me on Twitter. Will one of them catch wind of this post? What will they think of it?
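Concretely, the parsing looks something like this. This is a minimal sketch in the spirit of my parser rather than the thing itself, assuming a pipermail-style text archive like the one the email above came from; all the names here are illustrative.

```python
import re

# Pipermail-style archives separate messages with lines like:
#   From sag at hydrosphere.com Thu Nov 29 15:21:04 2001
# Note the " at " obfuscation in place of "@".
FROM_LINE = re.compile(
    r"^From \S+ at \S+ "          # obfuscated sender address
    r"\w{3} \w{3} [ \d]\d "       # weekday, month, space-padded day
    r"\d{2}:\d{2}:\d{2} \d{4}$"   # time and year
)

def split_messages(archive_text):
    """Split a raw pipermail archive into per-message chunks."""
    messages, current = [], []
    for line in archive_text.splitlines():
        # A naive split: anything in a message body that happens to
        # look like a separator (a quoted reply, an odd signature)
        # will break this. That is how the characters emerge.
        if FROM_LINE.match(line) and current:
            messages.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        messages.append("\n".join(current))
    return messages
```

Per-message header parsing (From:, Date:, Subject:) then proceeds with further expressions of the same kind.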
autocatalysis sustains autopoiesis
“Data science” doesn’t refer to any particular technique.
It refers to the cusp of the diffusion of computational methods from computer science, statistics, and applied math (the “methodologists”) to other domains.
The background theory of these disciplines–whose origin we can trace at least as far back as cybernetics research in the 1940s–is required to understand the validity of these “data science” technologies as scientific instruments, just as a theory of optics is necessary to know the validity of what is seen through a microscope. Kuhn calls these kinds of theoretical commitments “instrumental commitments.”
For most domain sciences, instrumental commitment to information theory, computer science, etc. is not problematic. It is more so for some social sciences, which contest the validity of totalizing physics or formalism.
There aren’t a lot of them left, because our mobile phones more or less instrumentally commit us to the cybernetic worldview. Where there is room for alternative metaphysics, it is because of the complexity of emergent/functional properties of the cybernetic substrate. Brier’s Cybersemiotics is one formulation of how richer communicative meaning can be seen as an evolved structure on top of cybernetic information processing.
If “software is eating the world” and we don’t want it to eat us (metaphorically! I don’t think the robots are going to kill us–I think that corporations are going to build robots that make our lives miserable by accident), then we are going to need software that understands us. That requires building out cybernetic models of human communication so that they better comprehend our social reality and what’s desirable in it.
That’s going to require cooperation between techies and humanists in a way that will be trying for both sides, but worth the effort, I think.
The feedback I got on my dissertation prospectus draft when I presented it to my colleagues was that I didn’t start with a problem and then argue from there how my dissertation was going to be about a solution.
That was really great advice.
The problem of problem selection is a difficult one. “What is a problem?” is a question that basically nobody asks. Lots of significant philosophical traditions maintain that it’s the perception of problems as problems that is the problem. “Just chill out,” say the Great Philosophical Traditions. This does not help one orient one’s dissertation research.
A lot of research is motivated by interest in particular problems like an engineering challenge or curing cancer. I’ve somehow managed never to acquire the kind of expertise that would allow me to address any of these specific useful problems directly. My mistake.
I’m a social scientist. There are a lot of social problems, right? Of course. However, there’s a problem here: identifying any problem as a problem in the social domain immediately implicates politics.
Are there apolitical social problems? I think I’ve found some. I had a great conversation last week with Anna Salamon about Global Catastrophic Risks. Those sound terrible! It echoes the work I used to do in support of Disaster Risk Reduction, except that there is more acknowledgment in the GCR space that some of the big risks are man-made.
So there’s a problem: arguably research into the solutions to these problems is good. On the other hand, that research is complicated by the political entanglement of the researchers, especially in the university setting. It took some convincing, but OK, those politics are necessarily part of the equation. Put another way, if there wasn’t the political complexity, then the hard problems wouldn’t be such hard problems. The hard problems are hard partly because they are so political. (This difference in emphasis is not meant to preclude other reasons why these problems are hard; for example, because people aren’t smart or motivated enough.)
These hard problems require collaboration across political lines, and the inherent politics of language choice and framing create complexity that is orthogonal to the problem’s solution (is it?). Given that this political complexity gets in the way of solving the problems efficiently, infrastructural solutions that manage it can be helpful.
(Counterclaim: the political complexity is not illogical complexity, rather scientific logic is partly political logic. We live in the best of all possible worlds. Just chill out. This is an empirical claim.)
The promise of computational methods for interdisciplinary collaboration is that they allow for more efficient distribution of cognitive labor across the system of investigators. Data science methodologists can build tools for investigation that work cross-disciplinarily, and the interaction between these tools can follow an apolitical logic in a way that discursive science cannot. Teleologically, we get an Internet of Scientific Things and autonomous scientific apparatus; draw your own eschatological conclusions.
An interesting consequence of algorithmically mediated communication is that you don’t actually need consensus to coordinate collective action. I suppose this is an argument Hayekians etc. have been making for a long time. However, the political maintenance of the system that ensures the appropriate incentive structures is itself prone to being hacked, and herein lies the problem. That, and the insufficiency of the total neurological market apparatus (in Hayek’s vision) to do anything like internalize the externalities of, e.g., climate change, while the Bitcoin servers burn and burn and burn.
So I am trying to write a dissertation prospectus. It is going…OK.
The dissertation is on Evaluating Data Science Environments.
But I’ve been getting very distracted by the politics of data science. I have been dealing with the politics by joking about them. But I think I’m in danger of being part of the problem, when I would rather be part of the solution.
So, where do I stand on this, really?
Here are some theses:
Sorry, that got polemical again.
What follows is the first draft of the introduction to my upcoming book, Data Science: It Gets the Truth! This book will be the popularized version of my dissertation, based on my experiences at the School of Information and UC Berkeley. I’m really curious to know what you think!
There are two kinds of scientists in the world: the truth haters, and the truth getters.
You can tell who is a truth hater by asking them: “With your work, are you trying to find something that’s true?”
A truth hater will tell you that there is no such thing as truth, or that the idea of truth is a problematic bourgeois masculinist social construct, or that truth is relative and so no, not exactly, they probably don’t mean the same thing as you do when you say ‘truth’.
Obviously, these people hate the truth. Hence, “truth haters.”
Then there are the truth getters. You ask a truth getter whether they are trying to discover the truth, and they will say “Hell yeah!” Or, more simply, “yes, that is correct.”
Truth getters love the truth. The truth is great; it’s the point of science. They get that. Hence, “truth getters.”
We are at an amazing, unique time in history. Here, at the dawn of the 21st century, we have very powerful computers and extraordinary networks of communication like never before. This means science is going through some unprecedented changes.
One of those changes is that scientists are realizing that they’ve been fighting about nothing for a long time. Scientists used to think they had to be different from each other in order to study different things. But now, we know that there is only one good way to study anything, and that is machine learning. Soon, all scientists are going to be data scientists, because science is discovering that all things can be represented as data and studied with machine learning.
Well, not all scientists. I should be more precise. I was just talking about the truth getters. Because machine learning is how we can discover the truth about everything, and truth getters get that.
Truth haters, on the other hand, hate how good machine learning is at discovering the truth about everything. Silly truth haters! One day, they will get their funding cut.
In this book, Data Science: It Gets the Truth!, you will learn how you too can be a data scientist and learn the truth about things. Get it? Great! Let’s go!
I’m facing a challenging paradox in how to approach my research.
On the one hand, we have the trend of increasing instrumentation of society. From quantified self to the Internet of things to Netflix clicks to the fully digitized archives of every newspaper, we have more data than we’ve ever had before to ask fundamental social scientific questions.
That should make it easier to research society and infer principles about how it works. But there is a long-standing counterpoint in the social sciences that claims that all social phenomena are sui generis and historically situated. If no social phenomenon generalizes, then it shouldn’t be possible to infer anything from the available data, no matter how much of it there is.
One view is that we should only be able to infer stuff that isn’t very interesting at all. One name for this view is “punctuated equilibrium.” The national borders of countries don’t move around…until they do. Regimes don’t change…until they do. It’s the ability to predict these kinds of political events that Philip Tetlock has called “expert political judgment.” The Good Judgment Project is a test to see what properties make a person or team of people good at this kind of task.
What now seems like many years ago, I wrote a book review of Tetlock’s book. In that review, I pointed out a facet of Tetlock’s research I found most compelling but underdeveloped: that the best predictors he found were algorithmic predictors that drew their conclusions from linear regressions over just the top three or so salient features in the data.
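To make that concrete, here is a minimal sketch of that kind of predictor, in the spirit of what Tetlock describes rather than his actual procedure: select a handful of salient features, fit an ordinary linear regression on them, and stop. The data below are synthetic.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for a forecasting dataset: 30 candidate
# features, of which only the first two actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Keep only the three most salient features (by univariate F-test),
# then fit a plain linear regression on them.
model = make_pipeline(SelectKBest(f_regression, k=3), LinearRegression())
model.fit(X, y)
print(model.score(X, y))  # in-sample R^2
```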
Six or so years later, Big Data is a powerful enough industrial and political phenomenon that academic social science feels it needs to catch up. But to a large extent, industrial data science is still about using pretty basic statistical models drawn from physics (models that assume everything stands in Gaussian relations to everything else, say), or otherwise applying a broad range of modeling techniques and aggregating them under statistical boosting. This is great for edging out the competition on selling ads.
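For contrast, a sketch of that aggregate-many-weak-models style of industrial practice, here using gradient boosting over shallow trees on synthetic data; it stands in for the kind of pipeline I mean, not any particular one.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic ad-click-style classification problem.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Boosting fits a sequence of shallow trees, each one correcting the
# residual errors of the ensemble built so far.
clf = GradientBoostingClassifier(n_estimators=200, max_depth=2)
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te))  # held-out accuracy
```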
But it tells us nothing about the underlying structure of what’s going on in society. And it’s possible that the fact that we haven’t done any better is really a condemnation of the whole process of social science in general. The data we are getting, rather than making us understand what’s going on around us better, is perhaps just proving to us that it’s a complex chaotic system. If so, the better we understand it, the more we will lose our confidence in our ability to predict it.
Historically, we’ve been through all this before. The mid-20th century saw the expansion of the scope of Norbert Wiener’s cybernetics from the electrical engineering of homeostatic machines to the modeling of the political system and the economy as complex feedback systems. Indeed, cybernetics was intended as a theory of steering systems by thinking about their communication mechanisms. (Wikipedia: “The word ‘cybernetics’ comes from the Greek word κυβερνητική (kyverni̱tikí̱, ‘government’), i.e. all that are pertinent to κυβερνώ (kyvernó̱), the latter meaning to ‘steer,’ ‘navigate’ or ‘govern,’ hence κυβέρνησις (kyvérni̱sis, ‘government’) is the government while κυβερνήτης (kyverní̱ti̱s) is the governor or the captain.”) These models were on some level interesting and intuitive, even beautiful in their ambition. But they failed in their applications because social systems did not obey the kind of regularity that systems engineered for reliable equilibria did.
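The kind of regularity I mean fits in a few lines: a toy homeostat under negative feedback, the sort of system Wiener’s formalism was built for. This is entirely illustrative; the point is how reliably it settles.

```python
# A toy homeostat: negative feedback steering a state to a set point.
set_point = 20.0   # target, say a thermostat temperature
state = 5.0        # initial condition
gain = 0.3         # how strongly feedback corrects the error

for _ in range(30):
    error = set_point - state
    state += gain * error  # negative feedback: correct toward target

print(round(state, 3))  # ~20.0; the equilibrium is reliably reached
```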
The difficulty with applying these theories, which acknowledge the complexity of the social system, to reality is that they are only explanatory in retrospect, because of the path dependence of history. That’s pretty close to rendering them pseudoscientific.
Nevertheless, there are countless pressing societal challenges–climate change, unfair crime laws, war, political crisis, public health policy–on which social scientific research must be brought to bear, because there is a dimension to them which is a problem of predicting social action.
It is possible (I wonder if it’s necessary) that there are laws–perhaps just local laws–of social activity. Most people certainly believe there are. Business strategy, for example, depends on a great deal of theorizing about the market and the relationships between different companies and their products. If these laws exist, they must be operationalizable and discoverable in the data itself.
But there is the problem of the researcher’s effect on the system being observed and, even more confounding, the result of the researcher’s discovery on the system itself. When a social system becomes self-aware through a particular theoretical lens, it can change its behavior. (I’ve heard that Milton Friedman’s monetarist economics are fantastically predictive of economic growth in the United States right up until he published them.)
If reflexivity contributes to social entropy, then it’s not clear what the point of any social research agenda is.
The one exception I can think of is if an empirical principle of social organization is robust under social reflection. The goal would be to define an equilibrium state worth striving for, so that the society in question can accept it harmoniously as a norm.
This looks like relevant prior work–a lucky Google hit.