I’m studying software development and not social media for my dissertation.
That’s a bit of a false dichotomy. Much software development happens through social media.
Which is really the point–that software development is a computer-mediated social process.
What’s neat is that it’s a computer-mediated social process that, at its best, creates the conditions for its own continuation as a social process. Cf. Kelty’s “recursive public.”
What’s also neat is that this is a significant kind of labor that is not easy to think about given the tools of neoclassical economics or anything else really.
In particular I’m focusing on the development of scientific software, i.e. software that’s made and used to improve our scientific understanding of the natural world and each other.
The data I’m looking at is communications data between developers and their users. I’m including the code itself, under version control, as part of this. In addition to being communication between developers, you might think of source code as communication between developers and machines, and the process of writing code as a collaboration or conversation between people and machines.
There is a lot of this data so I get to use computational techniques to examine it. “Data science,” if you like.
But it’s also legible, readable data with readily accessible human narrative behind it. As I debug my code, I am reading the messages sent ten years ago on a mailing list. Characters begin to emerge serendipitously because their email signatures break my archive parser. I find myself Googling them. “Who is that person?”
One email I found while debugging stood out because it was written, evidently, by a woman. Given the current press on diversity in tech, I thought it was an interesting example from 2001:
From sag at hydrosphere.com Thu Nov 29 15:21:04 2001
From: sag at hydrosphere.com (Sue Giller)
Date: Thu Nov 29 15:21:04 2001
Subject: [Numpy-discussion] Re: Using Reduce with Multi-dimensional Masked array
In-Reply-To: <000201c17917$ac5efec0$3d01a8c0@plstn1.sfba.home.com>
References: <20011129174809062.AAA210@mail.climatedata.com@SUEW2000>
Message-ID: <20011129232011546.AAA269@mail.climatedata.com@SUEW2000>

Paul,

Well, you’re right. I did misunderstand your reply, as well as what
the various functions were supposed to do. I was mis-using the
sum, minimum, maximum as tho they were MA..reduce, and
my test case didn’t point out the difference. I should always have
been doing the .reduce version.

I apologize for this!

I found a section on page 45 of the Numerical Python text (PDF
form, July 13, 2001) that defines sum as

‘The sum function is a synonym for the reduce method of the add
ufunc. It returns the sum of all the elements in the sequence given
along the specified axis (first axis by default).’

This is where I would expect to see a caveat about it not retaining
any mask-edness.

I was misussing the MA.minimum and MA.maximum as tho they
were .reduce version. My bad.

The MA.average does produce a masked array, but it has changed
the ‘missing value’ to fill_value=[ 1.00000002e+020,]). I do find this
a bit odd, since the other reductions didn’t change the fill value.

Anyway, I can now get the stats I want in a format I want, and I
understand better the various functions for array/masked array.

Thanks for the comments/input.

sue
I am trying to approach this project as a quantitative scientist. But the process of developing the software for analysis is putting me in conversation not just with the laptop I run the software on, but also with the data. The data is a quantified representation–I count the number of lines, even the number of characters in a line as I construct the regular expression needed to parse the headers properly–but it represents a conversation in the past. As I write the software, I consult documentation written through a process not unlike the one I am examining, as well as Stack Overflow posts written by others who have tried to perform similar tasks. And now I am writing a blog post about this work. I will tweet a link to this post out to my followers; I know some people from the Scientific Python community that I am studying follow me on Twitter. Will one of them catch wind of this post? What will they think of it?
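Concretely, the parsing looks something like this. This is a minimal sketch in the spirit of my parser rather than the thing itself, assuming a pipermail-style text archive like the one the email above came from; all the names here are illustrative.

```python
import re

# Pipermail-style archives separate messages with lines like:
#   From sag at hydrosphere.com Thu Nov 29 15:21:04 2001
# Note the " at " obfuscation in place of "@".
FROM_LINE = re.compile(
    r"^From \S+ at \S+ "          # obfuscated sender address
    r"\w{3} \w{3} [ \d]\d "       # weekday, month, space-padded day
    r"\d{2}:\d{2}:\d{2} \d{4}$"   # time and year
)

def split_messages(archive_text):
    """Split a raw pipermail archive into per-message chunks."""
    messages, current = [], []
    for line in archive_text.splitlines():
        # A naive split: anything in a message body that happens to
        # look like a separator (a quoted reply, an odd signature)
        # will break this. That is how the characters emerge.
        if FROM_LINE.match(line) and current:
            messages.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        messages.append("\n".join(current))
    return messages
```

Per-message header parsing (From:, Date:, Subject:) then proceeds with further expressions of the same kind.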
autocatalysis sustains autopoiesis
“Data science” doesn’t refer to any particular technique.
It refers to the cusp of the diffusion of computational methods from computer science, statistics, and applied math (the “methodologists”) to other domains.
The background theory of these disciplines–whose origin we can trace at least as far back as cybernetics research in the 1940s–is required to understand the validity of these “data science” technologies as scientific instruments, just as a theory of optics is necessary to know the validity of what is seen through a microscope. Kuhn calls these kinds of theoretical commitments “instrumental commitments.”
For most domain sciences, instrumental commitment to information theory, computer science, etc. is not problematic. It is more so for some social sciences, which contest the validity of totalizing physics or formalism.
There aren’t a lot of them left, because our mobile phones more or less instrumentally commit us to the cybernetic worldview. Where there is room for alternative metaphysics, it is because of the complexity of emergent/functional properties of the cybernetic substrate. Brier’s Cybersemiotics is one formulation of how richer communicative meaning can be seen as an evolved structure on top of cybernetic information processing.
If “software is eating the world” and we don’t want it to eat us (metaphorically! I don’t think the robots are going to kill us–I think that corporations are going to build robots that make our lives miserable by accident), then we are going to need software that understands us. That requires building out cybernetic models of human communication so that they better comprehend our social reality and what’s desirable in it.
That’s going to require cooperation between techies and humanists in a way that will be trying for both sides, but worth the effort, I think.
The feedback I got on my dissertation prospectus draft when I presented it to my colleagues was that I didn’t start with a problem and then argue from there how my dissertation was going to be about a solution.
That was really great advice.
The problem of problem selection is a difficult one. “What is a problem?” is a question that basically nobody asks. Lots of significant philosophical traditions maintain that it’s the perception of problems as problems that is the problem. “Just chill out,” say the Great Philosophical Traditions. This does not help one orient one’s dissertation research.
A lot of research is motivated by interest in particular problems like an engineering challenge or curing cancer. I’ve somehow managed never to acquire the kind of expertise that would allow me to address any of these specific useful problems directly. My mistake.
I’m a social scientist. There are a lot of social problems, right? Of course. However, there’s a problem here: identifying any problem as a problem in the social domain immediately implicates politics.
Are there apolitical social problems? I think I’ve found some. I had a great conversation last week with Anna Salamon about Global Catastrophic Risks. Those sound terrible! It echoes the work I used to do in support of Disaster Risk Reduction, except that there is more acknowledgment in the GCR space that some of the big risks are man-made.
So there’s a problem: arguably research into the solutions to these problems is good. On the other hand, that research is complicated by the political entanglement of the researchers, especially in the university setting. It took some convincing, but OK, those politics are necessarily part of the equation. Put another way, if there wasn’t the political complexity, then the hard problems wouldn’t be such hard problems. The hard problems are hard partly because they are so political. (This difference in emphasis is not meant to preclude other reasons why these problems are hard; for example, because people aren’t smart or motivated enough.)
These hard problems require collaboration across political lines, and the inherent politics of language choice and framing create complexity that is orthogonal to the problem’s solution (is it?). Given that this political complexity gets in the way of solving the problems efficiently, infrastructural solutions that manage it can be helpful.
(Counterclaim: the political complexity is not illogical complexity, rather scientific logic is partly political logic. We live in the best of all possible worlds. Just chill out. This is an empirical claim.)
The promise of computational methods for interdisciplinary collaboration is that they allow for more efficient distribution of cognitive labor across the system of investigators. Data science methodologists can build tools for investigation that work cross-disciplinarily, and the interaction between these tools can follow an apolitical logic in a way that discursive science cannot. Teleologically, we get an Internet of Scientific Things and autonomous scientific apparatus; draw your own eschatological conclusions.
An interesting consequence of algorithmically mediated communication is that you don’t actually need consensus to coordinate collective action. I suppose this is an argument Hayekians etc. have been making for a long time. However, the political maintenance of the system that ensures the appropriate incentive structures is itself prone to being hacked, and herein lies the problem. That, and the insufficiency of the total neurological market apparatus (in Hayek’s vision) to do anything like internalize the externalities of, e.g., climate change, while the Bitcoin servers burn and burn and burn.
So I am trying to write a dissertation prospectus. It is going…OK.
The dissertation is on Evaluating Data Science Environments.
But I’ve been getting very distracted by the politics of data science. I have been dealing with the politics by joking about them. But I think I’m in danger of being part of the problem, when I would rather be part of the solution.
So, where do I stand on this, really?
Here are some theses:
Sorry, that got polemical again.
What follows is the first draft of the introduction to my upcoming book, Data Science: It Gets the Truth! This book will be the popularized version of my dissertation, based on my experiences at the School of Information and UC Berkeley. I’m really curious to know what you think!
There are two kinds of scientists in the world: the truth haters, and the truth getters.
You can tell who is a truth hater by asking them: “With your work, are you trying to find something that’s true?”
A truth hater will tell you that there is no such thing as truth, or that the idea of truth is a problematic bourgeois masculinist social construct, or that truth is relative and so no, not exactly, they probably don’t mean the same thing as you do when you say ‘truth’.
Obviously, these people hate the truth. Hence, “truth haters.”
Then there are the truth getters. You ask a truth getter whether they are trying to discover the truth, and they will say “Hell yeah!” Or, more simply, “yes, that is correct.”
Truth getters love the truth. The truth is great; it’s the point of science. They get that. Hence, “truth getters.”
We are at an amazing, unique time in history. Here, at the dawn of the 21st century, we have very powerful computers and extraordinary networks of communication like never before. This means science is going through some unprecedented changes.
One of those changes is that scientists are realizing that they’ve been fighting about nothing for a long time. Scientists used to think they had to be different from each other in order to study different things. But now, we know that there is only one good way to study anything, and that is machine learning. Soon, all scientists are going to be data scientists, because science is discovering that all things can be represented as data and studied with machine learning.
Well, not all scientists. I should be more precise. I was just talking about the truth getters. Because machine learning is how we can discover the truth about everything, and truth getters get that.
Truth haters, on the other hand, hate how good machine learning is at discovering the truth about everything. Silly truth haters! One day, they will get their funding cut.
In this book, Data Science: It Gets the Truth!, you will learn how you too can be a data scientist and learn the truth about things. Get it? Great! Let’s go!
I’m facing a challenging paradox in how to approach my research.
On the one hand, we have the trend of increasing instrumentation of society. From quantified self to the Internet of things to Netflix clicks to the fully digitized archives of every newspaper, we have more data than we’ve ever had before to ask fundamental social scientific questions.
That should make it easier to research society and infer principles about how it works. But there is a long-standing counterpoint in the social sciences that claims that all social phenomena are sui generis and historically situated. If no social phenomenon generalizes, then it shouldn’t be possible to infer anything from the available data, no matter how much of it there is.
One view is that we should only be able to infer stuff that isn’t very interesting at all. One name for this view is “punctuated equilibrium.” The national borders of countries don’t move around…until they do. Regimes don’t change…until they do. It’s the ability to predict these kinds of political events that Philip Tetlock has called “expert political judgment.” The Good Judgment Project is a test to see what properties make a person or team of people good at this kind of task.
What now seems like many years ago, I wrote a book review of Tetlock’s book. In that review, I pointed out a facet of Tetlock’s research I found most compelling but underdeveloped: that the best predictors he found were algorithmic predictors that drew their conclusions from linear regressions over just the top three or so salient features in the data.
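To make that concrete, here is a minimal sketch of that kind of predictor, in the spirit of what Tetlock describes rather than his actual procedure: select a handful of salient features, fit an ordinary linear regression on them, and stop. The data below are synthetic.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for a forecasting dataset: 30 candidate
# features, of which only the first two actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Keep only the three most salient features (by univariate F-test),
# then fit a plain linear regression on them.
model = make_pipeline(SelectKBest(f_regression, k=3), LinearRegression())
model.fit(X, y)
print(model.score(X, y))  # in-sample R^2
```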
Six or so years later, Big Data is a powerful enough industrial and political phenomenon that academic social science feels it needs to catch up. But to a large extent, industrial data science is still about using pretty basic statistical models drawn from physics (models that assume everything stands in Gaussian relations to everything else, say), or otherwise applying a broad range of modeling techniques and aggregating them under statistical boosting. This is great for edging out the competition on selling ads.
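For contrast, a sketch of that aggregate-many-weak-models style of industrial practice, here using gradient boosting over shallow trees on synthetic data; it stands in for the kind of pipeline I mean, not any particular one.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic ad-click-style classification problem.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Boosting fits a sequence of shallow trees, each one correcting the
# residual errors of the ensemble built so far.
clf = GradientBoostingClassifier(n_estimators=200, max_depth=2)
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te))  # held-out accuracy
```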
But it tells us nothing about the underlying structure of what’s going on in society. And it’s possible that the fact that we haven’t done any better is really a condemnation of the whole process of social science in general. The data we are getting, rather than making us understand what’s going on around us better, is perhaps just proving to us that it’s a complex chaotic system. If so, the better we understand it, the more we will lose our confidence in our ability to predict it.
Historically, we’ve been through all this before. The mid-20th century saw the expansion of the scope of Norbert Wiener’s cybernetics from the electrical engineering of homeostatic machines to the modeling of the political system and the economy as complex feedback systems. Indeed, cybernetics was intended as a theory of steering systems by thinking about their communication mechanisms. (Wikipedia: “The word ‘cybernetics’ comes from the Greek word κυβερνητική (kyverni̱tikí̱, ‘government’), i.e. all that are pertinent to κυβερνώ (kyvernó̱), the latter meaning to ‘steer,’ ‘navigate’ or ‘govern,’ hence κυβέρνησις (kyvérni̱sis, ‘government’) is the government while κυβερνήτης (kyverní̱ti̱s) is the governor or the captain.”) These models were on some level interesting and intuitive, even beautiful in their ambition. But they failed in their applications because social systems did not obey the kind of regularity that systems engineered for reliable equilibria did.
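The kind of regularity I mean fits in a few lines: a toy homeostat under negative feedback, the sort of system Wiener’s formalism was built for. This is entirely illustrative; the point is how reliably it settles.

```python
# A toy homeostat: negative feedback steering a state to a set point.
set_point = 20.0   # target, say a thermostat temperature
state = 5.0        # initial condition
gain = 0.3         # how strongly feedback corrects the error

for _ in range(30):
    error = set_point - state
    state += gain * error  # negative feedback: correct toward target

print(round(state, 3))  # ~20.0; the equilibrium is reliably reached
```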
The difficulty with applying these theories, which acknowledge the complexity of the social system, to reality is that they are only explanatory in retrospect, because of the path dependence of history. That’s pretty close to rendering them pseudoscientific.
Nevertheless, there are countless pressing societal challenges–climate change, unfair crime laws, war, political crisis, public health policy–on which social scientific research must be brought to bear, because there is a dimension to them which is a problem of predicting social action.
It is possible (I wonder if it’s necessary) that there are laws–perhaps just local laws–of social activity. Most people certainly believe there are. Business strategy, for example, depends on a great deal of theorizing about the market and the relationships between different companies and their products. If these laws exist, they must be operationalizable and discoverable in the data itself.
But there is the problem of the researcher’s effect on the system being observed and, even more confounding, the result of the researcher’s discovery on the system itself. When a social system becomes self-aware through a particular theoretical lens, it can change its behavior. (I’ve heard that Milton Friedman’s monetarist economics are fantastically predictive of economic growth in the United States right up until he published them.)
If reflexivity contributes to social entropy, then it’s not clear what the point of any social research agenda is.
The one exception I can think of is if an empirical principle of social organization is robust under social reflection. The goal would be to define an equilibrium state worth striving for, so that the society in question can accept it harmoniously as a norm.
This looks like relevant prior work–a lucky Google hit.