Digifesto

turns out network backbone markets in the US are competitive after all

I’ve been depressed lately about the oligopolistic control of telecommunications for a while now. There’s the Web We’ve Lost; there’s Snowden leaks; there’s the end of net neutrality. I’ll admit a lot of my moodiness about this has been just that–moodiness. But it was moodiness tied to a particular narrative.

In this narrative, power is transmitted via flows of information. Media is, if not determinative of public opinion, determinative of how that opinion is acted up. Surveillance is also an information flow. Broadly, mid-20th century telecommunications enabled mass culture due to the uniformity of media. The Internet’s protocols allowed it to support a different kind of culture–a more participatory one. But monetization and consolidation of the infrastructure has resulted in a society that’s fragmented but more tightly controlled.

There is still hope of counteracting that trend at the software/application layer, which is part of the reason why I’m doing research on open source software production. One of my colleagues, Nick Doty, studies the governance of Internet Standards, which is another piece of the puzzle.

But if the networking infrastructure itself is centrally controlled, then all bets are off. Democracy, in the sense of decentralized power with checks and balances, would be undermined.

Yesterday I learned something new from Ashwin Mathew, another colleague who studies Internet governance at the level of network administration. The man is deep in the process of finishing up his dissertation, but he looked up from his laptop for long enough to tell me that the network backbone market is in fact highly competitive at the moment. Apparently, there was a lot of dark fiberoptic cable (“dark fiber“–meaning, no light’s going through it) laid during the first dot-com boom, which has been laying fallow and getting bought up by many different companies. Since there are many routes from A to B and excess capacity, this market is highly competitive.

Phew! So why the perception of oligopolistic control of networks? Because the consumer-facing telecom end-points ARE an oligopoly. Here there’s the last-mile problem. When wire has to be laid to every house, the economies of scale are such that it’s hard to have competitive markets. Enter Comcast etc.

I can rest easier now, because I think that this means there’s various engineering solutions to this (like AirJaldi networks? though I think those still aren’t last mile…; mesh networks?) as well as political solutions (like a local government running its last mile network as a public utility).

Protected:

This content is password protected. To view it please enter your password below:

i’ve started working on my dissertation // diversity in open source // reflexive data science

I’m studying software development and not social media for my dissertation.

That’s a bit of a false dichotomy. Much software development happens through social media.

Which is really the point–that software development is a computer mediated social process.

What’s neat is that it’s a computer mediated social process that, at its best, creates the conditions for it to continue as a social process. c.f. Kelty’s “recursive public”

What’s also neat is that this is a significant kind of labor that is not easy to think about given the tools of neoclassical economics or anything else really.

In particular I’m focusing on the development of scientific software, i.e. software that’s made and used to improve our scientific understanding of the natural world and each other.

The data I’m looking at is communications data between developers and their users. I’m including the code, under version control, as this. In addition to being communication between developers, you might think of source code as a communication between developers and machines. The process of writing code as a collaboration or conversation between people and machines.

There is a lot of this data so I get to use computational techniques to examine it. “Data science,” if you like.

But it’s also legible, readable data with readily accessible human narrative behind it. As I debug my code, I am reading the messages sent ten years ago on a mailing list. Characters begin to emerge serendipitously because their email signatures break my archive parser. I find myself Googling them. “Who is that person?”

One email I found while debugging stood out because it was written, evidently, by a woman. Given the current press on diversity in tech, I thought it was an interesting example from 2001:

From sag at hydrosphere.com Thu Nov 29 15:21:04 2001
From: sag at hydrosphere.com (Sue Giller)
Date: Thu Nov 29 15:21:04 2001
Subject: [Numpy-discussion] Re: Using Reduce with Multi-dimensional Masked array
In-Reply-To: <000201c17917$ac5efec0$3d01a8c0@plstn1.sfba.home.com>
References: <20011129174809062.AAA210@mail.climatedata.com@SUEW2000>
Message-ID: <20011129232011546.AAA269@mail.climatedata.com@SUEW2000>

Paul,

Well, you’re right. I did misunderstand your reply, as well as what
the various functions were supposed to do. I was mis-using the
sum, minimum, maximum as tho they were MA..reduce, and
my test case didn’t point out the difference. I should always have
been doing the .reduce version.

I apologize for this!

I found a section on page 45 of the Numerical Python text (PDF
form, July 13, 2001) that defines sum as
‘The sum function is a synonym for the reduce method of the add
ufunc. It returns the sum of all the elements in the sequence given
along the specified axis (first axis by default).’

This is where I would expect to see a caveat about it not retaining
any mask-edness.

I was misussing the MA.minimum and MA.maximum as tho they
were .reduce version. My bad.

The MA.average does produce a masked array, but it has changed
the ‘missing value’ to fill_value=[ 1.00000002e+020,]). I do find this
a bit odd, since the other reductions didn’t change the fill value.

Anyway, I can now get the stats I want in a format I want, and I
understand better the various functions for array/masked array.

Thanks for the comments/input.

sue

I am trying to approach this project as a quantitative scientist. But the process of developing the software for analysis is putting me in conversation not just with the laptop I run the software on, but also the data. The data is a quantified representation–I count the number of lines, even the number of characters in a line as I construct the regular expression needed to parse the headers properly–but it represents a conversation in the past. As I write the software, I consult documentation written through a process not unlike the one I am examining, as well as Stack Overflow posts written by others who have tried to perform similar tasks. And now I am writing a blog post about this work. I will tweet a link of this out to my followers; I know some people from the Scientific Python community that I am studying follow me on Twitter. Will one of them catch wind of this post? What will they think of it?

autocatalysis sustains autopoeisis

Why we need good computational models of peace and love

“Data science” doesn’t refer to any particular technique.

It refers to the cusp of the diffusion of computational methods from computer science, statistics, and applied math (the “methodologists”) to other domains.

The background theory of these disciplines–whose origin we can trace at least as far back at cybernetics research in the 1940’s–is required to understand the validity of these “data science” technologies as scientific instruments, just as a theory of optics is necessary to know the validity of what is seen through a microscope. Kuhn calls these kinds of theoretical commitments “instrumental commitments.”

For most domain sciences, instrumental commitment to information theory, computer science, etc. is not problematic. It is more so with some social sciences which oppose the validity of totalizing physics or formalism.

There aren’t a lot of them left because our mobile phones more or less instrumentally commit us to the cybernetic worldview. Where there is room for alternative metaphysics, it is because of the complexity of emergent/functional properties of the cybernetic substrate. Brier’s Cybersemiotics is one formulation of how richer communicative meaning can be seen as a evolved structure on top of cybernetic information processing.

If “software is eating the world” and we don’t want it to eat us (metaphorically! I don’t think the robots are going to kill us–I think that corporations are going to build robots that make our lives miserable by accident), then we are going to need to have software that understands us. That requires building out cybernetic models of human communication to be more understanding of our social reality and what’s desirable in it.

That’s going to require cooperation between techies and humanists in a way that will be trying for both sides but worth the effort I think.

starting with a problem

The feedback I got on my dissertation prospectus draft when I presented it to my colleagues was that I didn’t start with a problem and then argue from there how my dissertation was going to be about a solution.

That was really great advice.

The problem of problem selection is a difficult. “What is a problem?” is a question that basically nobody asks. Lots of significant philosophical traditions maintain that it’s the perception of problems as problems that is the problem. “Just chill out,” say Great Philosophical Traditions. This does not help one orient ones research dissertation.

A lot of research is motivated by interest in particular problems like an engineering challenge or curing cancer. I’m somehow managed to never acquire the kind of expertise that would allow me to address any of these specific useful problems directly. My mistake.

I’m a social scientist. There are a lot of social problems, right? Of course. However, there’s a problem here that identifying any problems as problems in the social domain immediately implicates politics.

Are there apolitical social problems? I think I’ve found some. I had a great conversation last week with Anna Salamon about Global Catastrophic Risks. Those sound terrible! It echoes the work I used to do in support of Distaster Risk Reduction, except that there is more acknowledgment in the GCR space that some of the big risks are man-made.

So there’s a problem: arguably research into the solutions to these problems is good. On the other hand, that research is complicated by the political entanglement of the researchers, especially in the university setting. It took some convincing, but OK, those politics are necessarily part of the equation. Put another way, if there wasn’t the political complexity, then the hard problems wouldn’t be such hard problems. The hard problems are hard partly because they are so political. (This difference in emphasis is not meant to preclude other reasons why these problems are hard; for example, because people aren’t smart or motivated enough.)

Given that the political complexity is getting in the way of the efficiency of us solving hard problems–because these problems require collaboration across political lines, because the inherent politics of language choice and framing create complexity that is orthogonal to the problem solution (is it?), infrastructural solutions that manage that political complexity can be helpful.

(Counterclaim: the political complexity is not illogical complexity, rather scientific logic is partly political logic. We live in the best of all possible worlds. Just chill out. This is an empirical claim.)

The promise of computational methods to interdisciplinary collaboration is that they allow for more efficient distribution of cognitive labor across the system of investigators. Data science methodologists can build tools for investigation that work cross-disciplinarily, and the interaction between these tools can follow an a political logic in a way that discursive science cannot. Teleologically, we get an Internet of Scientific Things, and autonomous scientific aparatus, and draw your own eschatological conclusions.

An interesting consequence of algorithmically mediated communication is that you don’t actually need consensus to coordinate collective action. I suppose this is an argument Hayekians etc. have been making for a long time. However, the political maintenance of the system that ensures the appropriate incentive structures is itself prone to being hacked and herein lies the problem. That and the insufficiency of the total neurological market aparatus (in Hayek’s vision) to do anything like internalize the externalities of e.g. climate change, while the Bitcoin servers burn and burn and burn.

notes

This article is making me doubt some of my earlier conclusions about the role of the steering media. Habermas, I’ve got to concede, is dated. As much as skeptics would like to show how social media fails to ‘democratize’ media (not in the sense of being justly won by elections, but rather in the original sense of being mob ruled), the fragmentation is real and the public is reciprocally involved in its own narration.

What can then be said of the role of new media in public discourse? Here are some hypotheses:

  • As a first order effect, new media exacerbates shocks, both endogenous and exogenous. See Didier Sornette‘s work on application of self-excited Hawkes process to social systems like finance and Amazon reviews. (I’m indebted to Thomas Maillart for introducing me to this research.) This changes the dynamics because rather than being Poisson distributed, new media intervention is strategically motivated.
  • As a second order effect, since new media acting strategically, it must make predictive assessments of audience receptivity. New media suppliers must anticipate and cultivate demand. But demand is driven partly by environmental factors like information availability. See these notes on Dewey’s ethical theory for how taste can be due to environmental adaptation with no truly intrinsic desire–hence, the inappropriateness of modeling these dynamics straightforwardly with ‘utility functions’–which upsets neoclassical market modeling techniques. Hence the ‘social media marketer’ position that engages regularly in communication with an audience in order to cultivate a culture that is also a media market. Microcelebrity practices achieve not merely a passively received branding but an actively nurtured communicative setting. Communication here is transmission (Shannon, etc.) and/or symbolic interaction, on which community (Carey) supervenes.
  • Though not driven be neoclassical market dynamics simpliciter, new media is nevertheless competitive. We should expect new media suppliers to be fluidly territorial. The creates a higher-order incentive for curatorial intervention to maintain and distinguish ones audience as culture. A critical open question here is to what extent these incentives drive endogenous differentiation, vs. to what extent media fragmentation results in efficient allocation of information (analogously to efficient use of information in markets.) There is no a priori reason to suppose that the ad hoc assemblage of media infrastructures and regulations minimizes negative cultural externalities. (What are examples of negative cultural externalities? Fascism, ….)
  • Different media markets will have different dialects, which will have different expressive potential because of description lengths of concepts. (Algorithmic information theoretic interpretation of weak Sapir-Whorf hypothesis.) This is unavoidable because man is mortal (cannot approach convergent limits in a lifetime.) Some consequences (which have taken me a while to come around to, but here it is):
    1. Real intersubjective agreement is only provisionally and locally attainable.
    2. Language use, as a practical effect, has implications for future computational costs and therefore is intrinsically political.
    3. The poststructuralists are right after all. ::shakes fist at sky::
    4. That’s ok, we can still hack nature and create infrastructure; technical control resonates with physical computational layers that are not subject to wetware limitations. This leaves us, disciplinarily, with post-positivist engineering, post-structuralist hermeneutics enabling only provisional consensus and collective action (which can, at best, be ‘society made durable’ via technical implementation or cultural maintenance (see above on media market making), and critical reflection (advancing social computation directly).
  • There is a challenge to Pearl/Woodward causality here, in that mechanistic causation will be insensitive to higher-order effects. A better model for social causation would be Luhmann’s autopoieisis (c.f Brier, 2008). Ecological modeling (Ulanowicz) provides the best toolkit for showing interactions between autopoietic networks?

This is not helping me write my dissertation prospectus at all.

real talk

So I am trying to write a dissertation prospectus. It is going…OK.

The dissertation is on Evaluating Data Science Environments.

But I’ve been getting very distracted by the politics of data science. I have been dealing with the politics by joking about them. But I think I’m in danger of being part of the problem, when I would rather be part of the solution.

So, where do I stand on this, really?

Here are some theses:

  • There is a sense of “data science” that is importantly different from “data analytics”, though there is plenty of abuse of the term in an industrial context. That claim is awkward because industry can easily say they “own” the term. It would be useful to lay out specifically which computational methods constitute “data science” methods and which don’t.
  • I think that it’s useful analytically to distinguish different kinds of truth claims because it sheds light on the value of different kinds of inquiry. There is definitely a place for rigorous interpretive inquiry and critical theory in addition to technical, predictive science. I think politicing around these divisions is lame and only talk about it to make fun of the situation.
  • New computational science techniques have done and will continue to do amazing work in the physical and biological and increasingly environmental sciences. I am jealous of researchers in those fields because I think that work is awesome. For some reason I am a social scientist.
  • The questions surrounding the application of data science to social systems (which can include environmental systems) are very, very interesting. Qualitative researchers get defensive about their role in “the age of data science” but I think this is unwarranted. I think it’s the quantitative social science researchers who are likely more threatened methodologically. But since I’m not well-trained as a quantitative social scientist really, I can’t be sure of that.
  • The more I learn about research methods (which seems to be all I study these days, instead of actually doing research–I’m procrastinating), the more I’m getting a nuanced sense of how different methods are designed to address different problems. Jockeying about which method is better is useless. If there is a political battle I think is worth fighting any more, it’s the battle about whether or not transdisciplinary research is productive or possible. I hypothesize that it is. But I think this is an empirical question whose answer may be specific: how can different methods be combined effectively? I think this question gets quite deep and answering it requires getting into epistemology and statistics in a serious way.
  • What is disruptive about data science is that some people have dug down into statistics in a serious way, come up with a valid general way of analyzing things, and then automated it. That makes it in theory cheaper to pick up and apply than the quantitative techniques used by other researchers, and usable at larger scale. On the whole this is pretty good, though it is bad when people don’t understand how the tools they are using work. Automating science is a pretty good thing over all.
  • It’s really important for science, as it is automated, to be built on open tools and reproducible data because (a) otherwise there is no reason why it should have the public trust, (b) because it will remove barriers to training new scientists.
  • All scientists are going to need to know how to program. I’m very fortunate to have a technical background. A technical background is not sufficient to do science well. One can use technical skills to assist in both qualitative (visualization) and quantitative work. The ability to use tools is orthogonal to the ability to study phenomena, despite the historic connection between mathematics and computer science.
  • People conflate programming, which is increasingly a social and trade skill, with the ability to grasp high level mathematical concepts.
  • Computers are awesome. The people that make them better deserve the credit they get.
  • Sometimes I think: should I be in a computer science department? I think I would feel better about my work if I were in CS. I like the feeling of tangible progress and problem solving. I think there are a lot of really important problems to solve, and that the solutions will likely come from computer science related work. What I think I get from being in a more interdisciplinary department is a better understanding of what problems are worth solving. I don’t mean that in a way that diminishes the hard work of problem solving, which I think is really where the rubber hits the road. It is easy to complain. I don’t work as hard as computer science students. I also really like being around women. I think they are great and there aren’t enough of them in computer science departments.
  • I’m interested in modeling and improving the cooperation around open scientific software because that’s where I see there some real potential value add. I’ve been and engineer and I’ve managed engineers. Managing engineers is a lot harder than engineering, IMO. That’s because management requires navigating a social system. Social systems are really absurdly complicated compared to even individual organisms.
  • There are three reasons why it might be bad to apply data science to social systems. The first is that it could lead to extraordinarily terrible death robots. My karma is on the line. The second is that the scientific models might be too simplistic and lead to bad decisions that are insensitive to human needs. That is why it is very, very important that the existing wealth of social scientific understanding is not lost but rather translated into a more robust and reproducible form. The third reason is that social science might be in principle impossible due to its self-referential effects. This would make the whole enterprise a collosal waste of time. The first and third reasons frequently depress me. The second motivates me.
  • Infrastructure and mechanism design are powerful means of social change, perhaps the most powerful. Movements are important but civil society is so paralyzed by the steering media now that it is more valuable to analyze movements as sociotechnical organizations alongside corporations etc. than to view them in isolation from the technical substrate. There are a variety of ideological framings of this position, each with different ideological baggage. I’m less concerned with that, ultimately, than the pragmatic application of knowledge. I wish people would stop having issues with “implications for design.”
  • I said I wanted to get away from politics, but this is one other political point I actually really think is worth making, though it is generally very unpopular in academia for obvious reasons: the status differential between faculty and staff is an enormous part of the problem of the disfunction of universities. A lot of disciplinery politics are codifications of distaste for certain kinds of labor. In many disciplines, graduate students perform labor unexpertly in service of their lab’s principal investigators; this labor is a way of paying ones dues that has little to do with the intellectual work of their research expertise. Or is it? It’s entirely unclear, especially when what makes the difference between a good researcher and a great one are skills that have nothing to do with their intellectual pursuit, and when master new tools is so essential for success in ones field. But the PIs are often not able to teach these tools. What is the work of research? Who does it? Why do we consider science to be the reserve of a specialized medieval institution, and call it something else when it is done by private industry? Do academics really have a right to complain about the rise of the university administrative class?

Sorry, that got polemical again.

Data Science: It Gets the Truth!

What follows is the first draft of the introduction to my upcoming book, Data Science: It Gets the Truth! This book will be the popularized version of my dissertation, based on my experiences at the School of Information and UC Berkeley. I’m really curious to know what you think!

There are two kinds of scientists in the world: the truth haters, and the truth getters.

You can tell who is a truth hater by asking them: “With your work, are you trying to find something that’s true?”

A truth hater will tell you that there is no such thing as truth, or that the idea of truth is a problematic bourgeois masculinist social construct, or that truth is relative and so no, not exactly, they probably don’t mean the same thing as you do when you say ‘truth’.

Obviously, these people hate the truth. Hence, “truth haters.”

Then there are the truth getters. You ask a truth getter whether they are trying to discover the truth, and they will say “Hell yeah!” Or, more simply, “yes, that is correct.”

Truth getters love the truth. The truth is great; it’s the point of science. They get that. Hence, “truth getters.”

We are at an amazing, unique time in history. Here, at the dawn of the 21st century, we have very powerful computers and extraordinary networks of communication like never before. This means science is going through some unprecedented changes.

One of those changes is that scientists are realizing that they’ve been fighting about nothing for a long time. Scientists used to think they had to be different from each other in order to study different things. But now, we know that there is only one good way to study anything, and that is machine learning. Soon, all scientists are going to be data scientists, because science is discovering that all things can be represented as data and studied with machine learning.

Well, not all scientists. I should be more precise. I was just talking about the truth getters. Because machine learning is how we can discover the truth about everything, and truth getters get that.

Truth haters, on the other hand, hate how good machine learning is at discovering the truth about everything. Silly truth haters! One day, they will get their funding cut.

In this book, Data Science: It Gets the Truth! you will learn how you too can be a data scientist and learn the truth about things. Get it? Great! Let’s go!

Kolmogorov Complexity is fucking brilliant

Dear Everybody Who Thinks,

Have you ever heard of Kolmogorov Complexity?

It is one of the most amazing concepts ever invented. I shit you not.

But first: if you think about it, we are all basically becoming part of one big text, right? Like, that is what we mean when we say “Big Data”. It’s that we are rapidly turning everything into a string of letters or numbers (they are roughly the same) which we then reinterpret.

Sometimes those strings are people’s creation/expression. Some of them are sensed from nature.

We interpret them by processing them through a series of filters, or lenses. Or, algorithms. Or what some might call “perceptual concepts.” They result in taking action. You could call it cybernetic action. Just as the perception of pain causes through the nerves a reflexive leap away from the stimulus, the perception of spam through an algorithmic classifier can trigger an algorithmic parry.

Turns out, if everything is text that we process with computers, there’s a really amazing metric of complexity that applies to strings of characters. That is super. Because, let’s face it, most of the time when people talk about “complexity” they aren’t using the term very precisely at all. That is too bad, because everything is really complicated, but some things are more complicated than others. It could be useful–maybe absurdly useful–to have a good way to talk about how differently complicated things are.

There is a good way. A good way called Kolmogorov Complexity. This is the definition of it provided by Wolfram Mathworld:

The complexity of a pattern parameterized as the shortest algorithm required to reproduce it. Also known as bit complexity.

I am–and this is totally a distraction but whatever–totally flummoxed as to how this definition could ever have been written. Because half of it, the first sentence, is the best humanese definition I’ve ever heard, maybe, until I reread it and realized I don’t really get the sense in which they are using “parameterized.” Do you?

But essentially, a pattern’s (or text’s) Kolmogorov complexity is the length of the shortest algorithm that produces it. Length, here, as in length of that algorithm represented as another text–i.e., in software code. (Of course you need a reader/observer/language/interpreter that can understand the code and react by writing/showing/speaking the pattern, which is where it gets complicated, but…) Where was I? (Ever totally space out come back to wondering whether or not you closed a parenthesis?)

No seriously, where was I?

Ok, so here’s the thing–a pattern’s Kolmogorov complexity is always less than or equal to its length plus a small constant. So for example, this string:

6ORoQK2szdpxArct4dja
HleN4dXTJJNPtd5qAxTg
qbIeRRYZm6NNCw9jr4J4
fGLdOyIjtbQ8RRymaFiN
oAsOHf7VkQvNkZl6ZW72

is a string of 100 random characters and there is no good way to summarize it. So, the Kolmogorov complexity of is just over 100, or the length of this string:

print("6ORoQK2szdpxArct4djaHleN4dXTJJNPtd5qAxTgqbIeRRYZm6NNCw9jr4J4fGLdOyIjtbQ8RRymaFiNoAsOHf7VkQvNkZl6ZW72")

If you know anything about programming at all (and I’m not assuming you do) you should know that the program print("Hello World!") returns as an output the string “Hello World!”. This is traditionally the first program you learn to write in any language. If you have never written a “Hello World!” program, I strongly recommend that you do. It should probably be considered a requirement for digital literacy. Here is a tutorial on how to do this for JavaScript, the language that runs in your web browser.

Now take for example this totally different string of the same length as the first one:

00000000000000000000
00000000000000000000
00000000000000000000
00000000000000000000
00000000000000000000

You might notice something about this string of numbers. Can you guess what it is?

If you guessed, “It’s really simple,” you’re right! That is the correct intuition.

A way to formalize this intuition is to talk about the shortest length of its description. You can describe it in a lot fewer than 100 characters. In English, you could write “A hundred zeros”, which is 15 characters. In the pseudocode language I’m making up for this blog post, you could write:

print(repeat("0",100))

Which is pretty short.

It turns out, understanding Kolmogorov complexity is like understanding one of the fundamental secrets of the universe. If you don’t believe me, ask the guy who invented BitCoin, whose website links to the author of the best textbook on the subject. Side note: Isn’t it incredible that the person who doxedo Satoshi Nakamoto did it through automated text analysis? Yes.

If you are unconvinced that Kolmogorov complexity is one of the most amazing ideas ever, I encourage you look further into how it plays into computational learning theory through the Minimum Description Length principle, or read this helpful introduction to Solomonoff Induction a mathematical theory of inductive inference that is as powerful as Bayesian updating.

Follow

Get every new post delivered to your Inbox.

Join 844 other followers