Digifesto

prediction and computational complexity

To the extent that an agent is predictable, it must:

  • be observable, and
  • have a knowable internal structure

The first implies that the predictor has collected data emitted by the agent.

The second implies that the agent has internal structure and that the predictor has the capacity to represent the internal structure of the other agent.

In general, we can say that people do not have the capacity to explicitly represent other people very well. People are unpredictable to each other. This is what makes us free. When somebody is utterly predictable to us, their rigidity is a sign of weakness or stupidity. They are following a simple algorithm.

We are able to model the internal structure of worms with available computing power.

As we build more and more powerful predictive systems, we can ask: is our internal structure in principle knowable by this powerful machine?

This is different from the question of whether or not the predictive machine has data from which to draw inferences. Though of course the questions are related in their implications.

I’ve tried to make progress on modeling this, with limited success. Spiros has just told me about binary decision diagrams, which are a promising lead.
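
As a rough illustration of why binary decision diagrams might help here, the sketch below treats an agent as a boolean function of its inputs and uses the number of nodes in a reduced ordered BDD as one possible proxy for how much structure a predictor would have to represent. The function name `build_robdd`, the fixed variable order, and the choice of node count as a measure are all assumptions made for this example, not something the post commits to. The construction enumerates every input, so it only works for tiny examples.

```python
def build_robdd(f, n):
    """Build a reduced ordered BDD for a boolean function f of n arguments.

    Hypothetical helper for illustration. Returns (root, nodes), where nodes
    maps a node id to (variable index, low child, high child). Ids 0 and 1
    are the False and True terminals.
    """
    unique = {}  # (var, low, high) -> node id, so equal subgraphs are shared
    nodes = {}

    def mk(var, low, high):
        if low == high:
            # Both branches agree, so testing this variable is redundant.
            return low
        key = (var, low, high)
        if key not in unique:
            node_id = len(unique) + 2
            unique[key] = node_id
            nodes[node_id] = key
        return unique[key]

    def build(assignment):
        var = len(assignment)
        if var == n:
            return 1 if f(*assignment) else 0
        low = build(assignment + (False,))
        high = build(assignment + (True,))
        return mk(var, low, high)

    return build(()), nodes


# An "agent" that ignores everything but its first input needs one node;
# one that computes the parity of three inputs needs more.
_, simple = build_robdd(lambda a, b, c: a, 3)
_, parity = build_robdd(lambda a, b, c: a ^ b ^ c, 3)
print(len(simple), len(parity))
```

On this reading, an agent "following a simple algorithm" is one whose diagram stays small, while an agent whose diagram grows quickly with the number of inputs is harder for a bounded predictor to represent.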

objective properties of text and robot scientists

One problem with having objectivity as a scientific goal is that it may be humanly impossible.

One area where this comes up is in the reading of a text. To read is to interpret, and it is impossible to interpret without bringing one’s own concepts and experience to bear on the interpretation. This introduces partiality.

This is one reason why Digital Humanities are interesting. In Digital Humanities, one is using only the objective properties of the text–its data as a string of characters and its metadata. Semantic analysis is reduced to a study of a statistical distribution over words.
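
A minimal sketch of what "a statistical distribution over words" can mean in practice, assuming a deliberately crude tokenizer (lowercase runs of letters and apostrophes). The function name `word_distribution` is made up for this example; real Digital Humanities pipelines make many more choices about tokenization and normalization.

```python
import re
from collections import Counter


def word_distribution(text):
    """Return the relative frequency of each word in a string of characters."""
    # Crude tokenization (an assumption of this sketch): lowercase the text
    # and keep runs of ASCII letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


dist = word_distribution("To read is to interpret")
print(sorted(dist.items(), key=lambda item: -item[1]))
```

Nothing in the computation depends on who is reading: the same string always yields the same distribution, which is the sense of objectivity at stake here.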

An odd conclusion: the objective scientific subject won’t be a human intelligence at all. It will need to be a robot. Its concepts may never be interpretable by humans because any individual human is too small-minded or restricted in their point of view to understand the whole.

Looking at the history of cybernetics, artificial intelligence, and machine learning, we can see the progression of a science dedicated to understanding the abstract properties of an idealized, objective learner. That systems such as these underlie the infrastructure we depend on for the organization of society is a testament to their success.

A troubling dilemma

I’m troubled by the following dilemma:

On the one hand, serendipitous exposure to views unlike your own is good, because that increases the breadth of perspective that’s available to you. You become more cosmopolitan and tolerant.

On the other hand, exposure to views that are hateful, stupid, or evil can be bad, because this can be hurtful, misinforming, or disturbing. Broadly, content can harm.

So, suppose you are deciding what to expose yourself to, or others to, either directly or through the design of some information system.

This requires making a judgment about whether exposure to that perspective will be good or bad.

How is it possible to make that judgment without already having been exposed to it?

Put another way, filter bubbles are sometimes good and sometimes bad. How can you tell, from within a bubble, whether bridging to another bubble is worthwhile? How could you tell from outside of a bubble? Is there a way to derive this from the nature of bubbles in the abstract?

writing about writing

Years ago on a now defunct Internet forum, somebody recommended that I read a book about the history of writing and its influence on culture.

I just spent ten minutes searching through my email archives trying to find the reference. I didn’t find it.

I’ve been thinking about writing a lot lately. And I’ve been thinking about writing especially tonight, because I was reading this essay that is in a narrow sense about Emily Gould but in a broad sense is about writing.*

I used to find writing about writing insufferable because I thought it was lazy. Only writers with nothing to say about anything else write about writing.

I don’t disagree with that sentiment tonight. Instead I’ve succumbed to the idea that actually writing is a rather specialized activity that is perhaps special because it affords so much of an opportunity to scrutinize and rescrutinize in ways that everyday social interaction does not. By everyday social interaction, I mean specifically the conversations I have with people that are physically present. I am not referring to the social interactions that I conduct through writing with sometimes literally hundreds of people at a time, theoretically, but actually more on the order of I don’t know twenty, every day.

The whole idea that you are supposed to edit what you write before you send it presupposes a reflective editorial process where text, as a condensed signal, is the result of an optimization process over possible interpretations that happens before it is ever emitted. The conscious decision to not edit text as one writes it is difficult if not impossible for some people but for others more…natural. Why?

The fluidity with which writing can morph genres today–it’s gossip, it’s journalism, it’s literature, it’s self expression reflective of genuine character, it’s performance of an assumed character, it’s…–is I think something new.


* Since writing this blog post, I have concluded that this article is quite evil.

It all comes back to Artificial Intelligence

I am blessed with many fascinating conversations every week. Because of the field I am in, these conversations are mainly about technology and people and where they intersect.

Sometimes they are about philosophical themes like how we know anything, or what is ethical. These topics are obviously relevant to an academic researcher, especially when one is interested in computational social science, a kind of science whose ethics have lately been called into question. Other times they are about the theoretical questions that such a science should or could address, like: how do we identify leaders? Or determine the ingredients of a thriving community? What is creativity, and how can we mathematically model how it arises from social interaction?

Sometimes the conversations are political. Is it a problem that algorithms are governing more of our political lives and culture? If so, what should we do about it?

The richest and most involved conversations, though, are about artificial intelligence (AI). As a term, it has fallen out of fashion. I was very surprised to see it as a central concept in Bengio et al.’s “Representation Learning: A Review and New Perspectives” [arXiv]. In most discussions of scientific computing or ‘data science’, people have for the most part abandoned the idea of intelligent machines. Perhaps this is because so many of the applications of this technology seem so prosaic now. Curating newsfeeds, for example. That can’t be done intelligently. That’s just an algorithm.

Never mind that the origins of all of what we now call machine learning were in the AI research program, which is as old as computer science itself and really has grown up with it. Marvin Minsky famously once defined artificial intelligence as ‘whatever humans still do better than computers.’ And this is the curse of the field. Every technological advance that is at the time mind-blowingly powerful, performing a task that used to require hundreds of people, very shortly becomes mere technology.

It’s appropriate then that representation learning, the problem of deriving and selecting features from a complex data set that are valuable for other kinds of statistical analysis in other tasks, is brought up in the context of AI. Because this is precisely the sort of thing that people still think they are comparatively good at. A couple years ago, everyone was talking about the phenomenon of crowdsourced image tagging. People are better at seeing and recognizing objects in images than computers, so in order to, say, provide the data for Google’s Image search, you still need to mobilize lots of people. You just have to organize them as if they were computer functions so that you can properly aggregate their results.

One of the earliest tasks posed to AI, the Turing Test, proposed by and named after Alan Turing, the inventor of the fricking computer, is the task of engaging in conversation as if one is a human. This is harder than chess. It is harder than reading handwriting. Something about human communication is so subtle that it has withstood the test of time as an unsolved problem.

Until June of this year, when a program passed the Turing Test in the annual competition. Conversation is no longer something intelligent. It can be performed by a mere algorithm. Indeed, I have heard that a lot of call centers now use scripted dialog. An operator pushes buttons guiding the caller through a conversation that has already been written for them.

So what’s next?

I have a proposal: software engineering. We still don’t have an AI that can write its own source code.

How could we create such an AI? We could use machine learning, training it on data. What’s amazing is that we have vast amounts of data available on what it is like to be a functioning member of a software development team. Open source software communities have provided an enormous corpus of what we can guess is some of the most complex and interesting data ever created. Among other things, this software includes source code for all kinds of other algorithms that were once considered AI.

One reason why I am building BigBang, a toolkit for the scientific analysis of software communities, is that I believe it’s the first step to a better understanding of this very complex and still intelligent process.

Above I have framed AI pessimistically, as what we delegate away from people to machines, but that is unnecessarily grim. In fact, with every advance in AI we have come to a better understanding of our world and how we see, hear, think, and do things. The task of trying to scientifically understand how we create together and the task of developing an AI to create with us are in many ways the same task. It’s just a matter of how you look at it.

i’ve started working on my dissertation // diversity in open source // reflexive data science

I’m studying software development and not social media for my dissertation.

That’s a bit of a false dichotomy. Much software development happens through social media.

Which is really the point–that software development is a computer mediated social process.

What’s neat is that it’s a computer mediated social process that, at its best, creates the conditions for it to continue as a social process. Cf. Kelty’s “recursive public.”

What’s also neat is that this is a significant kind of labor that is not easy to think about given the tools of neoclassical economics or anything else really.

In particular I’m focusing on the development of scientific software, i.e. software that’s made and used to improve our scientific understanding of the natural world and each other.

The data I’m looking at is communications data between developers and their users. I count the code, under version control, as part of this data. In addition to being communication between developers, you might think of source code as a communication between developers and machines. The process of writing code is then a collaboration or conversation between people and machines.

There is a lot of this data so I get to use computational techniques to examine it. “Data science,” if you like.

But it’s also legible, readable data with readily accessible human narrative behind it. As I debug my code, I am reading the messages sent ten years ago on a mailing list. Characters begin to emerge serendipitously because their email signatures break my archive parser. I find myself Googling them. “Who is that person?”

One email I found while debugging stood out because it was written, evidently, by a woman. Given the current press on diversity in tech, I thought it was an interesting example from 2001:

From sag at hydrosphere.com Thu Nov 29 15:21:04 2001
From: sag at hydrosphere.com (Sue Giller)
Date: Thu Nov 29 15:21:04 2001
Subject: [Numpy-discussion] Re: Using Reduce with Multi-dimensional Masked array
In-Reply-To: <000201c17917$ac5efec0$3d01a8c0@plstn1.sfba.home.com>
References: <20011129174809062.AAA210@mail.climatedata.com@SUEW2000>
Message-ID: <20011129232011546.AAA269@mail.climatedata.com@SUEW2000>

Paul,

Well, you’re right. I did misunderstand your reply, as well as what
the various functions were supposed to do. I was mis-using the
sum, minimum, maximum as tho they were MA..reduce, and
my test case didn’t point out the difference. I should always have
been doing the .reduce version.

I apologize for this!

I found a section on page 45 of the Numerical Python text (PDF
form, July 13, 2001) that defines sum as
‘The sum function is a synonym for the reduce method of the add
ufunc. It returns the sum of all the elements in the sequence given
along the specified axis (first axis by default).’

This is where I would expect to see a caveat about it not retaining
any mask-edness.

I was misussing the MA.minimum and MA.maximum as tho they
were .reduce version. My bad.

The MA.average does produce a masked array, but it has changed
the ‘missing value’ to fill_value=[ 1.00000002e+020,]). I do find this
a bit odd, since the other reductions didn’t change the fill value.

Anyway, I can now get the stats I want in a format I want, and I
understand better the various functions for array/masked array.

Thanks for the comments/input.

sue

I am trying to approach this project as a quantitative scientist. But the process of developing the software for analysis is putting me in conversation not just with the laptop I run the software on, but also with the data. The data is a quantified representation–I count the number of lines, even the number of characters in a line as I construct the regular expression needed to parse the headers properly–but it represents a conversation in the past. As I write the software, I consult documentation written through a process not unlike the one I am examining, as well as Stack Overflow posts written by others who have tried to perform similar tasks. And now I am writing a blog post about this work. I will tweet a link to this out to my followers; I know some people from the Scientific Python community that I am studying follow me on Twitter. Will one of them catch wind of this post? What will they think of it?
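
For a sense of what that header parsing involves, here is a minimal sketch run against the headers of the email quoted above. It is not the parser from my actual code: the regular expressions and the names `parse_headers` and `parse_sender` are assumptions made for illustration, and it ignores complications such as headers that continue onto a second line.

```python
import re

# "Name: value" header lines. The archive's separator line
# ("From sag at hydrosphere.com Thu Nov 29 ...") has no colon after "From",
# so it does not match.
HEADER_RE = re.compile(r"^([A-Za-z-]+): (.*)$")
# Archive-style obfuscated sender: "user at domain (Display Name)".
SENDER_RE = re.compile(r"^(\S+) at (\S+) \((.*)\)$")

SAMPLE = """\
From sag at hydrosphere.com Thu Nov 29 15:21:04 2001
From: sag at hydrosphere.com (Sue Giller)
Date: Thu Nov 29 15:21:04 2001
Subject: [Numpy-discussion] Re: Using Reduce with Multi-dimensional Masked array
Message-ID: <20011129232011546.AAA269@mail.climatedata.com@SUEW2000>
"""


def parse_headers(text):
    """Collect 'Name: value' pairs from the header block of one message."""
    headers = {}
    for line in text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            headers[match.group(1)] = match.group(2)
    return headers


def parse_sender(value):
    """Split an obfuscated From value into (address, display name)."""
    match = SENDER_RE.match(value)
    if match is None:
        return None
    user, domain, name = match.groups()
    return f"{user}@{domain}", name


headers = parse_headers(SAMPLE)
print(parse_sender(headers["From"]))
print(headers["Subject"])
```

Anything that does not fit the expected shape, such as an unusual signature or a stray line starting with "From ", simply fails to match, which is how individual people start to surface while debugging.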

autocatalysis sustains autopoiesis

starting with a problem

The feedback I got on my dissertation prospectus draft when I presented it to my colleagues was that I didn’t start with a problem and then argue from there how my dissertation was going to be about a solution.

That was really great advice.

The problem of problem selection is a difficult one. “What is a problem?” is a question that basically nobody asks. Lots of significant philosophical traditions maintain that it’s the perception of problems as problems that is the problem. “Just chill out,” say Great Philosophical Traditions. This does not help one orient one’s research dissertation.

A lot of research is motivated by interest in particular problems like an engineering challenge or curing cancer. I’ve somehow managed to never acquire the kind of expertise that would allow me to address any of these specific useful problems directly. My mistake.

I’m a social scientist. There are a lot of social problems, right? Of course. However, there’s a problem here: identifying anything as a problem in the social domain immediately implicates politics.

Are there apolitical social problems? I think I’ve found some. I had a great conversation last week with Anna Salamon about Global Catastrophic Risks. Those sound terrible! It echoes the work I used to do in support of Disaster Risk Reduction, except that there is more acknowledgment in the GCR space that some of the big risks are man-made.

So there’s a problem: arguably research into the solutions to these problems is good. On the other hand, that research is complicated by the political entanglement of the researchers, especially in the university setting. It took some convincing, but OK, those politics are necessarily part of the equation. Put another way, if there wasn’t the political complexity, then the hard problems wouldn’t be such hard problems. The hard problems are hard partly because they are so political. (This difference in emphasis is not meant to preclude other reasons why these problems are hard; for example, because people aren’t smart or motivated enough.)

Political complexity gets in the way of our solving hard problems efficiently, because these problems require collaboration across political lines and because the inherent politics of language choice and framing create complexity that is orthogonal to the problem solution (is it?). Given that, infrastructural solutions that manage that political complexity can be helpful.

(Counterclaim: the political complexity is not illogical complexity; rather, scientific logic is partly political logic. We live in the best of all possible worlds. Just chill out. This is an empirical claim.)

The promise of computational methods to interdisciplinary collaboration is that they allow for more efficient distribution of cognitive labor across the system of investigators. Data science methodologists can build tools for investigation that work cross-disciplinarily, and the interaction between these tools can follow an apolitical logic in a way that discursive science cannot. Teleologically, we get an Internet of Scientific Things, an autonomous scientific apparatus, and draw your own eschatological conclusions.

An interesting consequence of algorithmically mediated communication is that you don’t actually need consensus to coordinate collective action. I suppose this is an argument Hayekians etc. have been making for a long time. However, the political maintenance of the system that ensures the appropriate incentive structures is itself prone to being hacked, and herein lies the problem. That and the insufficiency of the total neurological market apparatus (in Hayek’s vision) to do anything like internalize the externalities of e.g. climate change, while the Bitcoin servers burn and burn and burn.

Data Science: It Gets the Truth!

What follows is the first draft of the introduction to my upcoming book, Data Science: It Gets the Truth! This book will be the popularized version of my dissertation, based on my experiences at the School of Information and UC Berkeley. I’m really curious to know what you think!

There are two kinds of scientists in the world: the truth haters, and the truth getters.

You can tell who is a truth hater by asking them: “With your work, are you trying to find something that’s true?”

A truth hater will tell you that there is no such thing as truth, or that the idea of truth is a problematic bourgeois masculinist social construct, or that truth is relative and so no, not exactly, they probably don’t mean the same thing as you do when you say ‘truth’.

Obviously, these people hate the truth. Hence, “truth haters.”

Then there are the truth getters. You ask a truth getter whether they are trying to discover the truth, and they will say “Hell yeah!” Or, more simply, “yes, that is correct.”

Truth getters love the truth. The truth is great; it’s the point of science. They get that. Hence, “truth getters.”

We are at an amazing, unique time in history. Here, at the dawn of the 21st century, we have very powerful computers and extraordinary networks of communication like never before. This means science is going through some unprecedented changes.

One of those changes is that scientists are realizing that they’ve been fighting about nothing for a long time. Scientists used to think they had to be different from each other in order to study different things. But now, we know that there is only one good way to study anything, and that is machine learning. Soon, all scientists are going to be data scientists, because science is discovering that all things can be represented as data and studied with machine learning.

Well, not all scientists. I should be more precise. I was just talking about the truth getters. Because machine learning is how we can discover the truth about everything, and truth getters get that.

Truth haters, on the other hand, hate how good machine learning is at discovering the truth about everything. Silly truth haters! One day, they will get their funding cut.

In this book, Data Science: It Gets the Truth!, you will learn how you too can be a data scientist and learn the truth about things. Get it? Great! Let’s go!