Digifesto

Category: open source software

fancier: scripts to help manage your Twitter account, in Python

My Twitter account has been a source of great entertainment, distraction, and abuse over the years. It is time that I brought it under control. I am too proud and too cheap to buy a professional grade Twitter account manager, and so I’ve begun developing a new suite of tools in Python that will perform the necessary tasks for me.

I’ve decided to name these tools fancier, because the art and science of breeding domestic pigeons is called pigeon fancying. Go figure.

The project is now available on GitHub, and of course I welcome any collaboration or feedback!

At the time of this writing, the project has only one feature: it searches through the accounts you follow on Twitter, finds those that have been inactive for 90 days and also don’t follow you back, and then unfollows them.
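The selection logic described above can be sketched in a few lines. This is not the code in the fancier repository: the `Account` record, its field names, and the `unfollow_candidates` function are hypothetical stand-ins, and a real script would fill the records from the Twitter API rather than from hand-written data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Account:
    # Hypothetical record; a real client would populate this from the Twitter API.
    handle: str
    last_active: datetime
    follows_me: bool


def unfollow_candidates(following, now, max_idle_days=90):
    """Return accounts idle for more than max_idle_days that don't follow back."""
    cutoff = now - timedelta(days=max_idle_days)
    return [a for a in following if a.last_active < cutoff and not a.follows_me]


# Hand-written sample data so the sketch runs without network access.
now = datetime(2015, 6, 1, tzinfo=timezone.utc)
following = [
    Account("dormant_stranger", now - timedelta(days=200), False),
    Account("dormant_friend", now - timedelta(days=200), True),
    Account("active_stranger", now - timedelta(days=3), False),
]
candidates = [a.handle for a in unfollow_candidates(following, now)]
print(candidates)
```

Only the account that is both dormant and not following back is selected; the actual unfollowing would be a separate API call per candidate.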

This is a common thing to try to do when grooming and/or professionalizing your Twitter account. I saw a script for this shared in a pastebin years ago, but couldn’t find it again. There are some on-line services that will help you do this, but they charge a fee to do it at scale. Ergo: the open source solution. Voila!

moved BigBang core repository to DATACTIVE organization

I made a small change this evening which I feel really, really good about.

I transferred the BigBang project from my personal GitHub account to the datactive organization.

I’m very grateful for DATACTIVE‘s interest in BigBang and am excited to turn over the project infrastructure to their stewardship.

trust issues and the order of law and technology cf @FrankPasquale

I’ve cut to the last chapter of Pasquale’s The Black Box Society, “Towards an Intelligible Society.” I’m interested in where the argument goes. Now that I’ve gotten through it, I see that the penultimate chapter has Pasquale’s specific policy recommendations. But as I’m not just reading for policy and framing but also for tone and underlying theoretical commitments, I think it’s worth recording some first impressions before doubling back.

These are some points Pasquale makes in the concluding chapter that I wholeheartedly agree with:

  • A universal basic income would allow more people to engage in high risk activities such as the arts and entrepreneurship and more generally would be great for most people.
  • There should be publicly funded options for finance, search, and information services. A great way to provide these would be to fund the development of open source algorithms for finance and search. I’ve been into this idea for so long and it’s great to see a prominent scholar like Pasquale come to its defense.
  • Regulatory capture (or, as he elaborates following Charles Lindblom, “regulatory circularity”) is a problem. Revolving door participation in government and business makes government regulation an unreliable protector of the public interest.

There is quite a bit in the conclusion about the specifics of regulating the finance industry. There is an impressive amount of knowledge presented about this and I’ll admit much of it is over my head. I’ll probably have a better sense of it if I get to reading the chapter that is specifically about finance.

There are some things that I found bewildering or off-putting.

For example, there is a section on “Restoring Trust” that talks about how an important problem is that we don’t have enough trust in the reputation and search industries. His solution is to increase the penalties that the FTC and FCC can impose on Google and Facebook for, e.g., their privacy violations. The current penalties are too trivial to be an effective deterrent. But, Pasquale argues,

It is a broken enforcement model, and we have black boxes to thank for much of this. People can’t be outraged by what they can’t understand. And without some public concern about the trivial level of penalties for lawbreaking here, there are no consequences for the politicians ultimately responsible for them.

The logic here is a little mad. Pasquale is saying that people are not outraged enough by search and reputation companies to demand harsher penalties, and this is a problem because people don’t trust these companies enough. The solution is to convince people to trust these companies less–get outraged by them–in order to get them to punish the companies more.

This is a bit troubling, but makes sense based on Pasquale’s theory of regulatory circularity, which turns politics into a tug-of-war between interests:

The dynamic of circularity teaches us that there is no stable static equilibrium to be achieved between regulators and regulated. The government is either pushing industry to realize some public values in its activities (say, by respecting privacy or investing in sustainable growth), or industry is pushing regulators to promote its own interests.

There’s a simplicity to this that I distrust. It suggests, for one, that there are no public pressures on industry besides the government, such as consumers’ buying power. A lot of Pasquale’s arguments depend on the monopolistic power of certain tech giants. But while network effects are strong, it’s not clear whether this is such a problem that consumers have no market buy-in. In many cases tech giants compete with each other even when it looks like they aren’t. For example, a great many people have both Facebook and Gmail accounts. Since there is somewhat redundant functionality in both, consumers can rather seamlessly allocate their time, which is tied to advertising revenue, according to which service they feel better serves them, or which is best reputationally. So social media (which is a bit like a combination of a search and reputation service) is not a monopoly. Similarly, if people have multiple search options available to them because, say, they both have Siri on their smart phone and can search Google directly, then that provides an alternative search market.

Meanwhile, government officials are also often self-interested. If there is a road to hell for industry, which is to provide free web services to people to attain massive scale, then abuse economic lock-in to extract value from customers, then lobby for further rent-seeking, there is a similar road to hell in government. It starts with populist demagoguery, leads to stable government appointment, and then leverages that power for rents in status.

So, power is power. Everybody tries to get power. The question is what you do once you get it, right?

Perhaps I’m reading between the lines too much. Of course, my evaluation of the book should depend most on the concrete policy recommendations which I haven’t gotten to yet. But I find it unfortunate that what seems to be a lot of perfectly sound history and policy analysis is wrapped in a politics of professional identity that I find very counterproductive. The last paragraph of the book is:

Black box services are often wondrous to behold, but our black-box society has become dangerously unstable, unfair, and unproductive. Neither New York quants nor California engineers can deliver a sound economy or a secure society. Those are the tasks of a citizenry, which can perform its job only as well as it understands the stakes.

Implicitly, New York quants and California engineers are not citizens, to Pasquale, a law professor based in Maryland. Do all real citizens live around Washington, DC? Are they all lawyers? If the government were to start providing public information services, either by hosting them themselves or by funding open source alternatives, would he want everyone designing these open algorithms (who would be quants or engineers, I presume) to move to DC? Do citizens really need to understand the stakes in order to get this to happen? When have citizens, en masse, understood anything, really?

Based on what I’ve read so far, The Black Box Society is an expression of a lack of trust in the social and economic power associated with quantification and computing that took off in the past few dot-com booms. Since expressions of lack of trust for these industries are nothing new, one might wonder (under the influence of Foucault) how the quantified order and the critique of the quantified order manage to coexist and recreate a system of discipline that includes both and maintains its power as a complex of superficially agonistic forces. I give sincere credit to Pasquale for advocating both serious income redistribution and public investment in open technology as ways of disrupting that order. But when he falls into the trap of engendering partisan distrust, he loses my confidence.

Innovation, automation, and inequality

What is the economic relationship between innovation, automation, and inequality?

This is a recurring topic in the discussion of technology and the economy. It comes up when people are worried about a new innovation (such as data science) that threatens their livelihood. It also comes up in discussions of inequality, such as in Piketty’s Capital in the Twenty-First Century.

For technological pessimists, innovation implies automation, and automation suggests the transfer of surplus from many service providers to a technological monopolist providing a substitute service at greater scale (scale being one of the primary benefits of automation).

For Piketty, it’s the spread of innovation in the sense of the education of skilled labor that is the primary force that counteracts capitalism’s tendency towards inequality and (he suggests) the implied instability. For all the importance Piketty places on this process, he hardly treats it at all in his book.

Whether or not you buy Piketty’s analysis, the preceding discussion indicates how innovation can cut both for and against inequality. When there is innovation in capital goods, this increases inequality. When there is innovation in a kind of skilled technique that can be broadly taught, that decreases inequality by increasing the relative value of labor to capital (which is generally much more concentrated than labor).

I’m a software engineer in the Bay Area and realize that it’s easy to overestimate the importance of software in the economy at large. This is apparently an easy mistake for other people to make as well. Matthew Rognlie, the economist who has been declared Piketty’s latest and greatest challenger, thinks that software is an important new form of capital and draws certain conclusions based on this.

I agree that software is an important form of capital–exactly how important I cannot yet say. One reason why software is an especially interesting kind of capital is that it exists ambiguously as both a capital good and as a skilled technique. While naively one can consider software as an artifact in isolation from its social environment, in the dynamic information economy a piece of software is only as good as the sociotechnical system in which it is embedded. Hence, its value depends both on its affordances as a capital good and its role as an extension of labor technique. It is perhaps easiest to see the latter aspect of software by considering it a form of extended cognition on the part of the software developer. The human capital required to understand, reproduce, and maintain the software is attained by, for example, studying its source code and documentation.

All software is a form of innovation. All software automates something. There has been a lot written about the potential effects of software on inequality through its function in decision-making (for example: Solon Barocas, Andrew D. Selbst, “Big Data’s Disparate Impact” (link).) Much less has been said about the effects of software on inequality through its effects on industrial organization and the labor market. After having my antennas up for this for many reasons, I’ve come to a conclusion about why: it’s because the intersection between those who are concerned about inequality in society and those that can identify well enough with software engineers and other skilled laborers is quite small. As a result there is not a ready audience for this kind of analysis.

However unreceptive society may be to it, I think it’s still worth making the point that we already have a very common and robust compromise in the technology industry that recognizes software’s dual role as a capital good and labor technique. This compromise is open source software. Open source software can exist both as an unalienated extension of its developer’s cognition and as a capital good playing a role in a production process. Human capital tied to the software is liquid between the software’s users. Surplus due to open software innovations goes first to the software users, then second to the ecosystem of developers who sell services around it. Contrast this with the proprietary case, where surplus goes mainly to a singular entity that owns and sells the software rights as a monopolist. The former case is vastly better if one considers societal equality a positive outcome.

This has straightforward policy implications. As an alternative to Piketty’s proposed tax on capital, any policies that encourage open source software are ones that combat societal inequality. This includes procurement policies, which need not increase government spending. On the contrary, if governments procure primarily open software, that should lead to savings over time as their investment leads to a more competitive market for services. Equivalently, R&D funding to open science institutions results in more income equality than equivalent funding provided to private companies.

Hirschman, Nigerian railroads, and poor open source user interfaces

Hirschman says he got the idea for Exit, Voice, and Loyalty when studying the failure of the Nigerian railroad system to improve quality despite the availability of trucking as a substitute for long-range shipping. Conventional wisdom among economists at the time was that the quality of a good would suffer when it was provisioned by a monopoly. But why would a business that faced healthy competition not undergo the management changes needed to improve quality?

Hirschman’s answer is that because the trucking option was so readily available as an alternative, there wasn’t a need for consumers to develop their capacity for voice. The railroads weren’t hearing the complaints about their service, they were just seeing a decline in use as their customers exited. Meanwhile, because it was a monopoly, loss in revenue wasn’t “of utmost gravity” to the railway managers either.

The upshot of this is that it’s only when customers are locked in that voice plays a critical role in the recuperation mechanism.

This is interesting for me because I’m interested in the role of lock-in in software development. In particular, one argument made in favor of open source software is that because it is not technology held by a single firm, users of the software are not locked in. Their switching costs are reduced, making the market more liquid and, in theory, favorable.

You can contrast this with proprietary enterprise software, where vendor lock-in is a principal part of the business model as this establishes the “installed base” and customer support armies are necessary for managing disgruntled customer voice. Or, in the case of social media such as Facebook, network effects create a kind of perceived consumer lock-in and consumer voice gets articulated by everybody from Twitter activists to journalists to high-profile academics.

As much as it pains me to admit it, this is one good explanation for why the user interfaces of a lot of open source software projects are so bad, specifically if you combine this mechanism with the idea that user-centered design is important for user interfaces. Open source projects generally make it easy to complain about the software. If they know what they are doing at all, they make it clear how to engage the developers as a user. There is a kind of rumor out there that open source developers are unfriendly towards users, and this is perhaps true when users are used to the kind of customer support that’s available on a product for which there is customer lock-in. It’s precisely this difference between exit culture and voice culture, driven by the fundamental economics of the industry, that creates this perception. Enterprise open source business models (I’m thinking about models like the Pentaho ‘beekeeper’) theoretically provide a corrective to this by being an intermediary between consumer voice and developer exit.

A testable hypothesis is whether and to what extent a software project’s responsiveness to tickets scales with the number of downstream dependent projects. In software development, technical architecture is a reasonable proxy for industrial organization. A widely used project has network effects that increase switching costs for its downstream users. How do exit and voice work in this context?
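One way such a test might be set up, as a rough sketch: take the median time to first reply on a project’s tickets as its “responsiveness” and rank-correlate it with the number of downstream dependents. Everything here is an assumption for illustration: the project names and numbers are made up, the choice of median response time as the measure is mine, and the simple Spearman implementation ignores ties.

```python
from statistics import median


def ranks(values):
    """Rank values from 1..n (ties are not handled; fine for a sketch)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation for equal-length lists without ties."""
    n = len(xs)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Made-up data: project -> (downstream dependents, hours to first reply per ticket)
projects = {
    "alpha": (3, [50, 72, 96]),
    "beta": (40, [20, 30, 26]),
    "gamma": (500, [2, 5, 8]),
    "delta": (12, [40, 48, 35]),
}
dependents = [d for d, _ in projects.values()]
response = [median(hours) for _, hours in projects.values()]
rho = spearman(dependents, response)
print(rho)
```

A strongly negative value would mean that projects with more dependents answer faster. The fabricated numbers above were chosen to be perfectly monotone and say nothing about real projects.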

The node.js fork — something new to think about

For Classics we are reading Albert Hirschman’s Exit, Voice, and Loyalty. Oddly, though normally I hear about ‘voice’ as an action from within an organization, the first few chapters of the book (including the introduction of the Voice concept itself) are preoccupied with elaborations on the neoclassical market mechanism. Not what I expected.

I’m looking for interesting research use cases for BigBang, which is about analyzing the sociotechnical dynamics of collaboration. I’m building it to better understand open source software development communities, primarily. This is because I want to create a harmonious sociotechnical superintelligence to take over the world.

For a while I’ve been interested in Hadoop’s interesting case of being one software project with two companies working together to build it. This is reminiscent (for me) of when we started GeoExt at OpenGeo and Camp2Camp. The economics of shared capital are fascinating and there are interesting questions about how human resources get organized in that sort of situation. In my experience, a tension arises between the needs of firms to differentiate their products and make good on their contracts and the needs of the developer community whose collective value is ultimately tied to the robustness of their technology.

Unfortunately, building out BigBang to integrate with various email, version control, and issue tracking backends is a lot of work and there’s only one of me right now to build the infrastructure, do the research, and train new collaborators (who are starting to do some awesome work, so this is paying off). While integrating with Apache’s infrastructure would have been a smart first move, instead I chose to focus on Mailman archives and git repositories. Google Groups and whatever Apache is using for their email lists do not publish their archives in .mbox format, which is a pain for me. But luckily Google Takeout does export data from folks’ on-line inbox in .mbox format. This is great for BigBang because it means we can investigate email data from any project for which we know an insider willing to share their records.
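Part of what makes .mbox convenient is that Python’s standard library reads it directly. The following is a minimal sketch using the `mailbox` module, not BigBang’s actual ingestion code; the `mbox_headers` function and the two sample messages are made up so that it runs without a real archive.

```python
import mailbox
import os
import tempfile


def mbox_headers(path):
    """Return (from, date, subject) for each message in an .mbox file."""
    box = mailbox.mbox(path)
    try:
        return [(m["From"], m["Date"], m["Subject"]) for m in box]
    finally:
        box.close()


# Tiny hand-written archive standing in for a real export.
sample = (
    "From alice@example.org Mon Jan  5 10:00:00 2015\n"
    "From: alice@example.org\n"
    "Date: Mon, 05 Jan 2015 10:00:00 +0000\n"
    "Subject: release plans\n"
    "\n"
    "Shall we tag 1.0 this week?\n"
    "\n"
    "From bob@example.org Mon Jan  5 11:30:00 2015\n"
    "From: bob@example.org\n"
    "Date: Mon, 05 Jan 2015 11:30:00 +0000\n"
    "Subject: Re: release plans\n"
    "\n"
    "Yes, after the docs are fixed.\n"
)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "list.mbox")
    with open(path, "w") as f:
        f.write(sample)
    rows = list(mbox_headers(path))
print(rows)
```

From a list of tuples like this it is a short step to a table of senders, dates, and threads.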

Does a research ethics issue arise when you start working with email that is openly archived in a difficult format, then exported from somebody’s private email? Technically you get header information that wasn’t open before–perhaps it was ‘private’. But arguably this header information isn’t personal information. I think I’m still in the clear. Plus, IRB will be irrelevant when the robots take over.

All of this is a long way of getting around to talking about a new thing I’m wondering about, the Node.js fork. It’s interesting to think about open source software forks in light of Hirschman’s concepts of Exit and Voice since so much of the activity of open source development is open, virtual communication. While you might at first think a software fork is definitely a kind of Exit, it sounds like IO.js was perhaps a friendly fork of just somebody who wanted to hack around. In theory, code can be shared between forks–in fact this was the principle that GitHub’s forking system was founded on. So there are open questions (to me, who isn’t involved in the Node.js community at all and is just now beginning to wonder about it) along the lines of to what extent a fork is a real event in the history of the project, vs. to what extent it’s mythological, vs. to what extent it’s a reification of something that was already implicit in the project’s sociotechnical structure. There are probably other great questions here as well.

A friend on the inside tells me all the action on this happened (is happening?) on the GitHub issue tracker, which is definitely data we want to get BigBang connected with. Blissfully, there appear to be well supported Python libraries for working with the GitHub API. I expect the first big hurdle we hit here will be rate limiting.
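For the rate limiting hurdle, GitHub’s REST API reports the remaining quota and the reset time in the `X-RateLimit-Remaining` and `X-RateLimit-Reset` response headers. A hedged sketch of how a collector might decide how long to pause follows; the `seconds_to_wait` helper is hypothetical, and it assumes the headers arrive as a plain dict with exactly these key spellings.

```python
import time


def seconds_to_wait(headers, now=None):
    """How long to pause before the next request, given response headers.

    Assumes a plain dict with these exact key spellings; HTTP libraries
    usually offer case-insensitive header access instead.
    """
    now = time.time() if now is None else now
    remaining = int(headers.get("X-RateLimit-Remaining", 1))
    if remaining > 0:
        return 0.0
    # X-RateLimit-Reset is a UTC epoch timestamp in seconds.
    reset_at = int(headers.get("X-RateLimit-Reset", 0))
    return max(0.0, float(reset_at - now))


exhausted = {"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1000"}
healthy = {"X-RateLimit-Remaining": "42", "X-RateLimit-Reset": "1000"}
print(seconds_to_wait(exhausted, now=940))
print(seconds_to_wait(healthy, now=940))
```

A crawler would call this after each response and sleep for the returned number of seconds before continuing.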

Though we haven’t been able to make integration work yet, I’m still hoping there’s some way we can work with MetricsGrimoire. They’ve been a super inviting community so far. But our software stacks and architecture are just different enough, and the layers we’ve built so far thin enough, that it’s hard to see how to do the merge. A major difference is that while MetricsGrimoire tools are built to provide application interfaces around a MySQL data backend, since BigBang is foremost about scientific analysis our whole data pipeline is built to get things into Pandas dataframes. Both projects are in Python. This too is a weird microcosm of the larger sociotechnical ecosystem of software production, of which the “open” side is only one (important) part.

Imre Lakatos and programming as dialectic

My dissertation is about the role of software in scholarly communication. Specifically, I’m interested in the way software code is itself a kind of scholarly communication, and how the informal communications around software production represent and constitute communities of scientists. I see science as a cognitive task accomplished by the sociotechnical system of science, including both scientists and their infrastructure. Looking particularly at scientists’ use of communications infrastructure such as email, issue trackers, and version control, I hope to study the mechanisms of the scientific process much like a neuroscientist studies the mechanisms of the mind by studying neural architecture and brainwave activity.

To get a grip on this problem I’ve been building BigBang, a tool for collecting data from open source projects and readying it for scientific analysis.

I have also been reading background literature to give my dissertation work theoretical heft and to procrastinate from coding. This is why I have been reading Imre Lakatos’ Proofs and Refutations (1976).

Proofs and Refutations is a brilliantly written book about the history of mathematical proof. In particular, it is an analysis of informal mathematics through an investigation of the letters written by mathematicians working on proofs about the Euler characteristic of polyhedra in the 18th and 19th centuries.

Whereas in the early 20th century, based on the work of Russell and Whitehead and others, formal logic was axiomatized, prior to this, mathematical argumentation had less formal grounding. As a result, mathematicians would argue not just substantively about the theorem they were trying to prove or disprove, but also about what constitutes a proof, a conjecture, or a theorem in the first place. Lakatos demonstrates this by condensing 200+ years of scholarly communication into a fictional, impassioned classroom dialog where characters representing mathematicians throughout history banter about polyhedra and proof techniques.

What’s fascinating is how convincingly Lakatos presents the progress of mathematical understanding as an example of dialectical logic. Though he doesn’t use the word “dialectical” as far as I’m aware, he tells the story of the informal logic of pre-Russellian mathematics through dialog. But this dialog is designed to capture the timeless logic behind what’s been said before. It takes the reader through the thought process of mathematical discovery in abbreviated form.

I’ve had conversations with serious historians and ethnographers of science who would object strongly to the idea of a history of a scientific discipline reflecting a “timeless logic”. Historians are apt to think that nothing is timeless. I’m inclined to think that the objectivity of logic persists over time much the same way that it persists over space and between subjects, even illogical ones, hence its power. These are perhaps theological questions.

What I’d like to argue (but am not sure how) is that the process of informal mathematics presented by Lakatos is strikingly similar to that used by software engineers. The process of selecting a conjecture, then of writing a proof (which for Lakatos is a logical argument whether or not it is sound or valid), then having it critiqued with counterexamples, which may either be global (counter to the original conjecture) or local (counter to a lemma), then modifying the proof, then perhaps starting from scratch based on a new insight… all this reads uncannily like the process of debugging source code.

The argument for this correspondence is strengthened by later work in theory of computation and complexity theory. I learned this theory so long ago I forget who to attribute it to, but much of the foundational work in computer science was the establishment of a correspondence between classes of formal logic and classes of programming languages. So in a sense it’s uncontroversial within computer science to consider programs to be proofs.
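The idea of programs as proofs is usually discussed under the name of the Curry-Howard correspondence, and a proof assistant such as Lean makes it concrete. The two definitions below are my own toy illustration, assuming Lean 4 syntax: the first is a proof that a conjunction can be reordered, the second is an ordinary function that swaps a pair, and they have essentially the same shape.

```lean
-- Read as logic: from a proof of A ∧ B, build a proof of B ∧ A.
example (A B : Prop) : A ∧ B → B ∧ A :=
  fun ⟨a, b⟩ => ⟨b, a⟩

-- Read as a program: swap the components of a pair.
def swapPair {α β : Type} : α × β → β × α :=
  fun (a, b) => (b, a)
```

In both cases the type states the claim and the function body is the evidence for it.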

As I write I am unsure whether I’m simply restating what’s obvious to computer scientists in an antiquated philosophical language (a danger I feel every time I read a book, lately) or if I’m capturing something that could be an interesting synthesis. But my point is this: that if programming language design and the construction of progressively more powerful software libraries is akin to the expanding of formal mathematical knowledge from axiomatic grounds, then the act of programming itself is much more like the informal mathematics of pre-Russellian mathematics, specifically in that it is unaxiomatic and proofs are in play without necessarily being sound. When we use a software system, we are depending necessarily on a system of imperfect proofs that we fix iteratively through discovered counterexamples (bugs).
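To make the analogy concrete, here is a small sketch that treats the naive Euler conjecture (V - E + F = 2 for every polyhedron) as a function under test and the solids as test cases. The function names are mine; the picture frame, with 16 vertices, 32 edges, and 16 faces, is one of the counterexamples discussed in Lakatos’ book.

```python
def euler_characteristic(vertices, edges, faces):
    """Compute V - E + F for a solid given its counts."""
    return vertices - edges + faces


def conjecture_holds(solid):
    """Naive conjecture: V - E + F = 2 for every polyhedron."""
    return euler_characteristic(*solid) == 2


# (vertices, edges, faces) for a few solids
solids = {
    "tetrahedron": (4, 6, 4),
    "cube": (8, 12, 6),
    "picture frame": (16, 32, 16),
}
counterexamples = [name for name, s in solids.items() if not conjecture_holds(s)]
print(counterexamples)
```

The picture frame plays the role of a failing test: one can either narrow the conjecture so that it no longer covers solids with a hole through them, or revise the proof, which is much the same choice a programmer faces between documenting a limitation and fixing the bug.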

Is it fair to say, then, that whereas the logic of software is formal, deductive logic, the logic of programming is dialectical logic?

Bear with me; let’s presume it is. That’s a foundational idea of my dissertation work. Proving or disproving it may or may not be out of scope of the dissertation itself, but it’s where it’s ultimately headed.

The question is whether it is possible to develop a formal understanding of dialectical logic through a scientific analysis of software collaboration (see a mathematical model of collective creativity). If this could be done, then we could build better software or protocols to assist this dialectical process.

technical work

Dipping into Julian Orr’s Talking about Machines, an ethnography of Xerox photocopier technicians, has set off some light bulbs for me.

First, there’s Orr’s story: Orr dropped out of college and got drafted, then worked as a technician in the military before returning to school. He paid the bills doing technical repair work, and found it convenient to do his dissertation on those doing photocopy repair.

Orr’s story reminds me of my grandfather and great-uncle, both of whom were technicians–radio operators–during WWII. Their civilian careers were as carpenters, building houses.

My own dissertation research is motivated by my work background as an open source engineer, and my own desire to maintain and improve my technical chops. I’d like to learn to be a data scientist; I’m also studying data scientists at work.

Further fascinating was Orr’s discussion of the Xerox technicians’ identity as technicians as opposed to customers:

The distinction between technician and customer is a critical division of this population, but for technicians at work, all nontechnicians are in some category of other, including the corporation that employs the technicians, which is seen as alien, distant, and only sometimes an ally.

It’s interesting to read about this distinction between technicians and others in the context of Xerox photocopiers when I’ve been so affected lately by the distinction between tech folk and others and data scientists and others. This distinction between those who do technical work and those who they serve is a deep historical one that transcends the contemporary and over-computed world.

I recall my earlier work experience. I was a decent engineer and engineering project manager. I was a horrible account manager. My customer service skills were abysmal, because I did not empathize with the client. The open source context contributes to this attitude, because it makes a different set of demands on its users than consumer technology does. One gets assistance with consumer grade technology by hiring a technician who treats you as a customer. You get assistance with open source technology by joining the community of practice as a technician. Commercial open source software, according to the Pentaho beekeeper model, is about providing, at cost, that customer support.

I’ve been thinking about customer service and reflecting on my failures at it a lot lately. It keeps coming up. Mary Gray’s piece, When Science, Customer Service, and Human Subjects Research Collide explicitly makes the connection between commercial data science at Facebook and customer service. The ugly dispute between Gratipay (formerly Gittip) and Shanley Kane was, I realized after the fact, a similar crisis between the expectations of customers/customer service people and the expectations of open source communities. When “free” (gratis) web services display a similar disregard for their users as open source communities do, it’s harder to justify in the same way that FOSS does. But there are similar tensions, perhaps. It’s hard for technicians to empathize with non-technicians about their technical problems, because their lived experience is so different.

It’s alarming how much is being hinged on the professional distinction between technical worker and non-technical worker. The intra-technology industry debates are thick with confusions along these lines. What about marketing people in the tech context? Sales? Are the “tech folks” responsible for distributional justice today? Are they in the throes of an ideology? I was reading a paper the other day suggesting that software engineers should be held ethically accountable for the implicit moral implications of their algorithms. Specifically the engineers; for some reason not the designers or product managers or corporate shareholders, who were not mentioned. An interesting proposal.

Meanwhile, at the D-Lab, where I work, I’m in the process of navigating my relationship between two teams, the Technical Team and the Services Team. I have been on the Technical Team in the past. Our work has been to stay on top of and assist people with data science software and infrastructure. Early on, we abolished regular meetings as a waste of time. Naturally, there was a suspicion expressed to me at one point that we were unaccountable and didn’t do as much work as others on the Services Team, which dealt directly with the people-facing component of the lab–scheduling workshops, managing the undergraduate work-study staff. Sitting in on Services meetings for the first time this semester, I’ve been struck by how much work the other team does. By and large, it’s information work: calendaring, scheduling, entering into spreadsheets, documenting processes in case of turnover, sending emails out, responding to emails. All important work.

This is exactly the work that information technicians want to automate away. If there is a way to reduce the amount of calendaring and entering into spreadsheets, programmers will find a way. The whole purpose of computer science is to automate tasks that would otherwise be tedious.

Eric S. Raymond’s classic (2001) essay How to Become a Hacker characterizes the Hacker Attitude, in five points:

  1. The world is full of fascinating problems waiting to be solved.
  2. No problem should ever have to be solved twice.
  3. Boredom and drudgery are evil.
  4. Freedom is good.
  5. Attitude is no substitute for competence.

There is no better articulation of the “ideology” of “tech folks” than this, in my opinion, yet Raymond is not used much as a source for understanding the idiosyncrasies of the technical industry today. Of course, not all “hackers” are well characterized by Raymond (I’m reminded of Coleman’s injunction to speak of “cultures of hacking”) and not all software engineers are hackers (I’m sure my sister, a software engineer, is not a hacker. For example, based on my conversations with her, it’s clear that she does not see all the unsolved problems with the world to be intrinsically fascinating. Rather, she finds problems that pertain to some human interest, like children’s education, to be most motivating. I have no doubt that she is a much better software engineer than I am–she has worked full time at it for many years and now works for a top tech company. As somebody closer to the Raymond Hacker ethic, I recognize that my own attitude is no substitute for that competence, and hold my sister’s abilities in very high esteem.)

As usual, I appear to have forgotten where I was going with this.
