Digifesto

population traits, culture traits, and racial projects: a methods challenge #ica18

In a recent paper I’ve been working on with Mark Hannah, which he’s presenting this week at the International Communication Association conference, we take on the question of whether and how “big data” can be used to study the culture of a population.

By “big data” we meant, roughly, large social media data sets. The pitfalls of using this sort of data for any general study of a population are perhaps best articulated by Tufekci (2014). In short: studies based on social media data are often sampling on the dependent variable because they only consider the people representing themselves on social media, even though these people are only a small portion of the population. To put it another way, the sample suffers from the 1% rule of Internet cultures: for any on-line community, only 1% create content, 10% interact with the content somehow, and the rest lurk. The behavior and attitudes of the lurkers, in addition to any field effects in the “background” of the data (latent variables in the social field of production), are all out of band and so opaque to the analyst.

By “the culture of a population”, we meant something specific: the distribution of values, beliefs, dispositions, and tastes of a particular group of people. The best source we found on this was Marsden and Swingle (1994), an article from a time before the Internet had started to transform academia. Then and perhaps now, the best way to study the distribution of culture across a broad population was a survey. The idea is that you sample the population according to some responsible statistics, you ask them some questions about their values, beliefs, dispositions, and tastes, and you report the results. Voilà!

(Given the methodological divergence here, the fact that many people, especially ‘people on the Internet’, now view culture mainly through the lens of other people on the Internet is obviously a huge problem. Most people are not in this sample, and yet we pretend that it is representative because it’s easily available for analysis. Hence, our concept of culture (or cultures) is screwy, reflecting much more than is warranted whatever sorts of cultures are flourishing in a pseudonymous, bot-ridden, commercial attention economy.)

Can we productively combine social media data with survey methods to get a better method for studying the culture of a population? We think so. We propose the following as a general methodological framework:

(1) Figure out the population of interest by their stable, independent ‘population traits’ and look for their activity on social media. Sample from this.

(2) Do exploratory data analysis to inductively get content themes and observations about social structure from this data.

(3) Use the inductively generated themes from step (2) to design a survey addressing cultural traits of the population (beliefs, values, dispositions, tastes).

(4) Conduct a stratified sample specifically across social media creators, synthesizers (e.g. people who like, retweet, and respond), and the general population and/or known audience, and distribute the survey.

(5) Extrapolate the results to general conclusions.

(6) Validate the conclusions with other data or note discrepancies for future iterations.
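
To make steps (4) and (5) a bit more concrete, here is a minimal sketch of one way the stratified sample and the extrapolation could work. It is not the procedure from the paper. It assumes each person has already been labeled with a stratum, it borrows the 1% rule figures from above as stand-in population shares, and every name in it (POPULATION_SHARES, stratified_sample, poststratified_mean) is made up for illustration. The extrapolation shown is plain post-stratification: each stratum's average answer is weighted by that stratum's share of the population, so that the loud 1% do not dominate the estimate.

```python
import random

# Hypothetical population shares per stratum, loosely following the
# "1% rule" quoted earlier (1% create, 10% interact, the rest lurk).
# Real shares would have to be estimated for the population of interest.
POPULATION_SHARES = {"creator": 0.01, "synthesizer": 0.10, "lurker": 0.89}


def stratified_sample(population, per_stratum, rng):
    """Step (4): draw up to `per_stratum` respondents from each stratum."""
    by_stratum = {}
    for person in population:
        by_stratum.setdefault(person["stratum"], []).append(person)
    return {
        stratum: rng.sample(members, min(per_stratum, len(members)))
        for stratum, members in by_stratum.items()
    }


def poststratified_mean(sample, shares, item):
    """Step (5): weight each stratum's mean answer by its population share."""
    total = 0.0
    for stratum, share in shares.items():
        answers = [person[item] for person in sample[stratum]]
        total += share * (sum(answers) / len(answers))
    return total


def naive_mean(sample, item):
    """Unweighted mean over all respondents, for comparison."""
    answers = [p[item] for members in sample.values() for p in members]
    return sum(answers) / len(answers)
```

If creators and synthesizers all agree with some survey item and lurkers all disagree, the naive mean over an equal-sized sample from each stratum will suggest that most people agree, while the weighted estimate will stay close to the small combined share of creators and synthesizers. That gap is the sampling-on-the-dependent-variable problem in miniature.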

I feel pretty good about this framework as a step forward, except that in the interest of time we had to sidestep what is maybe the most interesting question raised by it, which is: what’s the difference between a population trait and a cultural trait?

Here’s what we were thinking:

Population trait            | Cultural trait
Location                    | Twitter use (creator, synthesizer, lurker, none)
Age                         | Political views: left, right, center
Permanent unique identifier | Attitude towards media
                            | Preferred news source
                            | Pepsi or Coke?

One thing to note: we decided that traits about media production and consumption were a subtype of cultural traits. I.e., if you use Twitter, that’s a particular cultural trait that may be correlated with other cultural traits. That makes the problem of sampling on the dependent variable explicit.

But the other thing to note is that there are certain categories that we did not put on this list. Which ones? Gender, race, etc. Why not? Because choosing whether these are population traits or cultural traits opens a big bag of worms that is the subject of active political contest. That discussion was well beyond the scope of the paper!

The dicey thing about this kind of research is that we explicitly designed it to try to avoid investigator bias. That includes the bias of seeing the world through social categories that we might otherwise naturalize or reify. Naturally, though, if we were to actually conduct this method on a sample, such as, I dunno, a sample of Twitter-using academics, we would very quickly discover that certain social categories (men, women, person of color, etc.) were themes people talked about and so would be included as survey items under cultural traits.

That is not terrible. It’s probably safer to do that than to treat them like immutable, independent properties of a person. It does seem to leave something out though. For example, say one were to identify race as a cultural trait and then ask people to identify with a race. Then one takes the results, does a factor analysis, and discovers a factor that combines a racial affinity with media preferences and participation rates. One then identifies the prevalence of this factor in a certain region with a certain age demographic. Someone might object that this result represents a racial category as entailing certain cultural categories, and leaves out the cultural minority within a racial demographic that wants more representation.

This is upsetting to some people when, for example, Facebook does this and allows advertisers to target things based on “ethnic affinity”. Presumably, Facebook is doing just this kind of factor analysis when they identify these categories.

Arguably, that’s not what this sort of science is for. But the fact that the objection seems pertinent is an informative intuition in its own right.

Maybe the right framework for understanding why this is problematic is Omi and Winant’s racial formation theory (2014). I’ve only recently started getting into this theory, at the recommendation of Bruce Haynes, who I look up to as an authority on race in America. According to racial projects theory, racial categories are stable because they include both representations of groups of people as having certain qualities and social structures controlling the distribution of resources. So the white/black divide in the U.S. is both racial stereotypes and segregating urban policy; the divide is stable because of how the material and cultural factors reinforce each other.

This view is enlightening because it helps explain why hereditary phenotype, representations of people based on hereditary phenotype, requests for people to identify with a race even when this may not make any sense, policies about inheritance and schooling, etc. all are part of the same complex. When we were setting out to develop the method described above, we were trying to correct for a sampling bias in media while testing for the distribution of culture across some objectively determinable population variables. But the objective qualities (such as zip code) are themselves functions of the cultural traits when considered over the course of time. In short, our model, which just tabulates individual differences without looking at temporal mechanisms, is naive.

But it’s a start, if only to an interesting discussion.

References

Marsden, Peter V., and Joseph F. Swingle. “Conceptualizing and measuring culture in surveys: Values, strategies, and symbols.” Poetics 22.4 (1994): 269-289.

Omi, Michael, and Howard Winant. Racial formation in the United States. Routledge, 2014.

Tufekci, Zeynep. “Big Questions for Social Media Big Data: Representativeness, Validity and Other Methodological Pitfalls.” ICWSM 14 (2014): 505-514.

Privacy, trust, context, and legitimate peripheral participation

Privacy is important. For Nissenbaum, what’s essential to privacy is control over context. But what is context?

Using Luhmann’s framework of social systems–ignoring for a moment e.g. Habermas’ criticism and accepting the naturalized, systems theoretic understanding of society–we would have to see a context as a subsystem of the total social system. Insofar as the social system is constituted by many acts of communication–let’s visualize this as a network of agents, whose edges are acts of communication–a context is something preserved by configurations of agents and the way they interact.

Some of the forces that shape a social system will be exogenous. A river dividing two cities or, more abstractly, distance. In the digital domain, the barriers of interoperability between one virtual community infrastructure and another.

But others will be endogenous, formed from the social interactions themselves. An example is the gradual deepening of trust between agents based on a history of communication. Perhaps early conversations are formal, stilted. Later, an agent takes a risk, sharing something more personal–more private? It is reciprocated. Slowly, a trust bond, an evinced sharing of interests and mutual investment, becomes the foundation of cooperation. The Prisoner’s Dilemma is solved the old fashioned way.

Following Carey’s logic that communication as mere transmission, when sustained over time, becomes communication as ritual and the foundation of community, we can look at this slow process of trust formation as one of the ways that a context, in Nissenbaum’s sense, perhaps, forms. If Anne and Betsy have mutually internalized each other’s interests, then information flow between them will by and large support the interests of the pair, and Betsy will have low incentives to reveal private information in a way that would be detrimental to Anne.

Of course this is a huge oversimplification in lots of ways. One way is that it does not take into account the way the same agent may participate in many social roles or contexts. Communication is not a single edge from one agent to another in many circumstances. Perhaps the situation is better represented as a hypergraph. One reason why this whole domain may be so difficult to reason about is the sheer representational complexity of modeling the situation. It may require the kind of mathematical sophistication used by quantum physicists. Why not?
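
For what it’s worth, the hypergraph idea is easy enough to sketch, even if reasoning about it is not. What follows is a toy illustration under my own assumptions, not a worked-out model: each act of communication is a hyperedge joining any number of agents and carrying a context label, and the names Communication, contexts_by_agent, and shared_contexts are invented for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Communication:
    """One act of communication: a hyperedge joining any number of agents."""
    participants: frozenset
    context: str  # hypothetical label, e.g. "mailing list" or "church"


def contexts_by_agent(communications):
    """Map each agent to the set of contexts they take part in."""
    result = defaultdict(set)
    for act in communications:
        for agent in act.participants:
            result[agent].add(act.context)
    return dict(result)


def shared_contexts(communications, a, b):
    """Contexts in which agents a and b appear together in the same act."""
    return {
        act.context
        for act in communications
        if a in act.participants and b in act.participants
    }


# Anne and Betsy come from the example above; Carol is a hypothetical third agent.
acts = [
    Communication(frozenset({"Anne", "Betsy"}), "friendship"),
    Communication(frozenset({"Anne", "Betsy", "Carol"}), "mailing list"),
    Communication(frozenset({"Anne", "Carol"}), "church"),
]
```

Even in this tiny case the same agent shows up in several contexts at once, and an ordinary graph with one edge per pair of agents would blur the three-way mailing list exchange into three separate two-way conversations.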

Not having that kind of insight into the problem yet, I will continue to sling what the social scientists call ‘theory’. Let’s talk about an existing community of practice, where the practice is a certain kind of communication. A community of scholars. A community of software developers. Weird Twitter. A backchannel mailing list coordinating a political campaign. A church.

According to Lave and Wenger, the way newcomers gradually become members and oldtimers of a community of practice is legitimate peripheral participation. This is consistent with the model described above characterizing the growth of trust through gradually deepening communication. Peripheral participation is low-risk. In an open source context, this might be as simple as writing a question to the mailing list or filing a bug report. Over time, the agent displays good faith and competence. (I’m disappointed to read just now that Wenger ultimately abandoned this model in favor of a theory of dualities. Is that a Hail Mary for empirical content for the theory? Also interested to follow links on this topic to a citation of von Krogh 1998, whose later work found its way onto my Open Collaboration and Peer Production syllabus. It’s a small world.

I’ve begun reading, as I write, this fascinating paper by Hildreth and Kimble 2002 and have now lost my thread. Can I recover?)

Some questions:

  • Can this process of context-formation be characterized empirically through an analysis of e.g. the timing dynamics of communication (cf. Thomas Maillart’s work)? If so, what does that tell us about the design of information systems for privacy?
  • What about illegitimate peripheral participation? Arguably, this blog is that kind of participation–it participates in a form of informal, unendorsed quasi-scholarship. It is a tool of context and disciplinary collapse. Is that a kind of violation of privacy? Why not?