In a recent paper I’ve been working on with Mark Hannah, which he’s presenting this week at the International Communication Association conference, we take on the question of whether and how “big data” can be used to study the culture of a population.
By “big data” we meant, roughly, large social media data sets. The pitfalls of using this sort of data for any general study of a population are perhaps best articulated by Tufekci (2014). In short: studies based on social media data are often sampling on the dependent variable because they only consider the people representing themselves on social media, though this is only a small portion of the population. To put it another way, the sample suffers from the 1% rule of Internet cultures: for any on-line community, only 1% create content, 10% interact with the content somehow, and the rest lurk. The behavior and attitudes of the lurkers, in addition to any field effects in the “background” of the data (latent variables in the social field of production), are all out of band and so opaque to the analyst.
By “the culture of a population”, we meant something specific: the distribution of values, beliefs, dispositions, and tastes of a particular group of people. The best source we found on this was Marsden and Swingle (1994), an article from a time before the Internet had started to transform academia. Then and perhaps now, the best way to study the distribution of culture across a broad population was a survey. The idea is that you sample the population according to some responsible statistics, you ask them some questions about their values, beliefs, dispositions, and tastes, and you report the results. Voilà!
(Given the methodological divergence here, the fact that many people, especially ‘people on the Internet’, now view culture mainly through the lens of other people on the Internet is obviously a huge problem. Most people are not in this sample, and yet we pretend that it is representative because it’s easily available for analysis. Hence, our concept of culture (or cultures) is screwy, reflecting much more than is warranted whatever sorts of cultures are flourishing in a pseudonymous, bot-ridden, commercial attention economy.)
Can we productively combine social media data with survey methods to get a better method for studying the culture of a population? We think so. We propose the following as a general methodological framework:
(1) Figure out the population of interest by their stable, independent ‘population traits’ and look for their activity on social media. Sample from this.
(2) Do exploratory data analysis to inductively get content themes and observations about social structure from this data.
(3) Use the inductively generated themes from step (2) to design a survey addressing cultural traits of the population (beliefs, values, dispositions, tastes).
(4) Conduct a stratified sample specifically across social media creators, synthesizers (e.g. people who like, retweet, and respond), and the general population and/or known audience, and distribute the survey.
(5) Extrapolate the results to general conclusions.
(6) Validate the conclusions with other data or note discrepancies for future iterations.
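As a rough illustration of steps (4) and (5), here is a minimal Python sketch of drawing a stratified sample and then reweighting the survey answers by each stratum's share of the population. Everything in it is an assumption made for illustration rather than something from the paper: the stratum labels, the `POPULATION_SHARES` figures (borrowed loosely from the 1% rule), the function names, and the idea of a single yes/no survey item.

```python
import random
from collections import defaultdict

# Assumed population shares per stratum (hypothetical figures,
# loosely following the 1% rule; real shares would need to be estimated).
POPULATION_SHARES = {"creator": 0.01, "synthesizer": 0.10, "general": 0.89}


def stratified_sample(people, per_stratum, seed=0):
    """Draw up to `per_stratum` people from each stratum, without replacement."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for person in people:
        by_stratum[person["stratum"]].append(person)
    sample = []
    for stratum in sorted(by_stratum):
        members = by_stratum[stratum]
        k = min(per_stratum, len(members))
        sample.extend(rng.sample(members, k))
    return sample


def weighted_estimate(responses, shares=POPULATION_SHARES):
    """Estimate the population share answering True, weighting each
    stratum's observed rate by its assumed share of the population."""
    totals = defaultdict(int)
    yes = defaultdict(int)
    for r in responses:
        totals[r["stratum"]] += 1
        yes[r["stratum"]] += 1 if r["answer"] else 0
    covered = sum(shares[s] for s in totals)  # renormalize if a stratum is missing
    return sum(shares[s] * yes[s] / totals[s] for s in totals) / covered


# Toy responses: every creator says yes, half the synthesizers, nobody else.
responses = (
    [{"stratum": "creator", "answer": True} for _ in range(10)]
    + [{"stratum": "synthesizer", "answer": i < 5} for i in range(10)]
    + [{"stratum": "general", "answer": False} for _ in range(10)]
)
naive = sum(r["answer"] for r in responses) / len(responses)
print(f"unweighted: {naive:.2f}")                          # unweighted: 0.50
print(f"weighted:   {weighted_estimate(responses):.2f}")   # weighted:   0.06
```

The gap between the two numbers is the point: equal-sized strata make the survey affordable, but the raw average overrepresents creators unless the results are weighted back to the population.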
I feel pretty good about this framework as a step forward, except that in the interest of time we had to sidestep what is maybe the most interesting question raised by it, which is: what’s the difference between a population trait and a cultural trait?
Here’s what we were thinking:
|Population trait|Cultural trait|
|Location|Twitter use (creator, synthesizer, lurker, none)|
|Age|Political views: left, right, center|
|Permanent unique identifier|Attitude towards media|
||Preferred news source|
||Pepsi or Coke?|
One thing to note: we decided that traits about media production and consumption were a subtype of cultural traits. I.e., if you use Twitter, that’s a particular cultural trait that may be correlated with other cultural traits. That makes the problem of sampling on the dependent variable explicit.
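To make that concrete, here is a small Python sketch of what goes wrong when Twitter use is correlated with another cultural trait and we only look at Twitter users. The population is entirely made up for illustration: the sizes, the 70% and 30% rates, and the field names `uses_twitter` and `holds_view` are all assumptions, not data from the paper.

```python
def share_holding_view(people):
    """Fraction of `people` whose `holds_view` flag is True."""
    return sum(p["holds_view"] for p in people) / len(people)


def build_population():
    """Hypothetical population: 100 Twitter users (70 hold the view)
    and 900 non-users (270 hold the view)."""
    population = []
    for uses_twitter, size, holders in [(True, 100, 70), (False, 900, 270)]:
        for i in range(size):
            population.append(
                {"uses_twitter": uses_twitter, "holds_view": i < holders}
            )
    return population


population = build_population()
twitter_only = [p for p in population if p["uses_twitter"]]

print(f"Twitter-only estimate:  {share_holding_view(twitter_only):.2f}")  # 0.70
print(f"Whole-population share: {share_holding_view(population):.2f}")    # 0.34
```

In this toy case the Twitter-only sample roughly doubles the apparent prevalence of the view, purely because membership in the sample is itself one of the traits under study.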
But the other thing to note is that there are certain categories that we did not put on this list. Which ones? Gender, race, etc. Why not? Because choosing whether these are population traits or cultural traits opens a big bag of worms that is the subject of active political contest. That discussion was well beyond the scope of the paper!
The dicey thing about this kind of research is that we explicitly designed it to try to avoid investigator bias. That includes the bias of seeing the world through social categories that we might otherwise naturalize or reify. Naturally, though, if we were to actually conduct this method on a sample, such as, I dunno, a sample of Twitter-using academics, we would very quickly discover that certain social categories (men, women, people of color, etc.) were themes people talked about and so would be included as survey items under cultural traits.
That is not terrible. It’s probably safer to do that than to treat them like immutable, independent properties of a person. It does seem to leave something out though. For example, say one were to identify race as a cultural trait and then ask people to identify with a race. Then one takes the results, does a factor analysis, and discovers a factor that combines a racial affinity with media preferences and participation rates. One then identifies the prevalence of this factor in a certain region with a certain age demographic. One might object that this result represents a racial category as entailing certain cultural categories, and leaves out the cultural minority within a racial demographic that wants more representation.
This is upsetting to some people when, for example, Facebook does this and allows advertisers to target things based on “ethnic affinity”. Presumably, Facebook is doing just this kind of factor analysis when they identify these categories.
Arguably, that’s not what this sort of science is for. But the fact that the objection seems pertinent is an informative intuition in its own right.
Maybe the right framework for understanding why this is problematic is Omi and Winant’s racial formation theory (2014). I’m just getting into this theory recently, at the recommendation of Bruce Haynes, who I look up to as an authority on race in America. According to this theory, racial categories are stable because they include both representations of groups of people as having certain qualities and social structures controlling the distribution of resources. So, the white/black divide in the U.S. involves both racial stereotypes and segregating urban policy; the divide is stable because of how the material and cultural factors reinforce each other.
This view is enlightening because it helps explain why hereditary phenotype, representations of people based on hereditary phenotype, requests for people to identify with a race even when this may not make any sense, policies about inheritance and schooling, etc. are all part of the same complex. When we were setting out to develop the method described above, we were trying to correct for a sampling bias in media while testing for the distribution of culture across some objectively determinable population variables. But the objective qualities (such as zip code) are themselves functions of the cultural traits when considered over the course of time. In short, our model, which just tabulates individual differences without looking at temporal mechanisms, is naive.
But it’s a start, if only to an interesting discussion.
Marsden, Peter V., and Joseph F. Swingle. “Conceptualizing and measuring culture in surveys: Values, strategies, and symbols.” Poetics 22.4 (1994): 269-289.
Omi, Michael, and Howard Winant. Racial formation in the United States. Routledge, 2014.
Tufekci, Zeynep. “Big Questions for Social Media Big Data: Representativeness, Validity and Other Methodological Pitfalls.” ICWSM 14 (2014): 505-514.