Digifesto

Tag: racial projects

On descent-based discrimination (a reply to Hanna et al. 2020)

In what is likely to be a precedent-setting case, California regulators filed a suit in the federal court on June 30 against Cisco Systems Inc, alleging that the company failed to prevent discrimination, harassment and retaliation against a Dalit engineer, anonymised as “John Doe” in the filing.

The Cisco case bears the burden of making anti-Dalit prejudice legible to American civil rights law as an extreme form of social disability attached to those formerly classified as “Untouchable.” Herein lies its key legal significance. The suit implicitly compares two systems of descent-based discrimination – caste and race – and translates between them to find points of convergence or family resemblance.

A. Rao, link

There is not much I can add to this article about caste-based discrimination in the U.S. In the lawsuit, a team of high-caste South Asians in California is alleged to have discriminated against a Dalit engineer coworker. The work of the lawsuit is to make caste-based discrimination legible to American civil rights law. It draws, correctly in my view, the connection to race.

This illustrative example prompts me to respond to Hanna et al.’s 2020 “Towards a critical race methodology in algorithmic fairness.” This paper by a Google team included a serious, thoughtful consideration of the argument I put forward with my co-author Bruce Haynes in “Racial categories in machine learning”. I like the Hanna et al. paper, think it makes interesting and valid points about the multidimensionality of race, and am grateful for their attention to my work.

I also disagree with some of their characterization of our argument and one of the positions they take. For some time I’ve intended to write a response. Now is a fine time.

First, a quibble: Hanna et al. describe Bruce D. Haynes as a “critical race scholar” and while he may have changed his mind since our writing, at the time he was adamant (in conversation) that he is not a critical race scholar, but that “critical race studies” refers to a specific intellectual project of racial critique that just happens to be really trendy on Twitter. There are lots and lots of other ways to study race critically that are not “critical race studies”. I believe this point was important to Bruce as a matter of scholarly identity. I also feel that it’s an important point because, frankly, I don’t find a lot of “critical race studies” scholarship persuasive and I probably wouldn’t have collaborated as happily with somebody of that persuasion.

So the fact that Hanna et al. explicitly position their analysis in “critical race” methods is a signpost that they are actually trying to accomplish a much more specific, disciplinarily informed project than we were. Sadly, they did not get into the question of how “critical race methodology” differs from other methodologies one might use to study race. That’s too bad, as it supports what I feel is a stifling hegemony of that particular discourse over discussions of race and technology.

The Google team is supportive of the most important contribution of our paper: that racial categories are problematic and that this needs to be addressed in the fairness in AI literature. They then go on to argue against our proposed solution of “using an unsupervised machine learning method to create race-like categories which aim to address ‘historical racial segregation without reproducing the political construction of racial categories’” (their rendering). I will defend our solution here.

Their first claim:

First, it would be a grave error to supplant the existing categories of race with race-like categories inferred by unsupervised learning methods. Despite the risk of reifying the socially constructed idea called race, race does exist in the world, as a way of mental sorting, as a discourse which is adopted, as a social thing which has both structural and ideological components. In other words, although race is socially constructed, race still has power. To supplant race with race-like categories for the purposes of measurement sidesteps the problem.

This paragraph does feel very “critical race studies” to me, in that it makes totalizing claims about the work race does in society in a way that precludes the possibility of any concrete or focused intervention. I think they misunderstand our proposal in the following ways:

  • We are not proposing that, at a societal and institutional level, we institute a new, stable system of categories derived from patterns of segregation. We are proposing that, ideally, temporary quasi-racial categories be derived dynamically from data about segregation in a way that destabilizes the social mechanisms that reproduce racial hierarchy, reducing the power of those categories. (A sketch of this idea follows the list.)
  • This is proposed as an intervention to be adopted by specific technical systems, not at the level of hegemonic political discourse. It is a way of formulating an anti-racist racial project by undermining the way categories are maintained.
  • Indeed, the idea is to sidestep the problem, in the sense that it is an elegant way to reduce the harm that the problem does. Sidestepping is, after all, a way of avoiding a danger. In this case, that danger is the reification of race in large-scale digital platforms (for example).
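To make the first point above concrete, here is a minimal sketch of what dynamically derived, temporary categories could look like. It assumes scikit-learn and simulated data; the snapshot matrices and the `derive_qr_dimensions` helper are hypothetical illustrations, not code from our paper.

```python
# A toy sketch of "temporary" quasi-racial (QR) categories: dimensions are
# re-derived from each fresh snapshot of segregation data, so no fixed
# category scheme is ever institutionalized. Simulated data; illustrative only.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

def derive_qr_dimensions(snapshot, n_dims=3):
    """Fit PCA to a (units x features) matrix describing how race-inscribed
    features are currently distributed across spatial or social units."""
    return PCA(n_components=n_dims).fit(snapshot)

# Simulated monthly snapshots: 200 spatial/social units x 10 features each.
snapshots = [rng.normal(size=(200, 10)) for _ in range(3)]

# Each period we refit on fresh data and discard the old model, so the
# derived categories are temporary and track segregation as it changes.
for snapshot in snapshots:
    qr_model = derive_qr_dimensions(snapshot)
    # qr_model.components_ holds this period's quasi-racial dimensions.
```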

Next, they argue:

Second, supplanting race with race-like categories depends highly on context, namely how race operates within particular systems of inequality and domination. Benthall and Haynes restrict their analysis to that of spatial segregation, which is, to be sure, an important and active research area and subject of significant policy discussion (e.g. [76, 99]). However, that metric may appear illegible to analyses pertaining to other racialized institutions, such as the criminal justice system, education, or employment (although one can readily see their connections and interdependencies). The way that race matters or pertains to particular types of structural inequality depends on that context and requires its own modes of operationalization.

Here, the Google team takes the anthropological turn and, like many before them, suggests that a general technical proposal is insufficient because it is not sufficiently contextualized. Besides echoing the general problem of the ineffectualness of anthropological methods in technology ethics, they also mischaracterize our paper by saying we restrict our analysis to spatial segregation. This is not true: in the paper we generalize our analysis to social segregation, as on a social network graph. Naturally, we would be (a) interested in and open to other systems of identifying race as a feature of social structure, and (b) inclined to tailor the data over which any operationalization technique is applied, where appropriate, to the technical and functional context. At the same time, we are on quite solid ground in saying that race is structural and systemic, defined at a holistic societal level as much as it has ramifications in, and is impacted by, the micro and contextual levels. As we are approaching the problem from a structural sociological perspective, we can imagine a structural technical solution. This is an advantage of the method over a more anthropological one.

Third:

At the same time we focus on the ontological aspects of race (what is race, how is it constituted and imagined in the world), it is necessary to pay attention to what we do with race and measures which may be interpreted as race. The creation of metrics and indicators which are race-like will still be interpreted as race.

This is a strange criticism, given that one of the potential problems with our paper is that the quasi-racial categories we propose are not interpretable. The authors seem to think that our solution involves the institution of new quasi-racial categories at the level of representation or discourse. That’s not what we’ve proposed. We’ve proposed a design for a machine learning system which, we’d hope, would be understood well enough by its engineers to work as an intervention. Indeed, the correlation of the quasi-racial categories with socially recognized racial ones is important if they are to ground fairness interventions; the purpose of our proposed solution is narrowly to allow for these interventions without the reification of the categories.

Enough defense. There is a point the Google team insists on which strikes me as somewhat odd and which, to me, signals a further weakness of their hyper-contextualized method: its inability to generalize beyond the hermeneutic cycles of “critical race theory”.

Hanna et al. list several (seven) different “dimensions of race” based on different ways race can be ascribed, inferred, or expressed. There is, here, the anthropological concern with the individual body and its multifaceted presentations in the complex social field. But they explicitly reject one of the most fundamental ways in which race operates at a transpersonal and structural level, which is through families and genealogy. This is well-intentioned but ultimately misguided.

Note that we have excluded “racial ancestry” from this table. Geneticists, biomedical researchers, and sociologists of science have criticized the use of “race” to describe genetic ancestry within biomedical research [40, 49, 84, 122], while others have criticized the use of direct-to-consumer genetic testing and its implications for racial and ethnic identification [15, 91, 113].

In our paper, we take pains to point out responsibly how many aspects of race, such as phenotype, nationality (through citizenship rules), and class signifiers (through inheritance), are connected with ancestry. We, of course, do not mean to equate ancestry with race. Nor, especially, are we saying that there are genetic racialized qualities besides perhaps those associated with phenotype. We are also not saying that direct-to-consumer genetic test data is what institutions should base their inference of quasi-racial categories on. Nothing like that.

However, speaking for myself, I believe that an important aspect of how race functions at a social structural level is how it implicates relations of ancestry. A. Rao perhaps puts the point better: race is a system of inherited privilege, and racial discrimination is more often than not discrimination based on descent.

Understanding this about race allows us to see what race has in common with other systems of categorical inequality, such as the caste system. And here was a large part of the point of offering an algorithmic solution: to suggest a system for identifying inequality that transcends the logic of what is currently recognized within the discourse of “critical race theory” and anticipates forms of inequality and discrimination that have not yet been so politically recognized. This will increasingly become an issue as a pluralistic society (or the user base of an on-line platform) interacts with populations whose categorical inequalities have histories and origins other than those of the U.S. racial system. Though our paper used African-Americans as a referent group, the scope of our proposal was intentionally much broader.

References

Benthall, S., & Haynes, B. D. (2019, January). Racial categories in machine learning. In Proceedings of the Conference on Fairness, Accountability, and Transparency (pp. 289-298).

Hanna, A., Denton, E., Smart, A., & Smith-Loud, J. (2020, January). Towards a critical race methodology in algorithmic fairness. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (pp. 501-512).

All the problems with our paper, “Racial categories in machine learning”

Bruce Haynes and I were blown away by the reception to our paper, “Racial categories in machine learning”. This was a huge experiment in interdisciplinary collaboration for us. We are excited about the next steps in this line of research.

That includes engaging with criticism. One of our goals was to fuel a conversation in the research community about the operationalization of race. That isn’t a question that can be addressed by any one paper or team of researchers. So one thing we got out of the conference was great critical feedback on potential problems with the approach we proposed.

This post is an attempt to capture those critiques.

Need for participatory design

Khadijah Abdurahman, of Word to RI, issued a subtweeted challenge to us to present our paper to the hood. (RI stands for Roosevelt Island, in New York City, the location of the recently established Cornell Tech campus.)

One striking challenge, raised by Khadijah Abdurahman on Twitter, is that we should be developing peer relationships with the communities we research. I read this as a call for participatory design. It’s true this was not part of the process of the paper. In particular, Ms. Abdurahman points to a part of our abstract that uses jargon from computer science.

There are a lot of ways to respond to this comment. The first is to accept the challenge. I would personally love it if Bruce and I could present our research to folks on Roosevelt Island and get feedback from them.

There are other ways to respond that address the tensions of this comment. One is to point out that in addition to being an accomplished scholar of the sociology of race and how it forms, especially in urban settings, Bruce is a black man who is originally from Harlem. Indeed, Bruce’s family memoir shows his deep and well-researched familiarity with the life of marginalized people of the hood. So a “peer relationship” between an algorithm designer (me) and a member of an affected community (Bruce) is really part of the origin of our work.

Another is to point out that we did not research a particular community. Our paper was not human subjects research; it was about the racial categories that are maintained by the U.S. federal government and which pervade society in a very general way. Indeed, everybody is affected by these categories. When I and others who look like me are ascribed “white”, that is an example of these categories at work. Bruce and I were very aware of how different kinds of people at the conference responded to our work, and how it was an intervention in our own community, which is of course affected by these racial categories.

The last point is that computer science jargon is alienating to basically everybody who is not trained in computer science, whether they live in the hood or not. And the fact is we presented our work at a computer science venue. Personally, I’m in favor of universal education in computational statistics, but that is a tall order. If our work becomes successful, I could see it becoming part of, for example, a statistical demography curriculum that could be of popular interest. But this is early days.

The Quasi-Racial (QR) Categories are Not Interpretable

In our presentation, we introduced some terminology that did not make it into the paper. We named the vectors of segregation derived by our procedure “quasi-racial” (QR) vectors, to denote that we were trying to capture dimensions that were race-like, in that they captured the patterns of historic and ongoing racial injustice, without being the racial categories themselves, which we argued are inherently unfair categories of inequality.

First, we are not wedded to the name “quasi-racial” and are very open to different terminology if anybody has an idea for something better to call them.

More importantly, somebody pointed out that these QR vectors may not be interpretable. Given that the conference is not only about Fairness, but also Accountability and Transparency, this critique is certainly on point.

To be honest, I have not yet done the work of surveying the extensive literature on algorithm interpretability to give a nuanced response. I can give two informal responses. The first is that one assumption of our proposal is that there is something wrong with how race and racial categories are intuitively understood. Normal people’s understanding of race is, of course, ridden with stereotypes, implicit biases, false causal models, and so on. If we proposed an algorithm that was fully “interpretable” according to most people’s understanding of what race is, that algorithm would likely have racist or racially unequal outcomes. That’s precisely the problem that we are trying to get at with our work. In other words, when categories are inherently unfair, interpretability and fairness may be at odds.

The second response is that educating people about how the procedure works and why it is motivated is part of what makes its outcomes interpretable. Teaching people about the history of racial categories, and how those categories are both the cause and effect of segregation in space and society, makes the algorithm interpretable. Teaching people about Principal Component Analysis, the algorithm we employ, is part of what makes the system interpretable. We are trying to drop knowledge; I don’t think we are offering any shortcuts.

Principal Component Analysis (PCA) may not be the right technique

An objection from the computer science end of the spectrum was that our proposed use of Principal Component Analysis (PCA) was not well motivated enough. PCA is just one of many dimensionality reduction techniques; why did we choose it in particular? PCA has many assumptions about the input embedded within it, including that the component vectors of interest are linear combinations of the inputs. What if the best QR representation is a non-linear combination of the input variables? And our use of unsupervised learning, as a general criticism, is perhaps lazy, since in order to validate its usefulness we will need to test it with labeled data anyway. We might be better off with a more carefully calibrated and better motivated alternative technique.
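To illustrate the linearity point (and not as a representation of what our paper implements), here is a minimal sketch contrasting PCA with one nonlinear alternative, kernel PCA, on simulated data:

```python
# Sketch of the linearity criticism: PCA components are linear combinations
# of the inputs, while kernel PCA can recover nonlinear structure.
# Simulated, illustrative data only.
import numpy as np
from sklearn.decomposition import PCA, KernelPCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 12))  # persons x race-inscribed features

# Linear PCA: each component is a weighted sum of the input features.
linear = PCA(n_components=2).fit_transform(X)

# Kernel PCA with an RBF kernel: components can capture nonlinear
# combinations that linear PCA would miss.
nonlinear = KernelPCA(n_components=2, kernel="rbf").fit_transform(X)
```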

These are all fair criticisms. I am personally not satisfied with the technical component of the paper and presentation. I know the rigor of the analysis is not of the standard that would impress a machine learning scholar and can take full responsibility for that. I hope to do better in a future iteration of the work, and welcome any advice on how to do that from colleagues. I’d also be interested to see how more technically skilled computer scientists and formal modelers address the problem of unfair racial categories that we raised in the paper.

I see our main contribution as the raising of this problem of unfair categories, not our particular technical solution to it. As a potential solution, I hope that it’s better than nothing, a step in the right direction, and provocative. I subscribe to the belief that science is an iterative process and look forward to the next cycle of work.

Please feel free to reach out if you have a critique of our work that we’ve missed. We do appreciate all the feedback!

For fairness in machine learning, we need to consider the unfairness of racial categorization

Pre-prints of papers accepted to this coming 2019 Fairness, Accountability, and Transparency conference are floating around Twitter. From the looks of it, many of these papers add a wealth of historical and political context, which I feel is a big improvement.

A noteworthy paper, in this regard, is Hutchinson and Mitchell’s “50 Years of Test (Un)fairness: Lessons for Machine Learning”, which puts recent ‘fairness in machine learning’ work in the context of very analogous debates from the ’60s and ’70s that concerned the use of testing that could be biased by cultural factors.

I like this paper a lot, in part because it is very thorough and in part because it tees up a line of argument that’s dear to me. Hutchinson and Mitchell raise the question of how to properly think about fairness in machine learning when the protected categories invoked by nondiscrimination law are themselves social constructs.

Some work on practically assessing fairness in ML has tackled the problem of using race as a construct. This echoes concerns in the testing literature that stem back to at least 1966: “one stumbles immediately over the scientific difficulty of establishing clear yardsticks by which people can be classified into convenient racial categories” [30]. Recent approaches have used Fitzpatrick skin type or unsupervised clustering to avoid racial categorizations [7, 55]. We note that the testing literature of the 1960s and 1970s frequently uses the phrase “cultural fairness” when referring to parity between blacks and whites.

They conclude that this is one of the areas where there can be a lot more useful work:

This short review of historical connections in fairness suggests several concrete steps forward for future research in ML fairness: Diving more deeply into the question of how subgroups are defined, suggested as early as 1966 [30], including questioning whether subgroups should be treated as discrete categories at all, and how intersectionality can be modeled. This might include, for example, how to quantify fairness along one dimension (e.g., age) conditioned on another dimension (e.g., skin tone), as recent work has begun to address [27, 39].

This is all very cool to read, because this is precisely the topic that Bruce Haynes and I address in our FAT* paper, “Racial categories in machine learning” (arXiv link). The problem we confront in this paper is that the racial categories we are used to using in the United States (White, Black, Asian) originate in the white supremacy that was enshrined into the Constitution when it was formed and perpetuated since then through the legal system (with some countervailing activity during the Civil Rights Movement, for example). This puts “fair machine learning” researchers in a bind: either they can use these categories, which have always been about perpetuating social inequality, or they can ignore the categories and reproduce the patterns of social inequality that prevail in fact because of the history of race.

In the paper, we propose a third option. First, rather than reify racial categories, we propose breaking race down into the kinds of personal features that get inscribed with racial meaning. Phenotype properties like skin type and ocular folds are one such set of features. Another set consists of events that indicate position in social class, such as being arrested or receiving welfare. Another set consists of facts about the national and geographic origin of one’s ancestors. These facts about a person are clearly relevant to how racial distinctions are made, but are themselves more granular and multidimensional than race.

The next step is to detect race-like categories by looking at who is segregated from whom. We propose an unsupervised machine learning technique that works with the distribution of the phenotype, class, and ancestry features across spatial tracts (as when considering where people physically live) or across a social network (as when considering people’s professional networks, for example). Principal component analysis can identify which race-like dimensions capture the greatest amounts of spatial and social separation. We hypothesize that these dimensions will encode the tangible ways racial categorization has shaped the social structure; these effects may include both politically recognized forms of discrimination and forms of discrimination that have not yet been surfaced. These dimensions can then be used to classify people in race-like ways as input to fairness interventions in machine learning.
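For readers who want the gist of the procedure in code, here is a minimal sketch of the pipeline as described above, with simulated data standing in for real tract-level features; the variable names are illustrative, not from our implementation.

```python
# Sketch of the proposal: derive quasi-racial (QR) dimensions from the
# distribution of race-inscribed features across spatial tracts (or
# social-network neighborhoods). Simulated data; illustrative only.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Rows are tracts; columns summarize the distribution of phenotype, class,
# and ancestry features among the people in each tract.
n_tracts, n_features = 1000, 15
tract_features = rng.normal(size=(n_tracts, n_features))

# PCA finds the dimensions along which tracts are most separated, i.e. the
# axes capturing the greatest spatial/social segregation.
pca = PCA(n_components=3).fit(tract_features)

# A person described by the same features can be projected onto these QR
# dimensions, which can then feed a downstream fairness intervention.
person = rng.normal(size=(1, n_features))
qr_coordinates = pca.transform(person)
```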

A key part of our proposal is that race-like classification depends on the empirical distribution of persons in physical and social space, and so is not fixed. This operationalizes the way that race is socially and politically constructed without reifying the categories in terms that reproduce their white supremacist origins.

I’m quite stoked about this research, though obviously it raises a lot of serious challenges in terms of validation.

“The Theory of Racial Formation”: notes, part 1 (Cha. 4, Omi and Winant, 2014)

Chapter 4 of Omi and Winant (2014) is “The Theory of Racial Formation”. It is where they lay out their theory of race and its formation, synthesizing and improving on theories of race as ethnicity, race as class, and race as nation that they consider earlier in the book.

This rhetorical strategy of presenting the historical development of multiple threads of prior theory before synthesizing them into something new is familiar to me from my work with Helen Nissenbaum on Contextual Integrity. CI is a theory of privacy that advances prior legal and social theories by teasing out their tensions. This seems to be a good way to advance theory through scholarship. It is interesting that the same method of theory building can work in multiple fields. My sense is that what’s going on is that there is an underlying logic to this process which in a less Anglophone world we might call “dialectical”. But I digress.

N.B. Sep. 17 2020 – These informal notes were part of the process of writing “Racial categories in machine learning”, with Bruce Haynes.

I have not finished Chapter 4 yet, but I wanted to sketch out the outline of it before going into detail. That’s because what Omi and Winant are presenting is a way of understanding the mechanisms behind the reproduction of race that is not simplistically “systemic” but rather breaks it down into discrete operations. This is a helpful contribution; even if the theory is not entirely accurate, its very specificity elevates the discourse.

So, in brief notes:

For Omi and Winant, race is a way of “making up people”; they attribute this phrase to Ian Hacking but do not develop Hacking’s definition. Their reference to a philosopher of science does situate them in a scholarly sense; it is nice that they seem to acknowledge an implicit hierarchy of theory that places philosophy at the foundation. This is correct.

Race-making is a form of othering, of having a group of people identify another group as outsiders. Othering is a basic and perhaps unavoidable human psychological function; their reference for this is powell and Menendian. (john a. powell is apparently one of those people, like danah boyd, who decapitalize their names.)

Race is of course a social construct that is neither a fixed and coherent category nor something that is “unreal”. That is, presumably, why we need a whole book on the dynamic mechanisms that form it. One reason why race is such a dynamic concept is that (a) it is a way of organizing inequality in society, (b) the people on “top” of the hierarchy implied by racial categories enforce/reproduce that category “downwards”, (c) the people on the “bottom” of the hierarchy implied by racial categories also enforce/reproduce a variation of those categories “upwards” as a form of resistance, and so (d) the state of the racial categories at any particular time is a temporary consequence of conflicting “elite” and “street” variations of it.

This presumes that race is fundamentally about inequality. Omi and Winant believe it is. In fact, they think racial categories are a template for all other social categories that are about inequality. This is what they mean by their claim that race is a master category. It’s “a frame used for organizing all manner of political thought”, particularly political thought about liberation struggles.

I’m not convinced by this point. They develop it with a long discussion of intersectionality that is also unconvincing to me. Historically, they point out that sometimes women’s movements have allied with black power movements, and sometimes they haven’t. They want the reader to think this is interesting; as a data scientist, I see randomness and lack of correlation. They make the poignant and true point that “perhaps at the core of intersectionality practice, as well as theory, is the ‘mixed race’ category. Well, how does it come about that people can be ‘mixed’?” They then drop the point with no further discussion.

[Edit: While Omi and Winant do address the issue of what it means to be ‘mixed race’ in more depth later in the book, their treatment of intersectionality remains for me difficult. Race is a system of political categorization; however, racial categories are hereditary in a way that sexual categories are not. That is an important difference in how the categories are formed and maintained, one that is glossed over in O&W’s treatment of the subject, as well as in popular discourse.]

Omi and Winant make an intriguing comment, “In legal theory, the sexual contract and racial contract have often been compared”. I don’t know what this is about but I want to know more.

This is all a kind of preamble to their presentation of theory. They start to provide some definitions:

racial formation
The sociohistorical process by which racial identities are created, lived out, transformed, and destroyed.
racialization
How phenomic-corporeal dimensions of bodies acquire meaning in social life.
racial projects
The co-constitutive ways that racial meanings are translated into social structures and become racially signified.
racism
A property of racial projects that Omi and Winant will discuss in Chapter 4.
racial politics
Ways that the politics (of a state?) can handle race, including racial despotism, racial democracy, and racial hegemony.

This is a useful breakdown. More detail in the next post.

population traits, culture traits, and racial projects: a methods challenge #ica18

In a recent paper I’ve been working on with Mark Hannah that he’s presenting this week at the International Communications Association conference, we take on the question of whether and how “big data” can be used to study the culture of a population.

By “big data” we meant, roughly, large social media data sets. The pitfalls of using this sort of data for any general study of a population are perhaps best articulated by Tufekci (2014). In short: studies based on social media data often sample on the dependent variable because they only consider the people representing themselves on social media, though these are only a small portion of the population. To put it another way, the sample suffers from the 1% rule of Internet cultures: for any on-line community, only 1% create content, 10% interact with the content somehow, and the rest lurk. The behavior and attitudes of the lurkers, in addition to any field effects in the “background” of the data (latent variables in the social field of production), are all out of band and so opaque to the analyst.
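A toy simulation makes this selection effect vivid. Assumptions (made up purely for illustration): attitudes are normally distributed, and the probability of posting rises with the attitude being studied.

```python
# Toy simulation of sampling on the dependent variable: if the probability
# of posting rises with the attitude being studied, the creators visible in
# social media data are a biased sample of the population.
import numpy as np

rng = np.random.default_rng(0)
attitude = rng.normal(size=100_000)

# Per the 1% rule, only a small minority posts; here posting correlates
# with the attitude itself (an assumption for illustration).
p_post = 0.01 * (1 + np.clip(attitude, 0, None))
posts = rng.random(100_000) < p_post

print(attitude.mean())         # population mean: ~0.0
print(attitude[posts].mean())  # creators-only mean: biased upward
```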

By “the culture of a population”, we meant something specific: the distribution of values, beliefs, dispositions, and tastes of a particular group of people. The best source we found on this was Marsden and Swingle (1994), an article from a time before the Internet had started to transform academia. Then, and perhaps now, the best way to study the distribution of culture across a broad population was a survey. The idea is that you sample the population according to some responsible statistics, you ask them some questions about their values, beliefs, dispositions, and tastes, and you report the results. Voilà!

(Given the methodological divergence here, the fact that many people, especially ‘people on the Internet’, now view culture mainly through the lens of other people on the Internet is obviously a huge problem. Most people are not in this sample, and yet we pretend that it is representative because it’s easily available for analysis. Hence, our concept of culture (or cultures) is screwy, reflecting, much more than is warranted, whatever sorts of cultures are flourishing in a pseudonymous, bot-ridden, commercial attention economy.)

Can we productively combine social media data with survey methods to get a better method for studying the culture of a population? We think so. We propose the following as a general method framework:

(1) Figure out the population of interest by their stable, independent ‘population traits’ and look for their activity on social media. Sample from this.

(2) Do exploratory data analysis to inductively get content themes and observations about social structure from this data.

(3) Use the inductively generated themes from step (2) to design a survey addressing cultural traits of the population (beliefs, values, dispositions, tastes).

(4) Conduct a stratified sample specifically across social media creators, synthesizers (e.g. people who like, retweet, and respond), and the general population and/or known audience, and distribute the survey. (A sketch of this step follows the list.)

(5) Extrapolate the results to general conclusions.

(6) Validate the conclusions with other data, or note discrepancies for future iterations.
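To make step (4) concrete, here is a minimal sketch of a stratified sample over participation strata, using pandas; the frame, the strata labels, and the per-stratum counts are all hypothetical.

```python
# Sketch of step (4): sample a fixed number of respondents per participation
# stratum, so creators and synthesizers are not drowned out by lurkers.
# Hypothetical frame and counts; illustrative only.
import pandas as pd

frame = pd.DataFrame({
    "user_id": range(10_000),
    "stratum": ["creator"] * 100 + ["synthesizer"] * 1_000 + ["lurker"] * 8_900,
})

# Draw 50 users from each stratum to receive the survey.
sample = frame.groupby("stratum").sample(n=50, random_state=0)
```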

I feel pretty good about this framework as a step forward, except that in the interest of time we had to sidestep what is maybe the most interesting question raised by it: what is the difference between a population trait and a cultural trait?

Here’s what we were thinking:

Population trait              Cultural trait
Location                      Twitter use (creator, synthesizer, lurker, none)
Age                           Political views: left, right, center
Permanent unique identifier   Attitude towards media
                              Preferred news source
                              Pepsi or coke?

One thing to note: we decided that traits about media production and consumption were a subtype of cultural traits. I.e., if you use Twitter, that’s a particular cultural trait that may be correlated with other cultural traits. That makes the problem of sampling on the dependent variable explicit.

But the other thing to note is that there are certain categories that we did not put on this list. Which ones? Gender, race, etc. Why not? Because choosing whether these are population traits or cultural traits opens a big bag of worms that is the subject of active political contest. That discussion was well beyond the scope of the paper!

The dicey thing about this kind of research is that we explicitly designed it to try to avoid investigator bias. That includes the bias of seeing the world through social categories that we might otherwise naturalize or reify. Naturally, though, if we were to actually conduct this method on a sample, such as, I dunno, a sample of Twitter-using academics, we would very quickly discover that certain social categories (men, women, person of color, etc.) were themes people talked about and so would be included as survey items under cultural traits.

That is not terrible. It’s probably safer to do that than to treat them like immutable, independent properties of a person. It does seem to leave something out, though. For example, say one were to identify race as a cultural trait and then ask people to identify with a race. Then one takes the results, does a factor analysis, and discovers a factor that combines a racial affinity with media preferences and participation rates. One then identifies the prevalence of this factor in a certain region with a certain age demographic. One might object that this result represents a racial category as entailing certain cultural traits, leaving out the cultural minority within a racial demographic that wants more representation.
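For concreteness, this is roughly what such a factor analysis looks like in code. The responses are simulated and the interpretation of the items is hypothetical; it is a sketch of the scenario, not an analysis of real data.

```python
# Sketch of the factor analysis scenario above: survey items covering racial
# affinity, media preferences, and participation, reduced to latent factors.
# Simulated responses; illustrative only.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
responses = rng.normal(size=(800, 6))  # respondents x survey items

fa = FactorAnalysis(n_components=2)
scores = fa.fit_transform(responses)  # each respondent's factor scores

# fa.components_ gives each item's loading on each factor; a factor that
# loads on both a racial-affinity item and media-preference items is the
# kind of result discussed above.
```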

This is upsetting to some people when, for example, Facebook does this and allows advertisers to target things based on “ethnic affinity”. Presumably, Facebook is doing just this kind of factor analysis when they identify these categories.

Arguably, that’s not what this sort of science is for. But the fact that the objection seems pertinent is an informative intuition in its own right.

Maybe the right framework for understanding why this is problematic is Omi and Winant’s racial formation theory (2014). I’m just getting into this theory recently, at the recommendation of Bruce Haynes, whom I look up to as an authority on race in America. According to racial formation theory, racial categories are stable because they include both representations of groups of people as having certain qualities and social structures controlling the distribution of resources. So the white/black divide in the U.S. is both racial stereotypes and segregating urban policy; the divide is stable because the material and cultural factors reinforce each other.

This view is enlightening because it helps explain why hereditary phenotype, representations of people based on hereditary phenotype, requests for people to identify with a race even when this may not make any sense, policies about inheritance and schooling, etc. all are part of the same complex. When we were setting out to develop the method described above, we were trying to correct for a sampling bias in media while testing for the distribution of culture across some objectively determinable population variables. But the objective qualities (such as zip code) are themselves functions of the cultural traits when considered over the course of time. In short, our model, which just tabulates individual differences without looking at temporal mechanisms, is naive.

But it’s a start, if only to an interesting discussion.

References

Marsden, Peter V., and Joseph F. Swingle. “Conceptualizing and measuring culture in surveys: Values, strategies, and symbols.” Poetics 22.4 (1994): 269-289.

Omi, Michael, and Howard Winant. Racial formation in the United States. Routledge, 2014.

Tufekci, Zeynep. “Big Questions for Social Media Big Data: Representativeness, Validity and Other Methodological Pitfalls.” ICWSM 14 (2014): 505-514.