Fake AI 

Edited by Frederike Kaltheuner

Meatspace Press (2021)

Book release: 14/12/2021

This book is an intervention - 

Chapter 5

The case for interpretive techniques in machine learning

By Razvan Amironesei, Emily Denton, Alex Hanna, Hilary Nicole, Andrew Smart

Many modern AI systems are designed to ingest and analyze massive datasets so as to make recommendations, predictions, or inferences on unseen inputs ranging from images to pieces of text and other forms of data. These datasets often reflect patterns of inequity that exist in the world,1 and yet the data-driven nature of AI systems often serves to obscure the technology’s limitations within a pervasive rhetoric of objectivity.2 This obscuring effect is especially important because AI technologies and methods are increasingly incorporated into all aspects of social life, often in ways that increase and accelerate existing social inequities. In this piece, we examine how and why appeals to objectivity are so deeply embedded in technological discourses and practices. As Ruha Benjamin notes, routing algorithmic bias through a rhetoric of objectivity can make it “even more difficult to challenge it and hold individuals and institutions accountable.”3 Our starting question is: how, and under which conditions, do truth claims that are embedded in algorithmic systems and associated data practices function as a justification for a myriad of harms?

One way to answer this question is to understand and account for how material artefacts (that is to say, the instruments, devices, and sociotechnical systems) contribute to an understanding of algorithmic systems as objective or scientific in nature. Of these artefacts, benchmark datasets play a crucial role in the constitution of the machine learning life cycle. These datasets are used in the training and development of artificial intelligence. They establish a “gold standard” for specific AI tasks, defining the ideal outputs of an AI system for a set range of exemplar data inputs. Benchmark datasets can be understood as measurements for assessing and comparing different AI algorithms. Within AI research communities, performance on benchmark datasets is often understood as indicative of research progress on a particular AI task. Benchmarks are the equivalent of IQ tests for algorithms. Just as IQ tests are controversial because it is unclear what exactly they measure about human intelligence, what benchmark datasets are supposed to be measuring about algorithms has never been fully articulated. The role of IQ tests in historically providing justification for white supremacist beliefs is well recognised. Benchmark datasets, by contrast, remain strikingly under-theorised and barely understood, or even acknowledged, in the public sphere, despite the significant social impact these data infrastructures have through the dissemination of unjust biases. In what follows, we will address benchmarks via two definitions which entail, as we will see, different types of problems and critiques.

Defined from a purely technical perspective, a benchmark is “a problem that has been designed to evaluate the performance of a system [which] is subjected to a known workload and the performance of the system against this workload is measured.” The objective is to compare “the measured performance” with “that of other systems that have been subject to the same benchmark test.”4
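The purely technical definition can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions: the four-item workload, its labels, and the two toy systems are invented here and are not drawn from ImageNet or any real benchmark. It shows the two steps in the definition: measuring each system against a known workload, then comparing the measured performance across systems.

```python
from typing import Callable, Dict, List, Tuple

# A benchmark in the purely technical sense: a fixed workload of
# (input, gold label) pairs. All items here are invented for illustration.
Benchmark = List[Tuple[str, str]]
System = Callable[[str], str]


def accuracy(system: System, benchmark: Benchmark) -> float:
    """Fraction of inputs for which the system reproduces the gold label."""
    correct = sum(1 for item, gold in benchmark if system(item) == gold)
    return correct / len(benchmark)


def compare(systems: Dict[str, System], benchmark: Benchmark) -> Dict[str, float]:
    """Subject every system to the same workload and collect the scores."""
    return {name: accuracy(fn, benchmark) for name, fn in systems.items()}


toy_benchmark: Benchmark = [
    ("red round fruit", "apple"),
    ("yellow curved fruit", "banana"),
    ("green round fruit", "apple"),
    ("orange round fruit", "orange"),
]


def system_a(description: str) -> str:
    # Hypothetical baseline: always answers with the most common label.
    return "apple"


def system_b(description: str) -> str:
    # Hypothetical rule-based system keyed on colour words.
    if "yellow" in description:
        return "banana"
    if "orange" in description:
        return "orange"
    return "apple"


scores = compare({"system_a": system_a, "system_b": system_b}, toy_benchmark)
print(scores)
```

Note what the score does and does not capture: it records agreement with the labels in the workload, and nothing about how those labels were chosen or whether they are appropriate.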

In order to illustrate the limits of a purely technical understanding of benchmark datasets, let’s briefly discuss ImageNet, a large visual dataset which is used in visual object recognition research. Its stated objective is to map “the entire world of objects.”5 The dataset contains more than 14 million hand-annotated images, and was one of the largest created at the time, making it one of the most important benchmark datasets in the field of computer vision. However, as research led by Kate Crawford and Trevor Paglen has illustrated, ImageNet does more than annotate objects with relatively straightforward descriptions (such as “apple”, “round” or “green”).6 The dataset contains a significant number of categorisations which can only be described as depreciative and derogatory. For instance, a photograph of a woman in a bikini is labelled “slattern, slut, slovenly woman, trollop” and a man drinking a beer as “alcoholic, alky, dipsomaniac, boozer, lush, soaker, souse.” So, how is it that morally degrading, misleading, and defamatory descriptions shape the purportedly objective structure of the benchmark dataset?

One explanation lies in the fact that benchmark datasets are typically perceived by the machine learning community as purely technical devices which provide factually objective data. This is the case with ImageNet, in which a particular label attached to an image is often interpreted as a truth claim about the nature of the object or phenomenon depicted. As a result, benchmark datasets operate on, and reinforce, the assumption that they represent some “fact” in the world. A purely technical definition of benchmarks also does not take into account how social, ethical and political factors shape the dataset. Clearly, we should question the assumption that technical objectivity is somehow embedded in benchmark datasets. A different framing of benchmark datasets, one which takes into account the context of the production process that shaped them into existence, is needed.

In this reformulated definition, benchmarks can be understood as socio-technical measurements, governed by specific norms that, in turn, act as standards of evaluation. As we can see, for example with the ImageNet Large Scale Visual Recognition Challenge,7 state-of-the-art performance on the benchmark challenges came to be understood not only as a measure of success on the specific formulation of object recognition represented in the benchmark, but also as a much broader indicator of AI research progress. Researchers who have produced state-of-the-art performance on benchmark datasets have gone on to receive prestigious positions in large industry labs, or received massive funding for AI startups.8 Such a pervasive view of benchmark datasets as value-neutral markers of progress is both misguided and dangerous. Benchmark datasets can be more appropriately understood as tools that institute and enforce specific norms, which establish normative modes of functional accuracy, error rates, and the like. In other words, benchmark datasets are measurement devices that function as a regulative mechanism for assessing and enforcing a particular standard in a technical setting. When understood this way, it becomes clear that the histories of their production and epistemological limits should be thoroughly documented. Current benchmarking practices reveal troubling moral problems, in particular when gold standards become petrified in a field of practice, and are uncritically accepted and reproduced. In such circumstances, benchmarks perpetuate arbitrary assumptions and thereby function as normalization mechanisms.

As researchers in the field of AI, we propose to address the harmful effects of benchmarks, understood as normalizing mechanisms, by cultivating a careful and responsible critique which analyses the formation of meaning inherent in these technologies. By understanding the socio-technical nature of dataset production and the contingent genesis of datasets, we create the conditions to stem the myriad harms mentioned in our introduction. In short, analysing datasets along technical, social, and ethical axes reveals their contestable modes of construction. Datasets have a socio-ethical reality that is distinct from their socio-technical dimension. As such, it’s possible to recognise datasets as the contextual product, or contingent outcome, of normative constraints and conflicts between various stakeholders with visible or concealed agendas. Thus, the representational harms that we have referenced in the context of ImageNet are not simply the unexpected and unfortunate effect of purportedly objective algorithmic systems, but, most importantly, they can be traced back to their interpretive origins, that is, the underlying conditions, presuppositions, practices and values embedded in dataset creation.

By placing (distorted and degrading) labels on how various beings and objects exist in the world, benchmark datasets exert a computational power of naming. This power of naming—which mimics and recalls, for example, Linnaeus’ Promethean efforts in Species Plantarum—operates as a power to classify and catalogue all the existing objects in the world. By labelling an object available to the computational gaze, the dataset grants and withdraws recognition based on the socio-technical assessment of the object’s identity. Left unchecked, this mode of perceiving and labelling the world amounts to accepting and scaling the reproduction and normalization of representational harms.

To address this problem, we seek to validate benchmark datasets by analysing their internal modes of construction. We aim to do so by using techniques of interpretation which establish rules for understanding the relation between data collection practices and the ways in which they shape models of algorithmic development. In particular, a key necessary step is performing an interpretive socio-ethical analysis of how and when training sets should be used in the creation of benchmark datasets. This should cover: the disclosures associated with the work and research on specific datasets; the stakeholders involved and their reflective or unreflective intent, data practices, norms and routines that structure the data collection, and the implicit or explicit values operative in their production; the veiled or specific assumptions of their authors and curators; the relation between the intended goals of the original authors and the actual outcomes; and the adoption of datasets and the related practices of contestation by subsequent researchers. This way, we will be in a position to adequately identify and interrogate the historical conditions of dataset production, document their related norms, practices and axiological hierarchies, and thereby both reveal and prevent the excesses that currently operate in the machine learning pipeline.
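The elements of this analysis could, for instance, be recorded alongside a dataset in a simple structured form. The Python sketch below is one hypothetical way of doing so and not a method the authors prescribe: the class name `InterpretiveRecord`, its field names, and the helper `undocumented_sections` are all assumptions made for illustration, loosely mirroring the items listed above.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class InterpretiveRecord:
    """Hypothetical documentation record for one benchmark dataset."""
    dataset_name: str
    disclosures: List[str] = field(default_factory=list)
    stakeholders_and_intent: List[str] = field(default_factory=list)
    data_practices_and_norms: List[str] = field(default_factory=list)
    operative_values: List[str] = field(default_factory=list)
    author_assumptions: List[str] = field(default_factory=list)
    intended_goals: List[str] = field(default_factory=list)
    actual_outcomes: List[str] = field(default_factory=list)
    adoption_and_contestation: List[str] = field(default_factory=list)


def undocumented_sections(record: InterpretiveRecord) -> List[str]:
    """Names of the sections that still have no entries."""
    return [
        f.name
        for f in fields(record)
        if f.name != "dataset_name" and not getattr(record, f.name)
    ]


# Placeholder content only; it describes no real dataset.
record = InterpretiveRecord(
    dataset_name="ExampleSet",
    intended_goals=["placeholder note on the original authors' stated aim"],
)
gaps = undocumented_sections(record)
print(gaps)
```

A record of this kind only marks where interpretive work remains to be done; the analysis itself cannot be reduced to filling in fields.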

Dr Razvan Amironesei is a research fellow in data ethics at the University of San Francisco and a visiting researcher at Google, currently working on key topics in algorithmic fairness.

Dr Emily Denton is a Senior Research Scientist on Google’s Ethical AI team, studying the norms, values, and work practices that structure the development and use of machine learning datasets.

Dr Alex Hanna is a sociologist and researcher at Google.

Hilary Nicole is a researcher at Google.

Andrew Smart is a researcher at Google working on AI governance, sociotechnical systems, and basic research on conceptual foundations of AI.

All authors contributed equally.


1. Smith, A. (2019, 12 December) AI: Discriminatory Data In, Discrimination Out. SHRM. https://www.shrm.org/resourcesandtools/legal-and-compliance/employment-law/pages/artificial-intelligence-discriminatory-data.aspx; Richardson, R., Schultz, J. & Crawford, K. (2019) Dirty data, bad predictions: How civil rights violations impact police data, predictive policing systems, and justice. New York University Law Review Online 94:192. SSRN: 3333423

2. Waseem, Z., Lulz, S., Bingel, J. & Augenstein, I. (2021) Disembodied Machine Learning: On the Illusion of Objectivity in NLP. https://arxiv.org/abs/2101.11974

3. Benjamin, R. (2019) Race After Technology: Abolitionist Tools for the New Jim Code. Medford, MA: Polity Press, p.122.

4. Butterfield, A., Ekembe Ngondi, G. & Kerr, A. (Eds) (2016) A Dictionary of Computer Science, 7th edition. Oxford: Oxford University Press.

5. Gershgorn, D. (2017, 26 July) The data that transformed AI research – and possibly the world. Quartz. https://qz.com/1034972/the-data-that-changed-the-direction-of-ai-research-and-possibly-the-world/

6. https://excavating.ai/

7. Russakovsky, O., Deng, J., Su, H., et al. (2015) ImageNet Large Scale Visual Recognition Challenge. https://arxiv.org/abs/1409.0575

8. Metz, C. (2021, 16 March) The secret auction that set off the race for AI supremacy. Wired. https://www.wired.com/story/secret-auction-race-ai-supremacy-google-microsoft-baidu/
