From Human to Data to Dataset: Mapping the Traceability of Human Subjects in Computer Vision Datasets

Morgan Klaus Scheuerman
Published in ACM CSCW
Aug 4, 2023

A crowd of people crossing the street in Shibuya
Photo by ryoji__iwata on Unsplash

This blog post summarizes the paper "From Human to Data to Dataset: Mapping the Traceability of Human Subjects in Computer Vision Datasets," which examines the challenges facing human data subjects in computer vision datasets. The paper will be presented at the 26th ACM Conference on Computer-Supported Cooperative Work and Social Computing, a top venue for social computing scholarship. It will also be published in the journal Proceedings of the ACM on Human-Computer Interaction (PACM HCI).

Computer vision is a “data hungry” field. Researchers and practitioners who work on computer vision tasks, such as facial recognition, emphasize the need for vast amounts of data. Humans are seen as a data resource that can be converted into datasets. Many common computer vision datasets consist entirely of images depicting real people, scraped from the web or captured on public streets without their knowledge or consent. Platforms like Flickr, YouTube, and Instagram have become a robust data resource.

We can imagine how this practice might impact the hundreds or thousands of human subjects in those images. For example, consider the hypothetical Jordan, an events photographer who uploads a portfolio of their work to Flickr, an online image hosting website. Their account is filled with photos of weddings, family birthday parties, and live concerts: hundreds of images of people celebrating moments large and small. Both Jordan and their subjects are unaware that those images have been scraped by multiple researchers and aggregated with other Flickr users’ images into multiple datasets. Datasets for facial detection, scene understanding, gender classification, and even facial beauty ratings all include Jordan’s images and the faces contained therein. The images of Jordan’s subjects then go on to fuel computer vision research across industry and academia. Some of the resulting models may be deployed commercially, so the data used to train them contributes to millions of dollars in sales. Years later, some of those datasets have disappeared, their creators silently retiring them; but copies still exist in other data repositories, and the images still exist within models circulating in academic and production settings. Even if Jordan’s subjects were aware that their images had been used in one dataset, how could they trace even that single dataset’s life to all the other places it has ended up?

Our work focuses on the traceability of data subjects: the ability to trace a single piece of data throughout a dataset’s lifecycle. We trace the practices of computer vision dataset development, from original data sources through dataset dissemination and use. We systematically examine moments of transformation within the dataset development pipeline where the human data subject is fundamental, and where that subject also becomes increasingly difficult to trace.
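
To make the idea of tracing a single piece of data more concrete, here is a minimal sketch in Python. It is not a method from the paper. It assumes, optimistically, that a lineage record exists listing what was derived from what. Every name in it (`derived_from`, `trace_downstream`, the photo, dataset, and model identifiers) is hypothetical and chosen only for illustration.

```python
from collections import deque

# Hypothetical lineage record: each key is a source artifact, and its value
# lists the artifacts derived from it. None of these names refer to real data.
derived_from = {
    "flickr_photo_123": ["faces_dataset", "scenes_dataset"],
    "faces_dataset": ["faces_dataset_mirror", "gender_classifier_model"],
    "scenes_dataset": ["scene_tagger_model"],
    "faces_dataset_mirror": [],
    "gender_classifier_model": [],
    "scene_tagger_model": [],
}


def trace_downstream(start, edges):
    """Return every artifact reachable from `start` via breadth-first search."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        # Artifacts with no recorded derivatives simply contribute nothing.
        for child in edges.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Everywhere the hypothetical photo ended up, according to the record.
destinations = trace_downstream("flickr_photo_123", derived_from)
```

The traversal itself is simple. The hard part in the Jordan scenario is that no one keeps a record like `derived_from`, so a data subject has nothing to traverse.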

Specifically, we conducted a content analysis of 125 unique computer vision datasets that stem from public data, either from the web, from physical public spaces, or from public records. Employing both structured content analysis and qualitative content analysis, we present findings that describe dataset curation processes: where data is often collected from, what kind of data subjects are often featured in datasets, and how those datasets are disseminated to research communities.

A screenshot of a table in the paper describing findings related to each stage of the study
Summary of findings relating to dataset collection and packaging. Totals do not always add up to 100% because some datasets fall into more than one category (e.g., both regular people and celebrities).
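
A small worked example may help show why such totals can exceed 100%. The sketch below uses four made-up datasets and two labels. The data, the variable names, and the `category_percentages` helper are all assumptions for illustration and do not reproduce the paper's coding scheme or its numbers.

```python
from collections import Counter

# Hypothetical coding of four datasets. A dataset may carry several labels.
coded_datasets = {
    "dataset_a": {"regular people"},
    "dataset_b": {"celebrities"},
    "dataset_c": {"regular people", "celebrities"},
    "dataset_d": {"regular people"},
}


def category_percentages(coded):
    """Percent of datasets carrying each label, counted per label."""
    counts = Counter(label for labels in coded.values() for label in labels)
    total = len(coded)
    return {label: 100 * count / total for label, count in counts.items()}


percentages = category_percentages(coded_datasets)
# "regular people" appears in 3 of 4 datasets and "celebrities" in 2 of 4,
# so the shares are 75% and 50%, which sum to 125%.
```

Because `dataset_c` is counted under both labels, the per-label shares overlap and their sum is larger than 100%.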

We highlight two major issues in current dataset practices preventing data subject traceability: awareness and control. We aim to advance what we call an ethics of traceability: the issues surrounding data subject awareness of their data usage and the possibility of control over their data. We argue that both awareness and control must be present to properly incorporate an ethics of traceability — allowing data subjects to be informed and aware of their data use and to exercise control over their data throughout the dataset curation pipeline.

A screenshot of a table in the paper outlining potential considerations for improving the ethics of traceability throughout dataset development
Potential considerations for Awareness and Control for each phase and step of the dataset pipeline.

We contribute key points of intervention across the dataset curation pipeline for dataset authors to attend to issues of traceability. We propose considerations for enabling both awareness and control on behalf of the data subjects featured in datasets along these intervention points.

Morgan Klaus Scheuerman, Katy Weathington, Tarun Mugunthan, Emily Denton, and Casey Fiesler. 2023. From Human to Data to Dataset: Mapping the Traceability of Human Subjects in Computer Vision Datasets. Proc. ACM Hum.-Comput. Interact. 7, CSCW1, Article 55 (April 2023), 33 pages. https://doi.org/10.1145/3579488
