“But by serving as benchmarks and tools for calibration, datasets can also haunt infrastructures of Artificial Intelligence.”–Eryk Salvaggio

In a new blog post, Eryk Salvaggio, @cyberneticforests, a 2024 Flickr Foundation Research Fellow, explores where archives, datasets, and infrastructures intersect, shaping the nature of AI-generated images. Salvaggio introduces the concept of “ghosts” that haunt these synthetic creations—absences and decisions embedded in the training data and infrastructure that guide the final output.

“Every image generated by AI calls up a line of ghosts. They haunt the training data, where the contexts of photographs are reduced to the simplest of descriptions. They linger in the decisions of engineers and designers in what labels to use. The ghosts that haunt the generated image are hidden by design, but we can find them through their traces. We just need to know how to look.”–Salvaggio

Salvaggio’s exploration of Flickr’s multifaceted role as an archive, dataset, and infrastructure illuminates a profound semiotic shift. As images evolve from individual context to scaled abstraction, their meaning and interpretation radically transform. They cease to be mere visual representations and become data points in a vast system. This shift fundamentally reshapes our understanding of their significance and potential influence on AI systems.

The author’s analysis of the FFHQ dataset, derived from 70,000 Flickr portraits, reveals how the metadata and context associated with these images can perpetuate biases in generated outputs. For example, the dataset’s underrepresentation of Black women, as reflected in the metadata and tagging, constrained the diversity of faces produced by StyleGAN2, showing how the semiotics of digital photography shape AI-generated content.

Through this analysis, Salvaggio invites readers to contemplate the far-reaching implications of the unseen forces that shape the AI-generated world we encounter. 

A second part of the essay, still to be published, focuses on YFCC100M, a dataset of 99.2 million photos released in June 2014. (Note: Salvaggio and I are both connected to the Flickr Foundation.)

Part Two is coming; I’m excited to read more. 
