The DeepSpeak-Agentic Dataset

2026-06-02 • Artificial Intelligence

AI summary

The authors introduce DeepSpeak-Agentic, a collection of videos totaling over 37 hours showing people talking with AI agents that have a virtual body. They use this dataset to test how well computers can tell if a voice, video, or text comes from an AI or a human. The dataset also helps study how humans interact with these AI agents and serves as a reference for improving AI models that create realistic voices and faces. The authors also created a system that automatically pairs humans with AI agents to record these conversations and separates each speaker's audio and video.

Embodied AI, Forensic identification, Large-language models, AI-generated voices, AI-generated faces, Human-agent interaction, Audiovisual dataset, Crowd workers, Data capture system, Semi-structured conversations

Authors

Sarah Barrington, Maty Bohacek, Hany Farid

Abstract

We present DeepSpeak-Agentic, a dataset of videos comprising over 37 hours of semi-structured conversations between a human and an embodied AI agent. We use this dataset to evaluate the automatic forensic identification (audio, video, or text) of AI agents, study the nature of human-agent interactions, and provide a benchmark for future advances in the large-language models and AI-generated voices and faces that power embodied AI agents. We also contribute a scalable data-capture system that creates agents, automatically pairs them with human crowd workers, records audiovisual conversations across specified scenarios, and identifies and separates the human and agent in the combined stream.
