The DeepSpeak-Agentic Dataset

2026-06-02
Artificial Intelligence

AI summary

The authors introduce DeepSpeak-Agentic, a collection of videos totaling over 37 hours showing people talking with AI agents that have a virtual body. They use this dataset to test how well computers can tell if a voice, video, or text comes from an AI or a human. The dataset also helps study how humans interact with these AI agents and serves as a reference for improving AI models that create realistic voices and faces. The authors also created a system that automatically pairs humans with AI agents to record these conversations and separates each speaker's audio and video.

Embodied AI, Forensic identification, Large-language models, AI-generated voices, AI-generated faces, Human-agent interaction, Audiovisual dataset, Crowd workers, Data capture system, Semi-structured conversations
Authors
Sarah Barrington, Maty Bohacek, Hany Farid
Abstract
We present DeepSpeak-Agentic, a dataset of videos comprising over 37 hours of semi-structured conversations between a human and an embodied AI agent. We use this dataset to evaluate the automatic forensic identification (audio, video, or text) of AI agents, study the nature of human-agent interactions, and provide a benchmark for future advances in the large language models and AI-generated voices and faces that power embodied AI agents. We also contribute a scalable data-capture system that creates agents, automatically pairs them with human crowd workers, records audiovisual conversations across specified scenarios, and identifies and separates the human and agent in the combined stream.