Vision-language Models for Driver Monitoring Systems: A Driver Activity Description Dataset
2026-06-01 • Computer Vision and Pattern Recognition
AI summary
The authors studied how computer models that understand text and images can recognize small but important driver actions. They found that existing models, trained on general data, had trouble describing detailed driver behaviors accurately. To fix this, they created a new dataset with detailed descriptions of driver actions and retrained the models. Their improved model performed better on another driving dataset, showing that training with detailed descriptions helps the models understand driver behavior more precisely. They also emphasize the need for more varied datasets to improve these systems further.
vision-language models, driver monitoring systems, Drive&Act dataset, fine-grained action recognition, LLM-based scoring, zero-shot learning, cross dataset evaluation, Driver Monitoring Dataset (DMD), model fine-tuning, dataset creation
Authors
David J. Lerch, Sarath Mulugurthi, Manuel Martin, Frederik Diederichs, Rainer Stiefelhagen
Abstract
Understanding subtle driver actions is essential for building reliable driver monitoring systems. Existing vision-language models (VLMs) are trained on general datasets and struggle to recognize fine distinctions in driver behaviors. This paper addresses this limitation by creating a detailed natural language version of the Drive&Act dataset. We evaluate three VLMs on our new benchmark using LLM-based scoring methods. Their performance shows that they cannot reliably generate accurate fine-grained driver activity descriptions. Based on the labeled Drive&Act dataset, we create a new Drive&Act description dataset containing fine-grained descriptions to train VLMs on driver activity understanding. Cross-dataset evaluation on the Driver Monitoring Dataset (DMD) shows that the VLM fine-tuned on our Drive&Act description dataset generalizes well to actions in DMD: it achieves an ACCR score of 76, outperforming the zero-shot VLM baseline, which scores 66. These findings demonstrate that adapting VLMs with richly described driver actions can significantly improve their ability to interpret driver behavior, while also highlighting the need for more diverse datasets to support broader generalization in future applications. Our Drive&Act description dataset and code will be publicly available on GitHub.