Omni-Persona: Systematic Benchmarking and Improving Omnimodal Personalization
2026-05-11 • Computer Vision and Pattern Recognition
AI summary
The authors created Omni-Persona, a new benchmark to test how well AI models can personalize responses using text, images, and audio together. They introduced a way to measure if models correctly use personal information or appropriately say they don’t know when the information is missing. Their tests showed that current models often struggle more with audio than images, bigger models don’t always perform better, and some training methods improve reliability but can reduce creativity. This work helps highlight the challenges in building AI that understands multiple types of data and personal context accurately.
multimodal models, personalization, grounding, benchmark, calibrated accuracy, cross-modal routing, reinforcement learning from verified responses (RLVR), supervised fine-tuning (SFT), absent-persona, large language models
Authors
Yeongtak Oh, Dongwook Lee, Sangkwon Park, Heeseung Kim, Sungroh Yoon
Abstract
While multimodal large language models have advanced across text, image, and audio, personalization research remains primarily vision-language: unified omnimodal benchmarks that jointly cover text, image, and audio are still scarce, and existing evaluations lack the methodological rigor to account for absent-persona scenarios or to study grounding systematically. We introduce Omni-Persona, the first comprehensive benchmark for omnimodal personalization. We formalize the task as cross-modal routing over the \emph{Persona Modality Graph}, encompassing 4 task groups and 18 fine-grained tasks across ${\sim}750$ items. To rigorously diagnose grounding behavior, we propose \emph{Calibrated Accuracy ($\mathrm{Cal}$)}, which jointly rewards correct grounding and appropriate abstention, incorporating absent-persona queries within a unified evaluation framework. In our experiments, three diagnostic findings emerge: (i) open-source models show a consistent audio-vs-visual grounding gap that RLVR partially narrows via dense rule-based supervision; (ii) answerable recall and parameter scale are incomplete diagnostics, since strong recall can coexist with absent-persona hallucination and larger models do not always achieve higher $\mathrm{Cal}$, exposing calibration as a separate evaluation axis; and (iii) SFT is bounded by the difficulty of constructing annotated ground-truth supervision at scale, while RLVR generalizes more consistently through outcome-level verifiable feedback, yet drifts toward conservative behavior and lower generation quality under our reward design. Omni-Persona thus serves as a diagnostic framework that surfaces the pitfalls of omnimodal personalization, guiding future post-training and reward design.
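The abstract describes $\mathrm{Cal}$ as jointly rewarding correct grounding on answerable queries and appropriate abstention on absent-persona queries, but does not give a formula. The sketch below is a hypothetical instantiation under the assumption that $\mathrm{Cal}$ averages, over all items, correctness on answerable items and abstention on absent-persona ones; the `Item` structure and the `ABSTAIN` marker are illustrative, not the paper's actual protocol.

```python
from dataclasses import dataclass
from typing import Optional

ABSTAIN = "I don't know"  # hypothetical canonical abstention response


@dataclass
class Item:
    answerable: bool        # is the persona information present for this query?
    prediction: str         # model output
    gold: Optional[str]     # ground-truth answer (None for absent-persona items)


def calibrated_accuracy(items: list[Item]) -> float:
    """One plausible Cal: credit for a correct answer when the persona
    fact is available, credit for abstaining when it is absent."""
    score = 0
    for it in items:
        if it.answerable:
            score += int(it.prediction == it.gold)
        else:
            score += int(it.prediction == ABSTAIN)
    return score / len(items)


# Example: a model that answers one of two answerable queries correctly
# and abstains on the single absent-persona query scores 2/3.
items = [
    Item(True, "blue", "blue"),       # correct grounding
    Item(True, "red", "blue"),        # grounding error
    Item(False, ABSTAIN, None),       # appropriate abstention
]
print(calibrated_accuracy(items))
```

Under this reading, a model that always abstains or always answers is penalized on one side of the split, which is exactly the failure mode the paper's finding (ii) isolates: strong answerable recall can coexist with absent-persona hallucination.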