VinQA: Visual Elements Interleaved Long-form Answer Generation for Real-World Multimodal Document QA

2026-06-15

Computer Vision and Pattern Recognition, Artificial Intelligence
AI summary

The authors created VinQA, a dataset for training and evaluating AI models that answer questions about documents containing both text and visual elements such as tables, charts, and photos, with answers that cite those visuals alongside the supporting text. They tested two ways of feeding document pages to a model: one that encodes the whole page with visual elements marked by bounding boxes, and another that extracts text and crops visual elements before encoding them separately. They also propose an evaluation framework called M-GroSE to assess answer completeness, relevancy, faithfulness, and unanswerability, and they report Visual Source F1 to measure how accurately visuals are cited. Their results show that proprietary frontier models still score best overall, but fine-tuned open models improve substantially, and the two encoding methods reach comparable performance after training. Finally, an MLLM-based judge confirmed that fine-tuned models place visual elements at semantically appropriate positions with faithful supporting text.

multimodal large language models, document question answering, long-form answer generation, visual element citation, page encoding, modality encoding, M-GroSE evaluation, Visual Source F1, fine-tuning, Visual G-Eval
Authors
Young Rok Jang, Hyesoo Kong, Kyunghwan An, Jae Sub Huh, Gyeonghun Kim, Stanley Jungkyu Choi
Abstract
Real-world documents combine text with tables, charts, photographs, and diagrams arranged in diverse layouts, yet existing research on multimodal large language models (MLLMs) for document QA predominantly produces text-only responses, underutilizing these visual elements. We introduce VinQA, a dataset for long-form answer generation where cited visual elements are explicitly interleaved with their supporting text and grounded in relevant document pages. To support this task, we study two encoding methods for feeding raw document page images into an MLLM, along with their visual-element citation mechanisms: (1) Page Encoding, which directly encodes full-page images with bounding boxes of visual elements and treats these boxed regions as citable units; and (2) Modality Encoding, which parses each page to extract text and crop visual elements, encodes them separately, and uses these cropped elements as citable units. In our experiments, we propose M-GroSE, a multimodal evaluation framework extending GroUSE to assess answers along four dimensions: completeness, answer relevancy, faithfulness, and unanswerability. We additionally report Visual Source F1 to directly measure visual citation accuracy. Although proprietary frontier models still achieve the best overall scores on the VinQA test split, fine-tuning open Qwen2.5-VL models on the training split substantially improves their performance and narrows this gap. Modality Encoding is initially more robust for complex documents with long text, many visual elements, and diverse citation requirements. After training on VinQA, however, Page Encoding reaches a comparable level, competing effectively even without the explicit parsing used in Modality Encoding. Finally, Visual G-Eval, an MLLM-based judge, confirms that fine-tuned models insert visual elements at semantically appropriate positions with faithful supporting text.
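
The abstract does not spell out how Visual Source F1 is computed, so the following is only a minimal sketch under stated assumptions: each citable visual element has a string ID, the score compares the set of IDs cited in a generated answer against the set of gold IDs, and an answer with no predicted and no gold citations scores 1.0. The function name `visual_source_f1` and the example IDs are hypothetical and are not taken from the paper.

```python
from typing import Iterable


def visual_source_f1(predicted_ids: Iterable[str], gold_ids: Iterable[str]) -> float:
    """Set-based F1 between cited and gold visual-element IDs.

    Assumption: citations are compared as unordered sets of IDs, so
    duplicate citations and citation order do not affect the score.
    """
    predicted = set(predicted_ids)
    gold = set(gold_ids)

    # Assumed convention: nothing to cite and nothing cited is a perfect match.
    if not predicted and not gold:
        return 1.0

    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0

    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the answer cites one correct and one wrong element,
# and misses one gold element, so precision = recall = 0.5.
score = visual_source_f1(["page2_fig1", "page3_tab2"], ["page2_fig1", "page4_fig3"])
```

Under these assumptions, the score rewards answers that cite exactly the relevant visual elements and penalizes both spurious and missing citations equally.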