Double Triangle Annotation: A Scalable Human-in-the-Loop Framework for High-Precision Historical Document Annotation

2026-05-25 | Computation and Language

AI summary

The authors propose Double Triangle Annotation, a method for extracting information from historical documents accurately and efficiently. Two different AI models label each document separately; if both agree on a label, it is accepted automatically, and if not, a human jury decides. A second layer cross-checks two such systems against each other and sends only the remaining conflicts to a domain expert. This way, most of the work is automated while errors stay very low. They tested the approach on French medical directories from 1887-1906 and released the resulting benchmark dataset for future research.

structured information extraction, historical documents, Multimodal Large Language Models, human-in-the-loop, cross-model consensus, Word Error Rate, annotation framework, data labeling, automated pipelines, benchmark dataset
Authors
Yi Ren
Abstract
Evaluating structured-information extraction from historical documents at scale requires high-precision ground-truth annotations, yet traditional manual labeling is expensive and fully automated pipelines built on large language models are prone to hallucination. We propose Double Triangle Annotation, a two-layer human-in-the-loop framework that leverages cross-model consensus to automate the majority of annotation work while ensuring high-precision outputs. In the first layer, two architecturally independent Multimodal Large Language Models annotate each document in parallel; when they agree, the label is auto-accepted, and disagreements are routed to a human jury. A second layer cross-checks two such systems against each other, escalating residual conflicts to a domain expert. The framework rests on a single assumption -- error independence between models -- requires no distributional priors or task-specific calibration, and becomes more autonomous as model capability improves. On the Guides Rosenwald, a corpus of French medical directories spanning 1887-1906, the framework achieves a final Word Error Rate of 0.003. Applied at scale, model consensus auto-accepts over 85% of 13,595 fields. We release the resulting benchmark -- the first structured-extraction ground truth for the Rosenwald Guides -- to support future work on historical document processing.
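
The two-layer routing described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the author's implementation. The names `normalize`, `Decision`, `triangle`, and `double_triangle` are hypothetical, as are the `jury` and `expert` callbacks that stand in for human decisions. The abstract does not say how agreement is tested or whether the two first-layer systems share models, so the sketch assumes a whitespace- and case-insensitive string comparison and simply takes one pair of model labels per system.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# A resolver receives the two conflicting labels and returns the chosen one.
# In practice this would be a human jury or a domain expert (hypothetical here).
Resolver = Callable[[str, str], str]


def normalize(value: str) -> str:
    """Collapse whitespace and case before comparing (an assumption of this sketch)."""
    return " ".join(value.split()).casefold()


@dataclass
class Decision:
    label: str
    source: str  # "consensus", "jury", or "expert"


def triangle(label_a: str, label_b: str, jury: Resolver) -> Decision:
    """First layer: auto-accept when two models agree, otherwise ask the jury."""
    if normalize(label_a) == normalize(label_b):
        return Decision(label_a, "consensus")
    return Decision(jury(label_a, label_b), "jury")


def double_triangle(
    labels_1: Tuple[str, str],
    labels_2: Tuple[str, str],
    jury_1: Resolver,
    jury_2: Resolver,
    expert: Resolver,
) -> Decision:
    """Second layer: cross-check two first-layer systems, escalate conflicts."""
    first = triangle(labels_1[0], labels_1[1], jury_1)
    second = triangle(labels_2[0], labels_2[1], jury_2)
    if normalize(first.label) == normalize(second.label):
        # Report "consensus" only if no human was involved in either system.
        both_auto = first.source == "consensus" and second.source == "consensus"
        return Decision(first.label, "consensus" if both_auto else "jury")
    return Decision(expert(first.label, second.label), "expert")


# Example with stand-in resolvers that always pick the first label.
pick_first: Resolver = lambda a, b: a
result = double_triangle(
    ("Dr. Martin", "dr.  martin"),
    ("Dr. Martin", "Dr. Martin"),
    pick_first,
    pick_first,
    pick_first,
)
print(result)
```

Under this sketch a field reaches the expert only when the two first-layer outcomes still disagree, which mirrors the abstract's description of escalating residual conflicts. The stricter or looser the comparison in `normalize`, the fewer or more fields are auto-accepted, so that choice would need to match the task.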