A Registry-Bound LLM Pipeline for Evidence-Grounded Trait Extraction across Tropical Plants, Aquatic Species, and Exotic Pets

2026-05-31

Computation and Language
AI summary

The authors created a system using large language models (LLMs) to automatically extract and organize information about traits of tropical plants, aquatic animals, and pets from a large encyclopedia. Their system ensures the data is reliable by using a strict list of trait categories, attaching exact quotes from sources as evidence, labeling confidence levels, and keeping past data versions. They processed data for over 400,000 species and stored more than 5 million trait records, most with high confidence. They performed several checks to verify the quotes matched source texts and that the extracted traits made sense, but they do not claim every record is perfect without human review. The main contribution is their four-part method for making LLM-extracted data auditable and trustworthy.

large language models, trait extraction, tropical species, data provenance, confidence labeling, evidence grounding, data validation, schema, version control, natural language processing
Authors
Jeff Wang
Abstract
We describe a registry-bound large-language-model extraction pipeline that produces evidence-grounded structured trait records at scale for cultivated tropical plant, aquatic, and pet species. Four mechanisms render LLM-derived rows auditable: a versioned 39-key closed-vocabulary trait registry that constrains every admitted value to a typed schema; a per-row verbatim evidence quote that ties each value to source text; a per-row confidence label (high or medium; low-confidence rows are dropped before persistence); and multi-version preservation. Applied to 409,880 publishable species from the Tropical Species Encyclopedia, the pipeline executed 706,220 runs and persisted 5,489,881 trait records across 409,820 species (99.985%), 81.57% of them at high confidence. We report three validation layers in descending order of evidentiary strength: at full population, 90.12% of 5,427,588 evidence-bearing rows have their quote as a verbatim source substring (93.49% excluding one compliance meta-trait); a quote-supports-value audit on n=100 stratified non-red-zone rows yielded 100/100 (lower bound 96.30%); and a face-validity check on n=50 red-zone rows yielded 50/50 Accept (lower bound 92.86%). Per-record correctness is not claimed; all records remain pending human curation. The contribution is the four-mechanism framework.
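The four mechanisms can be illustrated with a minimal sketch. This is not the authors' implementation: the registry keys, allowed values, version tag, and the names `TraitRecord`, `admit`, and `persist` are all hypothetical, chosen only to show how a closed vocabulary, an evidence quote, a confidence filter, and append-only versioning might fit together. The sketch assumes the verbatim-substring check is recorded as a flag rather than used as a gate, since the abstract reports it as a validation measure.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical two-key stand-in for the 39-key registry; keys and values are invented.
REGISTRY: Dict[str, set] = {
    "light_requirement": {"low", "medium", "high"},
    "water_type": {"freshwater", "brackish", "marine"},
}
REGISTRY_VERSION = "v1"  # hypothetical version tag


@dataclass(frozen=True)
class TraitRecord:
    species_id: str
    trait_key: str
    value: str
    evidence_quote: str
    confidence: str
    registry_version: str
    quote_is_verbatim: bool  # True if the quote is an exact substring of the source


def admit(species_id: str, trait_key: str, value: str, evidence_quote: str,
          confidence: str, source_text: str) -> Optional[TraitRecord]:
    """Return a record if it passes the registry and confidence checks, else None."""
    if confidence not in {"high", "medium"}:
        return None  # low-confidence rows are dropped before persistence
    allowed = REGISTRY.get(trait_key)
    if allowed is None or value not in allowed:
        return None  # key or value falls outside the closed vocabulary
    verbatim = bool(evidence_quote) and evidence_quote in source_text
    return TraitRecord(species_id, trait_key, value, evidence_quote,
                       confidence, REGISTRY_VERSION, verbatim)


# Append-only history: earlier versions of a trait are kept, never overwritten.
History = Dict[Tuple[str, str], List[TraitRecord]]


def persist(history: History, record: TraitRecord) -> None:
    history.setdefault((record.species_id, record.trait_key), []).append(record)


if __name__ == "__main__":
    source = "This species prefers brackish water and bright light."
    history: History = {}
    rec = admit("sp-001", "water_type", "brackish",
                "prefers brackish water", "high", source)
    if rec is not None:
        persist(history, rec)
    print(len(history[("sp-001", "water_type")]), rec.quote_is_verbatim)
```

Storing the verbatim flag on the row, instead of rejecting non-matching quotes, keeps the substring rate measurable after the fact, which is the kind of full-population statistic the abstract reports.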
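The reported percentages can be reproduced with a few lines of arithmetic. The abstract does not name the method behind its lower bounds; the sketch below assumes a 95% Wilson score interval with z = 1.96, which yields the same two figures. The function name `wilson_lower_bound` is illustrative.

```python
import math


def wilson_lower_bound(successes: int, n: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    z2 = z * z
    centre = p_hat + z2 / (2 * n)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n * n))
    return (centre - margin) / (1 + z2 / n)


if __name__ == "__main__":
    # Species coverage: species with at least one persisted record / publishable species.
    coverage = 409_820 / 409_880
    print(f"coverage: {coverage:.3%}")                                    # 99.985%
    print(f"n=100, 100/100: {wilson_lower_bound(100, 100):.2%}")          # 96.30%
    print(f"n=50, 50/50: {wilson_lower_bound(50, 50):.2%}")               # 92.86%
```

When every sampled row passes, the Wilson lower bound reduces to n / (n + z²), so the bound is driven entirely by sample size: the n=50 audit cannot certify more than about 92.9% even with a perfect result.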