ScheMatiQ: From Research Question to Structured Data through Interactive Schema Discovery

2026-04-10

Computation and Language
AI summary

The authors developed ScheMatiQ, a tool that helps answer complex questions from large collections of documents by automatically creating a structured format to organize information. Instead of manually designing how to label data, ScheMatiQ uses a large language model to generate schemas and databases from the documents. It also offers a web interface for experts to guide and correct the results. The authors tested it with professionals in law and biology, showing it can support real research tasks. They provide ScheMatiQ openly for others to use and improve.

large language model, schema, information extraction, natural language processing, annotation, structured database, text mining, human-in-the-loop, legal informatics, computational biology
Authors
Shahar Levy, Eliya Habba, Reshef Mintz, Barak Raveh, Renana Keydar, Gabriel Stanovsky
Abstract
Many disciplines pose natural-language research questions over large document collections, and answering them typically requires structured evidence. Traditionally, this evidence is obtained by manually designing an annotation schema and exhaustively labeling the corpus, a slow and error-prone process. We introduce ScheMatiQ, which takes a question and a corpus and, through calls to a backbone LLM, produces a schema and a grounded database, with a web interface that lets users steer and revise the extraction. In collaboration with domain experts, we show that ScheMatiQ yields outputs that support real-world analysis in law and computational biology. We release ScheMatiQ as open source with a public web interface, and invite experts across disciplines to use it with their own data. All resources, including the website, source code, and demonstration video, are available at: www.ScheMatiQ-ai.com
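
To make the terms "schema" and "grounded database" concrete, the following is a minimal illustrative sketch, not ScheMatiQ's actual code or API. It assumes a schema is a list of named columns and that a database cell counts as grounded only when it carries a verbatim quote from its source document. All names here (`Column`, `Cell`, `build_database`, `toy_extract`) are hypothetical, and `toy_extract` is a keyword-matching stand-in for the backbone LLM call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Column:
    """One schema field (hypothetical structure, not ScheMatiQ's own)."""
    name: str
    description: str


@dataclass
class Cell:
    """An extracted value plus the verbatim quote that supports it."""
    value: Optional[str]
    evidence: Optional[str]


# An extractor maps (document text, column) to (value, evidence quote).
Extractor = Callable[[str, Column], Tuple[Optional[str], Optional[str]]]


def build_database(
    schema: List[Column],
    corpus: Dict[str, str],
    extract: Extractor,
) -> Dict[str, Dict[str, Cell]]:
    """Fill one row per document, keeping only values whose evidence
    appears verbatim in that document (the assumed grounding rule)."""
    database: Dict[str, Dict[str, Cell]] = {}
    for doc_id, text in corpus.items():
        row: Dict[str, Cell] = {}
        for column in schema:
            value, evidence = extract(text, column)
            grounded = evidence is not None and evidence in text
            row[column.name] = Cell(value, evidence) if grounded else Cell(None, None)
        database[doc_id] = row
    return database


def toy_extract(text: str, column: Column) -> Tuple[Optional[str], Optional[str]]:
    """Hypothetical stand-in for an LLM call: return the last word of the
    first sentence that mentions the column name, with that sentence as evidence."""
    for sentence in text.split(". "):
        if column.name in sentence.lower():
            return sentence.split()[-1].rstrip("."), sentence
    return None, None


schema = [Column("verdict", "Outcome of the case")]
corpus = {
    "case_1": "The court heard the appeal. The verdict was acquittal.",
    "case_2": "No decision is recorded here.",
}
db = build_database(schema, corpus, toy_extract)
```

In this sketch, a document that never mentions a column yields an empty cell rather than a guessed value, and any value whose supporting quote cannot be found in the source text is discarded. A real system would replace `toy_extract` with a model call and would also need to propose the schema itself from the question, which this example takes as given.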