Translating Natural Language to Strategic Temporal Specifications via LLMs
2026-06-29 • Multiagent Systems
Multiagent SystemsArtificial Intelligence
AI summaryⓘ
The authors developed a way to turn natural language descriptions of strategic goals into formal logic formulas needed to check multi-agent systems. They created a new dataset for training models to do this translation since none existed before. Their fine-tuned models perform as well as strong commercial tools but allow processing data locally. They also combined their tool with a model checker so people without expert knowledge can specify complex strategic properties in plain language. The authors noticed that more powerful models are less reliable at judging translation quality compared to simpler ones.
Multi-Agent Systemsformal specificationsstrategy logicsATL/ATL*Large Language Modelssemantic accuracymodel checkingnatural language processing
Authors
Marco Aruta, Francesco Improta, Vadim Malvone, Aniello Murano, Vladana Perlic
Abstract
A rigorous formalization of system requirements is a fundamental prerequisite for the verification of Multi-Agent Systems (MAS). However, writing correct formal specifications is well known as an error-prone, time-consuming, and expertise-intensive task. This difficulty is further accentuated in MAS, where requirements must capture strategic abilities and temporal objectives. At present, there is no established methodology for deriving MAS specifications from natural language. We present a framework for translating Natural Language descriptions of strategic requirements into well-formed ATL/ATL* formulas using Large Language Models (LLMs). Since no available dataset supports supervised learning for the NL-to-ATL/ATL* translation task, we create and curate a novel expert-validated dataset, employed for training and evaluating fine-tuned models. On a held-out test set, evaluated under the LLM judge that best agrees with expert annotations, in-domain fine-tuning of small open-weight models (3 - 7B parameters) matches strong few-shot proprietary API baselines. Our best fine-tuned system reaches 0.84 semantic accuracy, statistically on par with 0.86 for the strongest few-shot proprietary baseline, while keeping requirements on-premises. We further find that judge reliability is inverse to generator strength. The open-weight Llama-3.3-70B tracks human verdicts most closely, whereas the strongest proprietary models are the least reliable judges, over-rejecting faithful paraphrases of the reference. To assess the practical applicability of the generated specifications, we embed our tool to an existing strategic logics model checker, enabling non-expert users to specify strategic properties in natural language.