Benchmarking System Dynamics AI Assistants: Cloud Versus Local LLMs on CLD Extraction and Discussion

2026-04-20

Artificial Intelligence · Human-Computer Interaction · Machine Learning
AI summary

The authors evaluated various large language models, both cloud-based and locally run, on tasks related to understanding and discussing system dynamics using two benchmarks. They found cloud models generally performed better on causal diagram extraction, while local models showed mixed results on discussion tasks, particularly struggling with error fixing due to memory limits. They also studied how different model types, backends, and quantization levels affected performance, discovering that the choice of backend had a bigger impact than quantization. Additionally, they shared practical guidance for running large models on Apple Silicon hardware. Their work helps clarify strengths and weaknesses of different models and setups for system dynamics AI assistance.

System Dynamics · Causal Loop Diagram · Large Language Models · Instruction-Tuning · Quantization · Backend · JSON Schema · Zero-Shot · Apple Silicon · Model Evaluation
Authors
Terry Leitch
Abstract
We present a systematic evaluation of large language model families -- spanning both proprietary cloud APIs and locally-hosted open-source models -- on two purpose-built benchmarks for System Dynamics AI assistance: the \textbf{CLD Leaderboard} (53 tests, structured causal loop diagram extraction) and the \textbf{Discussion Leaderboard} (interactive model discussion, feedback explanation, and model building coaching). On CLD extraction, cloud models achieve 77--89\% overall pass rates; the best local model reaches 77\% (Kimi~K2.5~GGUF~Q3, zero-shot engine), matching mid-tier cloud performance. On Discussion, the best local models achieve 50--100\% on model building steps and 47--75\% on feedback explanation, but only 0--50\% on error fixing -- a category dominated by long-context prompts that expose memory limits in local deployments. A central contribution of this paper is a systematic analysis of \textit{model type effects} on performance: we compare reasoning vs.\ instruction-tuned architectures, GGUF (llama.cpp) vs.\ MLX (mlx\_lm) backends, and quantization levels (Q3 / Q4\_K\_M / MLX-3bit / MLX-4bit / MLX-6bit) across the same underlying model families. We find that backend choice has larger practical impact than quantization level: mlx\_lm does not enforce JSON schema constraints, requiring explicit prompt-level JSON instructions, while llama.cpp grammar-constrained sampling handles JSON reliably but causes indefinite generation on long-context prompts for dense models. We document the full parameter sweep ($t$, $p$, $k$) for all local models, cleaned timing data (stuck requests excluded), and a practitioner guide for running 671B--123B parameter models on Apple~Silicon.
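The backend finding above implies a practical workaround for mlx_lm deployments: since the backend does not enforce a JSON schema at decode time, the schema must be embedded in the prompt and the reply validated afterwards. The sketch below illustrates that pattern in Python; the function and schema names are hypothetical, not the paper's actual evaluation harness.

```python
import json
import re
from typing import Optional

# Illustrative stand-in for the CLD extraction output schema; the paper's
# real schema is not shown here.
CLD_SCHEMA = {
    "type": "object",
    "required": ["variables", "links"],
    "properties": {
        "variables": {"type": "array"},
        "links": {"type": "array"},
    },
}

def build_json_prompt(task_text: str, schema: dict) -> str:
    """Push the schema into the prompt, since the backend won't enforce it."""
    return (
        f"{task_text}\n\n"
        "Respond with ONLY a JSON object matching this schema, no prose:\n"
        f"{json.dumps(schema, indent=2)}"
    )

def extract_json(reply: str) -> Optional[dict]:
    """Salvage the first JSON object from a possibly chatty model reply."""
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if not match:
        return None
    try:
        obj = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    # Minimal structural check against the schema's required keys.
    if all(key in obj for key in CLD_SCHEMA["required"]):
        return obj
    return None
```

With a grammar-constrained backend such as llama.cpp, the `extract_json` salvage step is unnecessary because the sampler cannot emit non-conforming tokens; the trade-off the abstract reports is that this constraint can stall generation on long-context prompts for dense models.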