Deterministic Integrity Gates for LLM-Assisted Clinical Manuscript Preparation: An Auditable Biomedical Informatics Architecture

2026-06-08 · Artificial Intelligence

Artificial Intelligence, Digital Libraries
AI summary

The authors describe MedSci Skills, an open-source toolkit for checking the integrity of clinical research manuscripts drafted with LLM assistance. The approach breaks the workflow into self-contained steps, halts at the first failed check, and uses deterministic, repeatable checks wherever they suffice. The authors evaluated it on three public-dataset pipelines, and in a separate seeded-defect test it caught all 27 injected errors while a generic single-prompt LLM reviewer caught 11. The result is an auditable trail that makes it easier for humans to verify LLM-assisted manuscripts; it is evidence of feasibility and reproducibility, not a claim that the output matches human quality. The toolkit is openly available under the MIT license.

Large Language Models, Clinical Research Manuscripts, Verification, Deterministic Checks, Reproducibility, Self-contained Skills, Reporting Guidelines, MedSci Skills, Integrity Gates, Open-source Toolkit
Authors
Yoojin Nam, Jinhoon Jeong, Namkug Kim
Abstract
Objective. Large language models (LLMs) increasingly draft clinical research manuscripts, but their fluency can hide fabricated citations, numbers that drift from source tables, and unmet reporting-guideline items. Existing tools generate text without verifying it, and self-critique inherits the blind spots that produce confident fabrication. We describe an architecture that pairs generation with verification.

Methods. The design rests on three principles: decompose the workflow into self-contained skills, gate every stage transition with halt-on-failure, and resolve each integrity question with the cheapest sufficient mechanism, meaning a deterministic, re-executable check where one suffices and a prose-level probe only where interpretation is unavoidable. This determinism-where-possible split, organized as an integrity-gate taxonomy, is the core contribution. It is realized as MedSci Skills, an open-source toolkit of 43 skills coordinated by one orchestrator, whose deterministic tier comprises 21 standard-library detectors. We evaluate it on three reproducible public-dataset pipelines (STARD, PRISMA, STROBE) and a seeded-defect ablation.

Results. Across the three pipelines, every content-hash manifest verified clean and the gates surfaced real defects. On the same 27 injected defects, the deterministic gates detected all 27 with no false positives on the matched clean fixtures, whereas a generic single-prompt LLM reviewer detected 11; its misses were concentrated in generated-code, bibliography-internal, and style defects that the prose does not expose.

Conclusion. Determinism-where-possible verification yields an auditable, re-executable trail that exposes the evidence a human needs to check an LLM-assisted manuscript. This is feasibility and reproducibility evidence, not a claim of human-competitive quality, which a separate blinded study addresses. MedSci Skills is MIT-licensed and archived (v3.8.0).
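
To make the gating idea concrete, the following is a minimal sketch of halt-on-failure gates built from two deterministic checks: a content-hash manifest comparison and a test that every number in the prose appears in the source table. It is an illustration under stated assumptions, not the MedSci Skills implementation. All names (`Gate`, `GateFailure`, `verify_manifest`, `numbers_drifted`, `run_gates`, `demo`) are hypothetical, SHA-256 is an assumed choice of hash, and the exact-string number matching is a deliberate simplification. Like the deterministic tier described in the abstract, it uses only the Python standard library.

```python
import hashlib
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


class GateFailure(Exception):
    """Raised to halt the pipeline at the first failing gate."""


@dataclass
class Gate:
    # Hypothetical structure: a name plus a check that returns findings.
    # An empty list of findings means the gate passes.
    name: str
    check: Callable[[], List[str]]


def content_hash(data: bytes) -> str:
    """SHA-256 hex digest of an artifact (assumed hash; any stable hash works)."""
    return hashlib.sha256(data).hexdigest()


def verify_manifest(manifest: Dict[str, str], artifacts: Dict[str, bytes]) -> List[str]:
    """Compare recorded hashes against freshly computed ones."""
    findings = []
    for name, expected in manifest.items():
        if name not in artifacts:
            findings.append(f"{name}: missing artifact")
        elif content_hash(artifacts[name]) != expected:
            findings.append(f"{name}: hash mismatch")
    return findings


NUMBER = re.compile(r"\d+(?:\.\d+)?")


def numbers_drifted(prose: str, table_values: Set[str]) -> List[str]:
    """Flag each numeric token in the prose that is absent from the source table.

    Simplification: tokens are compared as exact strings, so '0.9' and '0.90'
    would be treated as different values.
    """
    return [f"'{tok}' not in source table"
            for tok in NUMBER.findall(prose) if tok not in table_values]


def run_gates(gates: List[Gate]) -> List[str]:
    """Run gates in order and halt on the first one that reports findings."""
    passed = []
    for gate in gates:
        findings = gate.check()
        if findings:
            raise GateFailure(f"{gate.name}: " + "; ".join(findings))
        passed.append(gate.name)
    return passed


def demo() -> str:
    """Toy run: the manifest is clean, but the prose misreports specificity."""
    table = b"sensitivity,0.91\nspecificity,0.87\n"
    manifest = {"results.csv": content_hash(table)}
    artifacts = {"results.csv": table}
    prose = "Sensitivity was 0.91 and specificity was 0.78."
    gates = [
        Gate("manifest", lambda: verify_manifest(manifest, artifacts)),
        Gate("numeric-consistency", lambda: numbers_drifted(prose, {"0.91", "0.87"})),
    ]
    try:
        run_gates(gates)
    except GateFailure as err:
        return str(err)
    return "all gates passed"
```

Calling `demo()` passes the manifest gate and then halts at the numeric-consistency gate, because the value 0.78 in the prose does not appear in the source table. Both checks are pure functions of their inputs, so rerunning them on the same artifacts reproduces the same findings, which is the property that makes a gate trail auditable.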