Evidence-Gated LLM Priors for Multi-Objective Bayesian Optimization

2026-06-01 • Artificial Intelligence

Artificial Intelligence, Machine Learning

AI summary

The authors study how to use large language models (LLMs) as expert advisors in optimization problems with multiple objectives, where an LLM's advice may help on some objectives and mislead on others. They propose a system that tracks how trustworthy each LLM expert is for each objective and updates these trust levels over time from observed results. The approach also includes a gate that decides whether to use the LLM prior together with its self-reported confidence, use the prior without confidence, or ignore the prior entirely. On molecule optimization tasks, the method is more robust than relying on fixed LLM priors, but raw LLM confidence scores are not reliably helpful. The authors also report a negative result: a margin-based selection strategy should account for the acquisition step of the optimizer rather than only the one-step error of the prior.

Large Language Models, Multi-objective Optimization, Bayesian Optimization, Expert Priors, Reputation Mechanism, Confidence Calibration, Molecule Optimization, Counterfactual Gate, Discrete Optimization

Authors

Jiangyu Chen, Banyi

Abstract

Large language models (LLMs) are increasingly used as heuristic advisors for black-box optimization, yet their suggestions and self-reported confidence are not necessarily calibrated to downstream objective values. This issue becomes more pronounced in multi-objective Bayesian optimization, where different objectives may require different expert knowledge and where an LLM expert can be useful for one objective but misleading for another. We study how to use LLM-generated expert priors in discrete multi-objective Bayesian optimization without blindly trusting them. We propose an objective-wise reputation-market mechanism that treats each expert-objective pair as a falsifiable prior source. Expert weights are updated online from observed objective feedback, discounted over time, and gated by market-level trust. We then introduce a decoupled counterfactual gate that can use the LLM prior without confidence, use it with confidence, or abstain from the LLM prior entirely. Across controlled synthetic stress tests and three molecule optimization benchmarks with \qwenflash{}-generated expert priors, we find that dynamic objective-wise calibration improves robustness over fixed LLM priors. However, raw LLM confidence is not reliably beneficial: on ESOL, confidence is positively correlated with prediction error; on FreeSolv, confidence can help; and on Lipophilicity, ignoring confidence remains strongest. Our fixed three-arm counterfactual gate improves over the first counterfactual variant on ESOL and FreeSolv, while an attempted margin portfolio exposes a useful negative result: margin selection should be acquisition-aware rather than based only on one-step prior error.
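The abstract does not give the update rule, so the following is only a minimal sketch of what an objective-wise, time-discounted reputation update and a three-arm gate could look like. It assumes a multiplicative-weights style update; the function names, the parameters `eta` and `discount`, and the error-based arm selection are all hypothetical and are not taken from the paper. The market-level trust gate is omitted.

```python
import math

def update_reputation(weights, errors, eta=1.0, discount=0.9):
    """Assumed update rule (not from the paper).

    weights: dict mapping (expert, objective) -> positive weight.
    errors:  dict mapping (expert, objective) -> non-negative prediction
             error observed this round for that expert on that objective.
    """
    updated = {}
    for key, w in weights.items():
        # Discounting shrinks the old log-weight toward 0, so past
        # evidence fades; the new error then penalises the pair.
        log_w = discount * math.log(w) - eta * errors.get(key, 0.0)
        updated[key] = math.exp(log_w)
    return updated

def normalise_per_objective(weights):
    """Rescale weights so that they sum to 1 within each objective."""
    totals = {}
    for (_, objective), w in weights.items():
        totals[objective] = totals.get(objective, 0.0) + w
    return {key: w / totals[key[1]] for key, w in weights.items()}

def three_arm_gate(err_without_conf, err_with_conf, err_abstain):
    """Pick the arm with the lowest estimated counterfactual error.

    Hypothetical selection rule: the three arms mirror the abstract
    (prior without confidence, prior with confidence, abstain).
    """
    arms = {
        "prior_without_confidence": err_without_conf,
        "prior_with_confidence": err_with_conf,
        "abstain": err_abstain,
    }
    return min(arms, key=arms.get)

# Two hypothetical experts scored on one objective.
weights = {("expert_a", "solubility"): 1.0, ("expert_b", "solubility"): 1.0}
errors = {("expert_a", "solubility"): 0.0, ("expert_b", "solubility"): 1.0}
weights = normalise_per_objective(update_reputation(weights, errors))
chosen_arm = three_arm_gate(0.2, 0.5, 0.3)
```

Because weights are keyed by expert-objective pair, an expert that is penalised on one objective keeps its standing on the others, which is the property the abstract attributes to objective-wise calibration. The gate here compares one-step errors only; the abstract's negative result suggests that such a purely error-based criterion may be insufficient and that selection should be acquisition-aware.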
