Controllable Sim Agents with Behavior Latents

2026-07-02 · Robotics

Robotics, Machine Learning
AI summary

The authors developed a traffic simulation model called CNeVA that can mimic real driving behavior while allowing users to control specific driving traits such as speed or safety. To learn from sparse reward signals, they introduce smooth eligibility gates in place of strict cutoffs. Tested on a real-world driving dataset, the model performs realistically and offers control options that other models do not. They also show that steering results must be checked against physical-plausibility guardrails, because a model that games the reward can otherwise produce misleading metrics.

traffic simulation, variational agents, latent variable models, trajectory generation, reward signals, eligibility gates, classifier-free guidance, imitation learning, Waymo Open Motion Dataset
Authors
Juanwu Lu, Junyu Zhu, Ziran Wang
Abstract
Realistic traffic simulation requires agents that imitate logged behavior and can also be steered along interpretable axes. Such controllability enables engineers to isolate variables, reproduce specific edge cases, and test autonomous systems without real-world risk. We introduce Controllable Neural Variational Agents (CNeVA), a controllable simulated-agent framework that learns to infer a per-agent Gaussian behavior latent from per-channel discounted returns via a closed-form conjugate variational update. The latent conditions a rectified-flow trajectory generator trained on a mixed channel-mask curriculum for classifier-free guidance. To tackle the scarcity of reward signals, we propose soft eligibility gates that replace hard binary thresholds with smooth exponential decay, preserving the gradient signal for near-threshold agents. On the Waymo Open Motion Dataset, CNeVA attains competitive realism on the benchmark while exposing per-channel controllability that the higher-ranked imitation models lack. Speed- and acceleration-based steering produces monotone responses without stall-induced reward hacking. With soft eligibility, safety controllability is monotone and substantial. We also achieve steerable map compliance under a context-residual return measure. Furthermore, our experiments demonstrate that steering metrics must be read alongside physical-plausibility guardrails to avoid reward-hacking confounds.
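The abstract does not give the exact form of the soft eligibility gate or of the conjugate update, so the following is only a minimal sketch under stated assumptions. It contrasts a hard binary threshold with an exponentially decaying gate, and then shows how such gate values could weight a standard Gaussian-prior, Gaussian-likelihood update on a scalar behavior latent. The function names, the `distance`, `threshold`, and `scale` parameters, the known `noise_var`, and the use of gates as observation weights are all hypothetical choices made for illustration, not CNeVA's actual formulation.

```python
import math


def discounted_return(rewards, gamma=0.95):
    """Discounted sum of one reward channel: sum_t gamma**t * r_t."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def hard_gate(distance, threshold):
    """Binary eligibility: 1 inside the threshold, 0 outside."""
    return 1.0 if distance <= threshold else 0.0


def soft_gate(distance, threshold, scale=1.0):
    """Assumed soft eligibility: 1 inside the threshold, then smooth
    exponential decay with the excess distance beyond it."""
    excess = max(0.0, distance - threshold)
    return math.exp(-excess / scale)


def gaussian_posterior(prior_mean, prior_var, returns, weights, noise_var):
    """Closed-form conjugate update for a scalar Gaussian latent.

    Each observed return is treated as a noisy reading of the latent with
    known variance noise_var; weights (here, gate values) scale how much
    each reading contributes. This weighting is an illustrative assumption.
    """
    precision = 1.0 / prior_var + sum(weights) / noise_var
    weighted_sum = sum(w * y for w, y in zip(weights, returns))
    mean = (prior_mean / prior_var + weighted_sum / noise_var) / precision
    return mean, 1.0 / precision


if __name__ == "__main__":
    # Two agents: one inside the threshold, one slightly outside it.
    distances = [1.5, 2.5]
    threshold = 2.0
    channel_returns = [
        discounted_return([1.0, 1.0, 1.0], gamma=0.5),
        discounted_return([0.0, 1.0, 0.0], gamma=0.5),
    ]

    hard = [hard_gate(d, threshold) for d in distances]
    soft = [soft_gate(d, threshold, scale=1.0) for d in distances]

    # With hard gates the near-threshold agent contributes nothing;
    # with soft gates it still contributes with a reduced weight.
    print("hard gates:", hard)
    print("soft gates:", soft)
    print("posterior (hard):", gaussian_posterior(0.0, 1.0, channel_returns, hard, 1.0))
    print("posterior (soft):", gaussian_posterior(0.0, 1.0, channel_returns, soft, 1.0))
```

In this sketch the near-threshold agent is dropped entirely by the hard gate but keeps a nonzero weight under the soft gate, which is the intuition behind preserving signal for near-threshold agents. In a differentiable framework the same exponential form would also give a nonzero gradient with respect to the distance outside the threshold.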