From Syntax to Semantics: Unveiling the Emergence of Chirality in SMILES Translation Models

2026-05-11

Machine Learning
AI summary

The authors studied how chemical language models (CLMs) learn to handle chirality, which matters because mirror-image molecules (enantiomers) can act very differently as drugs. They built Pan-CORE, a family of models that translate SMILES strings, and tracked how chiral information is acquired over the course of training. They found that the models stay stuck at low chiral accuracy for a long stretch and then improve abruptly, suggesting that chirality is genuinely hard to learn and not just a question of model size. By analyzing the model's internal components, the authors showed that this improvement happens mainly in the encoder and involves changes in how the model pays attention to chiral details. Their work helps explain how chemical meaning emerges in these models and how to interpret them better.
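To make the chirality problem concrete: in SMILES, two enantiomers differ only by the chiral tokens "@" and "@@", which encode opposite tetrahedral configurations at a stereocenter. A minimal illustration using RDKit (not from the paper), with alanine as the example molecule:

```python
from rdkit import Chem

# Enantiomers of alanine: the two SMILES strings differ only in the
# chiral token at the stereocenter ("@@" vs "@"), yet they denote
# mirror-image molecules that can behave very differently in the body.
smiles_pair = {
    "L-alanine": "C[C@@H](C(=O)O)N",
    "D-alanine": "C[C@H](C(=O)O)N",
}

for name, smi in smiles_pair.items():
    mol = Chem.MolFromSmiles(smi)
    # FindMolChiralCenters reports the CIP (R/S) label of each stereocenter.
    print(name, smi, Chem.FindMolChiralCenters(mol))
```

A model that treats SMILES purely as surface strings can get every other token right while still confusing these two molecules, which is exactly why chiral tokens make a demanding probe of semantic learning.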

Chemical language models · Chirality · Enantiomers · SMILES · Transformer · Encoder-decoder · Attention mechanism · Latent space · Autoregressive model · Pharmacological activity
Authors
Zehao Li, Yasuhiro Yoshikai, Shumpei Nemoto, Hiroyuki Kusuhara, Tadahaya Mizuno
Abstract
Understanding how chemical language models (CLMs) learn chemical meaning from molecular string representations, rather than only surface-level string patterns, is an important question in chemical representation learning and machine learning for chemistry. Chirality provides a demanding test case: enantiomers can differ greatly in pharmacological activity and toxicity, yet CLMs often struggle to distinguish chiral configurations reliably. Here we present Pan-CORE (Pan-Chemical Omniscale Representation Engine), a family of autoregressive Transformer-based encoder-decoder models for SMILES translation, and use high-temporal-resolution checkpoint analysis to investigate how chiral information is learned during training. Across all tested Pan-CORE variants, we observe a reproducible jump-up transition in which chiral-token accuracy rises abruptly after a long plateau, suggesting that chiral learning stagnation is not explained by model capacity alone and instead reflects the complexity of chiral constraints. Analyses of attention dynamics, residual-stream trajectories, and latent-space geometry support an encoder-centered mechanism in which chiral-token representations undergo transient destabilization and reconstruction, observed as a V-shaped drop and recovery in vector norm and directional stability, together with a clear reorganization of chiral molecular representations in the latent space. Encoder-decoder cross-evaluation further supports the encoder-centered nature of the transition, and targeted attention-head ablation identifies a small set of chiral-sensitive heads whose removal selectively reduces chiral-token accuracy even in the fully trained model. These findings show that SMILES translation can serve as a useful experimental system for mechanistic analysis of semantic emergence in CLMs, with implications for interpretable chemical representation learning.
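The central quantity tracked across checkpoints is chiral-token accuracy. The exact implementation is not given here, but one plausible definition, sketched below, scores predictions only at positions where the reference token is a chiral tag (hypothetical tokenization and toy data, for illustration only):

```python
CHIRAL_TOKENS = {"@", "@@"}

def chiral_token_accuracy(predictions, references):
    """Token accuracy restricted to positions whose reference token is a
    chiral tag. One plausible reading of the metric, not the paper's code."""
    correct = total = 0
    for pred, ref in zip(predictions, references):
        for p, r in zip(pred, ref):
            if r in CHIRAL_TOKENS:
                total += 1
                correct += int(p == r)
    return correct / total if total else float("nan")

# Toy example with a hypothetical tokenization: the model flipped "@@" to "@".
refs  = [["C", "[", "C", "@@", "H", "]", "(", "N", ")", "C", "(", "=", "O", ")", "O"]]
preds = [["C", "[", "C", "@",  "H", "]", "(", "N", ")", "C", "(", "=", "O", ")", "O"]]
print(chiral_token_accuracy(preds, refs))  # 0.0: the only chiral token is wrong
```

Because chiral tags are rare relative to other tokens, a metric restricted to them can plateau near zero even while overall token accuracy climbs, which is consistent with the long stagnation the abstract describes.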
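The targeted attention-head ablation can be sketched in the same spirit. The toy module below (an illustrative assumption in PyTorch, not the Pan-CORE implementation; dimensions chosen arbitrarily) zeroes a chosen head's context vectors before the output projection, a standard way to knock out a single head before re-measuring chiral-token accuracy:

```python
import torch
import torch.nn as nn

class AblatableSelfAttention(nn.Module):
    """Minimal multi-head self-attention whose individual heads can be
    zeroed at inference time (a sketch of a head-ablation probe)."""

    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor, ablate_heads=()) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split each projection into (batch, heads, tokens, d_head).
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
                   for z in (q, k, v))
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        ctx = attn @ v                       # per-head context vectors
        for h in ablate_heads:
            ctx[:, h] = 0.0                  # ablate: zero this head's output
        return self.out(ctx.transpose(1, 2).reshape(b, t, d))

x = torch.randn(2, 16, 64)                   # (batch, tokens, d_model)
layer = AblatableSelfAttention()
delta = (layer(x) - layer(x, ablate_heads=[1])).abs().max()
print(f"max output change from ablating head 1: {delta.item():.4f}")
```

In the paper's setting, the analogous knockout is applied to a trained SMILES translation model head by head; a head counts as chiral-sensitive when its removal selectively degrades chiral-token accuracy while leaving other tokens largely intact.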