ENSEMBITS: an alphabet of protein conformational ensembles
2026-05-13 • Machine Learning
Machine Learning, Artificial Intelligence
AI summary
The authors created Ensembits, a tokenizer that summarizes how proteins move and change shape over time instead of capturing only one fixed structure. Ensembits handles challenges such as ensembles that contain different numbers of conformations and the scarcity of dynamics data by training a Residual VQ-VAE with a frame distillation objective. It outperforms related methods at predicting per-residue motion (RMSF) and matches or beats static tokenizers on a range of downstream tasks, even with less pretraining data. Importantly, Ensembits can predict dynamics tokens from a single predicted protein structure, which makes it useful for bringing protein flexibility into language models and design.
Protein structure, Tokenizers, Conformational ensembles, Molecular dynamics, Residual VQ-VAE, RMSF prediction, Protein language modeling, Structure-function relationship, Zero-shot prediction, Protein dynamics
Authors
Kaiwen Shi, Carlos Oliver
Abstract
Protein structure tokenizers (PSTs) are workhorses in protein language modeling, function prediction, and evolutionary analysis. However, existing PSTs capture only the local geometry of static structures and miss the correlated motions and alternative conformational states revealed by protein ensembles. Here we introduce Ensembits, the first tokenizer of protein conformational ensembles. Ensembits addresses challenges inherent to tokenizing dynamics: deriving informative geometric descriptors across conformations, permutation-invariant encoding of variable-size ensembles, and overcoming sparsity in dynamics data. Trained as a Residual VQ-VAE with a frame distillation objective on a large molecular dynamics corpus, Ensembits outperforms all related methods on RMSF prediction and is the strongest standalone structural tokenizer on a token-conditioned ANOVA test of per-residue motion amplitude. Ensembits further matches or exceeds static tokenizers on EC, GO, binding site/affinity prediction, and zero-shot mutation-effect prediction despite using far less pretraining data. Notably, the distillation objective enables Ensembits to predict dynamics tokens from a single predicted structure, which alleviates the sparsity of dynamics data. As the field moves from static structure prediction toward ensemble generation, Ensembits offers the discrete vocabulary needed to bring dynamics into protein language modeling and design.
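For readers unfamiliar with residual vector quantization, the quantization step at the core of a Residual VQ-VAE can be sketched as follows. This is a generic NumPy illustration, not the authors' implementation: the function name `residual_quantize`, the random codebooks, and all shapes are assumptions chosen for the example, and the encoder, decoder, and frame distillation objective are omitted entirely.

```python
import numpy as np


def residual_quantize(x, codebooks):
    """Generic residual vector quantization (illustrative, hypothetical helper).

    x         : array of shape (n, d), e.g. one embedding per residue.
    codebooks : list of arrays, each of shape (K, d), one per quantization level.
    Returns integer codes of shape (n, levels) and the reconstruction (n, d).
    """
    residual = np.asarray(x, dtype=float).copy()
    recon = np.zeros_like(residual)
    codes = []
    for cb in codebooks:
        # Squared Euclidean distance from each residual to each codeword: (n, K).
        d2 = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        idx = d2.argmin(axis=1)      # nearest codeword at this level
        chosen = cb[idx]
        recon += chosen              # accumulate the reconstruction
        residual -= chosen           # the next level quantizes what is left over
        codes.append(idx)
    return np.stack(codes, axis=1), recon


# Toy data: 5 "residues" with 4-dimensional embeddings and 3 levels of 8 codewords.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))
codebooks = [rng.normal(size=(8, 4)) * scale for scale in (1.0, 0.5, 0.25)]
codes, recon = residual_quantize(x, codebooks)
```

Each row of `codes` is the tuple of discrete indices assigned to one input vector, one index per level. In a trained model the codebooks are learned jointly with the encoder and decoder rather than drawn at random as in this toy setup.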