Yeti: A compact protein structure tokenizer for reconstruction and multi-modal generation

2026-05-11

Artificial Intelligence
AI summary

The authors developed Yeti, a new method that converts complex 3D protein structures into discrete tokens suitable for training machine learning models. Unlike previous methods, which focused mainly on reconstructing structures, Yeti balances reconstruction accuracy with the ability to generate new protein designs. The authors showed that Yeti uses its token space efficiently and achieves good reconstruction with fewer parameters. Using Yeti, they trained a model from scratch that generates realistic protein sequences and structures together, performing on par with much larger models. This suggests Yeti is a useful tool for building compact models that work with multiple types of protein data.

protein structure, tokenizer, transformer, multimodal learning, quantization, codebook utilization, sequence-structure cogeneration, flow matching, ESM3, reconstruction accuracy
Authors
Nabin Giri, Steven Farrell, Kristofer E. Bouchard
Abstract
Multimodal models that jointly reason over protein sequences, structures, and function annotations within a unified representation hold immense potential for integrating multimodal data and generating new proteins with designed functional properties. To utilize transformer architectures, such models require a tokenizer that converts protein structure from continuous atomic coordinates into discrete representations suitable for scalable multimodal training. The quality of such models is fundamentally upper-bounded by the fidelity and expressiveness of the underlying structure tokenization. However, existing tokenizers prioritize reconstruction over generative ability. To address this gap, we introduce Yeti, a simple and compact protein structure tokenizer based on lookup-free quantization and trained end-to-end with a flow matching objective for multimodal learning. Compared to existing models, Yeti generally achieves the best codebook utilization and token diversity, and the second-best reconstruction accuracy (with 10x fewer parameters than ESM3) across diverse datasets. To validate Yeti's generative capability, we trained a compact multimodal model jointly over its structure tokens and amino acid sequences entirely from scratch, with no pretrained initialization. The resulting multimodal model generates plausible structures under unconditional cogeneration of protein sequences and structures, achieving results comparable to models 10x its size. Together, these results demonstrate that Yeti is a compact and expressive protein structure tokenizer suitable for training multimodal models that cogenerate highly plausible sequences and structures.
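The abstract names lookup-free quantization as the tokenizer's quantization scheme but includes no code. As a rough, non-authoritative sketch of that idea (following the MAGVIT-v2-style formulation of Yu et al., in which each latent dimension is binarized so the codebook is implicit rather than a learned embedding table), one might write something like the following in PyTorch. The function name and sign convention here are illustrative assumptions, not Yeti's actual implementation.

```python
import torch

def lfq_quantize(z: torch.Tensor):
    """Lookup-free quantization sketch: binarize each latent dimension
    independently, so the codebook is the implicit set {-1, +1}^d with
    2^d entries and no learned embedding table.

    z: (..., d) continuous encoder output.
    Returns quantized latents and integer token indices.
    """
    q = torch.where(z > 0, 1.0, -1.0)  # per-dimension sign code
    # Straight-through estimator: quantize on the forward pass, let
    # gradients flow through z unchanged on the backward pass.
    q = z + (q - z).detach()
    # Integer token index read off the sign bits of the d dimensions.
    bits = (z > 0).long()
    weights = 2 ** torch.arange(z.shape[-1], device=z.device)
    indices = (bits * weights).sum(dim=-1)
    return q, indices
```

One appeal of this scheme is that vocabulary size grows exponentially with the latent width at no parameter cost, e.g. d = 12 latent channels give a 2^12 = 4096-token vocabulary, which is consistent with the high codebook utilization the abstract reports. Similarly, the end-to-end flow matching objective mentioned for training could, under standard conditional flow matching (Lipman et al., 2023), look like the sketch below; the decoder signature `decoder(xt, t, tokens)` is a hypothetical placeholder, since the paper's architecture is not described here.

```python
import torch

def flow_matching_loss(decoder, tokens, coords):
    """Conditional flow matching loss sketch for a structure decoder:
    regress the velocity field of a straight path from noise to the
    target coordinates, conditioned on the structure tokens.

    tokens: (B, L) discrete structure tokens conditioning the decoder.
    coords: (B, L, 3) target atomic coordinates.
    """
    x0 = torch.randn_like(coords)                    # noise sample
    t = torch.rand(coords.shape[0], 1, 1, device=coords.device)
    xt = (1 - t) * x0 + t * coords                   # linear interpolation path
    v_target = coords - x0                           # constant velocity of the path
    v_pred = decoder(xt, t, tokens)                  # hypothetical decoder signature
    return ((v_pred - v_target) ** 2).mean()
```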