A Pocket Offline Model for Simultaneous Speech Translation as CUNI Submission to IWSLT 2026

2026-06-02, Computation and Language

AI summary

The authors built a simultaneous speech translation system on top of Canary, an existing offline model that translates speech directly into text in another language. They use a policy called AlignAtt to decide when enough audio has arrived to emit the next part of the translation, so output appears while the speaker is still talking. The model supports 25 source and 25 target languages and has only 1B parameters, so it needs less computing power than larger systems, and it outperforms similarly sized baselines in computationally unaware simulations. The system was submitted to IWSLT 2026 for Czech to English and for English to German and Italian.

simultaneous translation, speech-to-text translation, AlignAtt, Canary model, IWSLT, low-latency translation, multilingual models, parameter size, offline translation, computational efficiency
Authors
Aziz Sharipov Ortega, Dominik Macháček
Abstract
We implement simultaneous translation capability with the offline direct speech-to-text translation model Canary, using the state-of-the-art policy AlignAtt, and submit it to the IWSLT 2026 Simultaneous Speech Translation Shared Task for Czech to English and for English to German and Italian. The strengths of our system are: (1) high translation quality, outperforming similarly sized baselines in both low- and high-latency regimes in computationally unaware simulations; (2) low computational requirements, as the model has only 1B parameters; (3) multilinguality, with support for 25 source and 25 target languages.
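
The abstract names AlignAtt without describing it. As commonly described, the policy inspects the decoder's cross-attention over the audio received so far: a candidate token is emitted only if the audio frame it attends to most strongly lies outside the last few frames, and emission stops at the first token that fails this check. The following is a minimal sketch of that rule under stated assumptions, not the authors' implementation. The function name `alignatt_emit_count`, the parameter `num_guard_frames`, and the plain nested-list attention format are all hypothetical; how attention is extracted from Canary (which layer, which heads, how they are averaged) is not specified in the abstract and is left out.

```python
from typing import Sequence


def alignatt_emit_count(
    cross_attention: Sequence[Sequence[float]],
    num_guard_frames: int,
) -> int:
    """Return how many leading candidate tokens an AlignAtt-style rule would emit.

    cross_attention[i][t] is the (hypothetical, already aggregated) attention
    weight of candidate token i on audio frame t of the audio received so far.
    num_guard_frames is the number of most recent frames treated as unstable.
    """
    emitted = 0
    for weights in cross_attention:
        num_frames = len(weights)
        if num_frames == 0:
            break  # no audio yet, nothing can be emitted
        # Frame this token attends to most strongly.
        most_attended = max(range(num_frames), key=lambda t: weights[t])
        # If that frame is among the last `num_guard_frames`, the token
        # depends on audio that may still change: stop and wait for more input.
        if most_attended >= num_frames - num_guard_frames:
            break
        emitted += 1
    return emitted


# Toy example: 4 candidate tokens, 4 audio frames received so far.
toy_attention = [
    [0.70, 0.20, 0.05, 0.05],  # peaks at frame 0
    [0.10, 0.60, 0.20, 0.10],  # peaks at frame 1
    [0.05, 0.05, 0.20, 0.70],  # peaks at frame 3 (most recent)
    [0.60, 0.20, 0.10, 0.10],  # peaks at frame 0, but comes after a stopped token
]
emit_now = alignatt_emit_count(toy_attention, num_guard_frames=2)
```

In this sketch a larger `num_guard_frames` makes the rule more conservative (fewer tokens emitted per chunk, higher latency), while a value of 0 emits every candidate token, which is one way the low- and high-latency regimes mentioned above could be reached with a single model.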