Hibiki LT ↔ EN: real-time speech translation.
Teaching a simultaneous speech translation model a language pair it has never heard: Lithuanian and English. On a laptop.
What it is
Kyutai's Hibiki translates speech to speech in real time, but only for the language pairs it was trained on. Lithuanian is not one of them. This project adapts Hibiki to LT ↔ EN end to end: dataset construction, synthetic speech generation, a custom fine-tuning fork, and fully on-device inference.
Everything runs locally on Apple Silicon through MLX. No cloud, no API, no audio leaving the machine. Current throughput is around 12.5 tokens per second on an M4, with a real-time factor of 1.41x in batch mode.
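For context on those numbers, here is a minimal sketch of the arithmetic. It assumes one generation step corresponds to one Mimi frame; Mimi runs at 12.5 frames per second, so each step covers 80 ms of audio. The function names are illustrative and not part of the project. The text above does not say which convention the 1.41x figure follows, so both common ones are shown.

```python
MIMI_FRAME_RATE_HZ = 12.5  # Mimi frame rate: 12.5 frames per second of audio


def audio_seconds(num_steps, frame_rate_hz=MIMI_FRAME_RATE_HZ):
    """Seconds of audio covered by num_steps frames (assumes one step = one frame)."""
    return num_steps / frame_rate_hz


def speed_factor(num_steps, wall_seconds, frame_rate_hz=MIMI_FRAME_RATE_HZ):
    """Audio seconds produced per wall-clock second; above 1.0 is faster than real time."""
    return audio_seconds(num_steps, frame_rate_hz) / wall_seconds


def real_time_factor(num_steps, wall_seconds, frame_rate_hz=MIMI_FRAME_RATE_HZ):
    """Wall-clock seconds spent per audio second; below 1.0 is faster than real time."""
    return wall_seconds / audio_seconds(num_steps, frame_rate_hz)
```

Under this assumption, sustaining 12.5 steps per second means generating exactly one second of audio per second of compute, which is the break-even point for live translation.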
The work
- Curated 44,000 TTS-quality Lithuanian-English sentence pairs from the OPUS-100 corpus (1M raw pairs)
- Built a synthetic stereo speech pipeline: source language on one channel, aligned translation on the other
- Forked and adapted moshi-finetune to support Hibiki's 17-stream token layout (text, target audio, source audio)
- Verified the Mimi neural audio codec holds up against Lithuanian phonemes
- Served the model through a FastAPI web app and a real-time CLI translator
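The stereo layout from the pipeline step can be sketched roughly as follows. This is an illustration, not the project's actual code. It assumes 16-bit mono input samples at 24 kHz (the rate Mimi operates on), puts the source on the left channel and the translation on the right, and uses a hypothetical fixed `delay_samples` offset as a stand-in for whatever alignment the real pipeline applies.

```python
import array
import io
import sys
import wave

SAMPLE_RATE = 24_000  # Mimi operates on 24 kHz audio


def interleave_stereo(source, target, delay_samples=0):
    """Interleave int16 samples: source on the left channel, target on the right.

    The target is shifted right by delay_samples (a hypothetical alignment knob),
    and the shorter channel is padded with silence.
    """
    right = [0] * delay_samples + list(target)
    left = list(source)
    n = max(len(left), len(right))
    left += [0] * (n - len(left))
    right += [0] * (n - len(right))
    out = array.array("h")
    for l, r in zip(left, right):
        out.append(l)
        out.append(r)
    return out


def write_stereo_wav(fileobj, source, target, delay_samples=0, sample_rate=SAMPLE_RATE):
    """Write a 2-channel, 16-bit WAV to a path or a binary file object."""
    frames = interleave_stereo(source, target, delay_samples)
    if sys.byteorder == "big":
        frames.byteswap()  # WAV PCM data is little-endian
    with wave.open(fileobj, "wb") as wav:
        wav.setnchannels(2)
        wav.setsampwidth(2)  # 2 bytes = 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(frames.tobytes())
```

A quick usage check: `write_stereo_wav(io.BytesIO(), [1, 2, 3], [10, 20], delay_samples=2)` produces a four-frame file in which the translation starts two samples after the source.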
What it taught me
Adapting a frontier model to a low-resource language is 10% modelling and 90% building the dataset the model wishes existed.