2025 to now · personal research · now building

Hibiki LT ↔ EN: real-time speech translation.

Teaching a simultaneous speech translation model a language pair it has never heard: Lithuanian and English. On a laptop.

What it is

Kyutai's Hibiki translates speech to speech in real time, but only for the language pairs it was trained on. Lithuanian is not one of them. This project adapts Hibiki to LT ↔ EN end to end: dataset construction, synthetic speech generation, a custom fine-tuning fork, and fully on-device inference.

Everything runs locally on Apple Silicon through MLX. No cloud, no API, no audio leaving the machine. Current throughput is around 12.5 tokens per second on an M4, with a real-time factor of 1.41x in batch mode.
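For readers unfamiliar with the metric, here is a minimal sketch of how a real-time factor can be computed from codec frames and wall-clock time. It assumes the "audio seconds per wall-clock second" convention (values above 1 mean faster than real time); some tools report the inverse. The 12.5 Hz constant is Mimi's frame rate, while the function names and the example figures are hypothetical and are not measurements from this project.

```python
MIMI_FRAME_RATE_HZ = 12.5  # Mimi emits one frame of codes per 80 ms of audio


def audio_seconds(num_frames: int, frame_rate_hz: float = MIMI_FRAME_RATE_HZ) -> float:
    """Duration of audio represented by `num_frames` codec frames."""
    return num_frames / frame_rate_hz


def real_time_factor(num_frames: int, wall_seconds: float,
                     frame_rate_hz: float = MIMI_FRAME_RATE_HZ) -> float:
    """Audio seconds produced per wall-clock second (assumed convention)."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds(num_frames, frame_rate_hz) / wall_seconds


if __name__ == "__main__":
    # Hypothetical run: 250 frames (20 s of audio) generated in 16 s.
    print(real_time_factor(250, 16.0))
```

Under this convention a streaming translator needs a sustained factor of at least 1.0 to keep up with a live speaker.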

The work

Curated 44,000 TTS-quality Lithuanian-English sentence pairs from the OPUS-100 corpus (1M raw pairs)
Built a synthetic stereo speech pipeline: source language on one channel, aligned translation on the other
Forked and adapted moshi-finetune to support Hibiki's 17-stream token layout (text, target audio, source audio)
Verified the Mimi neural audio codec holds up against Lithuanian phonemes
Served the model through a FastAPI web app and a real-time CLI translator
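The stereo layout from the pipeline step above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the project's actual pipeline: the sine tones stand in for TTS output, the helper names are hypothetical, the 24 kHz rate is Mimi's input rate, and the shorter channel is simply padded with silence at the end. The word-level alignment between source and translation is not shown.

```python
import array
import math
import wave

SAMPLE_RATE = 24_000  # Mimi operates on 24 kHz audio; change if your TTS differs


def tone(freq_hz: float, seconds: float, sample_rate: int = SAMPLE_RATE) -> list:
    """Placeholder for a synthesized utterance: a 16-bit mono sine tone."""
    n = int(seconds * sample_rate)
    amplitude = 0.3 * 32767
    return [int(amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate))
            for i in range(n)]


def interleave_stereo(left: list, right: list) -> array.array:
    """Interleave two mono sample lists as L, R, L, R, ... with zero padding."""
    n = max(len(left), len(right))
    left = left + [0] * (n - len(left))      # pad the shorter channel with silence
    right = right + [0] * (n - len(right))
    frames = array.array("h")                # signed 16-bit samples
    for l_sample, r_sample in zip(left, right):
        frames.append(l_sample)
        frames.append(r_sample)
    return frames


def write_stereo_wav(path: str, left: list, right: list,
                     sample_rate: int = SAMPLE_RATE) -> None:
    """Write source speech to the left channel and the translation to the right."""
    frames = interleave_stereo(left, right)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(2)
        wav.setsampwidth(2)                  # 2 bytes = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(frames.tobytes())


if __name__ == "__main__":
    source_lt = tone(220.0, 1.0)   # stand-in for the Lithuanian utterance
    target_en = tone(440.0, 1.5)   # stand-in for the English translation
    write_stereo_wav("pair_0001.wav", source_lt, target_en)
```

Keeping both languages in one file means a training example cannot lose its pairing, and the two channels stay sample-aligned by construction.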

Concepts

Speech-to-speech · Fine-tuning · MLX · Mimi codec · Apple Silicon · FastAPI · Data curation

What it taught me

Adapting a frontier model to a low-resource language is 10% modelling and 90% building the dataset the model wishes existed.