hibiki-zero
A real-time and multilingual speech translation model
Hibiki-Zero is a real-time speech translation model that listens to someone speaking French, Spanish, Portuguese, or German and produces English speech with low delay. Unlike cascaded pipelines that transcribe speech to text, translate the text, and then synthesize new speech, Hibiki-Zero is designed to work end-to-end with low latency and to preserve the voice characteristics of the original speaker.
It ships as a Python package with a built-in web server. You start the server, open the URL it shows you in a browser, and can immediately start speaking into your microphone to hear the translated output in near real time. It also supports batch processing of existing audio files. A public tunnel option lets you share the server with others over the internet without additional configuration.
Use it when you need live speech translation for meetings, presentations, or interviews in which speakers use one of the four supported input languages. The model has 3 billion parameters and requires an NVIDIA GPU with at least 8 GB of video memory. Installation is a single command.
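Before installing, it can help to confirm that the machine meets the 8 GB video memory requirement. The following is a minimal sketch of such a check and is not part of Hibiki-Zero itself. It assumes the `nvidia-smi` tool that ships with the NVIDIA driver is on the PATH. The helper names (`parse_vram_mib`, `query_vram_mib`, `has_enough_vram`) and the 512 MiB tolerance are hypothetical choices made for illustration; the tolerance exists because nominal 8 GB cards often report slightly less than 8192 MiB.

```python
import shutil
import subprocess

# 8 GB requirement from the text above, expressed in MiB.
MIN_VRAM_MIB = 8 * 1024
# Assumption: allow a small margin, since nominal 8 GB cards
# can report a total somewhat below 8192 MiB.
TOLERANCE_MIB = 512


def parse_vram_mib(smi_output: str) -> list[int]:
    """Parse one integer (MiB) per GPU from nvidia-smi CSV output.

    Lines that are not plain integers (for example "[N/A]") are skipped.
    """
    values = []
    for line in smi_output.splitlines():
        text = line.strip()
        if text.isdigit():
            values.append(int(text))
    return values


def query_vram_mib() -> list[int]:
    """Return total memory per GPU in MiB, or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return parse_vram_mib(result.stdout)


def has_enough_vram(vram_mib: list[int], minimum: int = MIN_VRAM_MIB) -> bool:
    """True if at least one GPU reports roughly `minimum` MiB or more."""
    return any(v >= minimum - TOLERANCE_MIB for v in vram_mib)


if __name__ == "__main__":
    gpus = query_vram_mib()
    if not gpus:
        print("No NVIDIA GPU detected (nvidia-smi missing or failed).")
    elif has_enough_vram(gpus):
        print(f"OK: GPU memory per device (MiB): {gpus}")
    else:
        print(f"Insufficient video memory (MiB): {gpus}")
```

The check only reads total memory, so it does not account for memory already in use by other processes on the same GPU.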