OmniVoice
High-Quality Voice Cloning TTS for 600+ Languages
OmniVoice is a text-to-speech system that converts written text into spoken audio in more than 600 languages. Its main capability is zero-shot voice cloning: given a short audio recording (3 to 10 seconds) of someone speaking, OmniVoice generates new speech in that person's voice without any additional training.
There are three ways to use it. Voice cloning takes a reference audio clip and, optionally, a transcript of what is said in that clip; the model then reads the new text in the cloned voice. Voice design lets you describe the voice you want with text attributes (female, low pitch, British accent, child, whisper, and so on), with no reference recording needed. The third mode, Auto Voice, picks a voice on its own when you provide neither a reference clip nor a description.
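The relationship between the inputs and the three modes can be sketched as a small decision function. This is only an illustration of the description above, not OmniVoice's API: the function name, parameter names, and returned labels are hypothetical, and giving cloning precedence when both a clip and a description are supplied is an assumption the source does not address.

```python
from typing import Optional


def select_mode(reference_audio: Optional[str] = None,
                voice_description: Optional[str] = None) -> str:
    """Map the supplied inputs to one of the three modes described above.

    Hypothetical helper for illustration; not part of the OmniVoice package.
    """
    if reference_audio is not None:
        # A reference clip is present: clone that voice.
        # Assumption: the clip wins if a description is also given.
        return "voice_cloning"
    if voice_description is not None:
        # No clip, but text attributes such as "female, low pitch".
        return "voice_design"
    # Neither input: the model picks a voice on its own.
    return "auto_voice"


if __name__ == "__main__":
    print(select_mode(reference_audio="speaker.wav"))
    print(select_mode(voice_description="female, low pitch, British accent"))
    print(select_mode())
```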
Beyond basic voice generation, OmniVoice supports inline non-verbal sounds inserted directly into the text. You can write tags like [laughter] or [sigh] inside a sentence, and the model will produce those sounds at the appropriate moment. There is also pronunciation control for Chinese text using pinyin notation.
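To make the inline tag idea concrete, here is a minimal sketch of how bracketed tags could be separated from ordinary text before synthesis. It is not taken from OmniVoice: the function name, the regular expression, and the assumption that tags are lowercase words in square brackets are all illustrative, based only on the [laughter] and [sigh] examples above.

```python
import re
from typing import List, Tuple

# Assumption: a tag is a lowercase word (underscores allowed) in square brackets.
TAG_PATTERN = re.compile(r"\[([a-z_]+)\]")


def split_tags(text: str) -> List[Tuple[str, str]]:
    """Split text into ("text", ...) and ("tag", ...) segments, in order."""
    segments: List[Tuple[str, str]] = []
    pos = 0
    for match in TAG_PATTERN.finditer(text):
        # Keep any non-empty text that precedes this tag.
        chunk = text[pos:match.start()].strip()
        if chunk:
            segments.append(("text", chunk))
        segments.append(("tag", match.group(1)))
        pos = match.end()
    # Keep whatever text follows the last tag.
    tail = text[pos:].strip()
    if tail:
        segments.append(("text", tail))
    return segments


if __name__ == "__main__":
    for kind, value in split_tags("That was unexpected [laughter] but fine."):
        print(kind, value)
```

Keeping the segments in order preserves the position of each sound relative to the surrounding words, which is what lets a tag land "at the appropriate moment" in the sentence.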
The system is built on a diffusion language model architecture. The README states that inference runs at a real-time factor (RTF) of 0.025, that is, 0.025 seconds of compute per second of generated audio, or about 40 seconds of audio for every second of compute time. Output audio is sampled at 24 kHz.
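The arithmetic behind those figures is easy to check. The sketch below uses the standard definition of real-time factor (compute time divided by audio duration) together with the 0.025 and 24 kHz values quoted above; the function names are illustrative and not part of OmniVoice, and the numbers are the README's claim rather than a measurement.

```python
def real_time_factor(compute_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent computing / duration of audio produced."""
    return compute_seconds / audio_seconds


def audio_per_compute_second(rtf: float) -> float:
    """Seconds of audio generated per second of compute (inverse of RTF)."""
    return 1.0 / rtf


def sample_count(audio_seconds: float, sample_rate: int = 24_000) -> int:
    """Number of samples in a clip at the given sample rate (24 kHz default)."""
    return round(audio_seconds * sample_rate)


if __name__ == "__main__":
    rtf = 0.025  # value quoted from the README
    print(round(audio_per_compute_second(rtf), 3))  # seconds of audio per compute second
    print(real_time_factor(1.0, 40.0))              # 1 s of compute for 40 s of audio
    print(sample_count(40.0))                       # samples in 40 s at 24 kHz
```

An RTF below 1 means generation is faster than playback; the smaller the value, the more headroom there is for batch or streaming use.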
Installation is via pip or uv. For people who prefer not to write code, there is a local web interface you can launch with a single command, a hosted demo on HuggingFace Spaces, and a Google Colab notebook. The pretrained model weights are available on HuggingFace. The package runs on NVIDIA GPUs and Apple Silicon.