ViZipVoice
A Vietnamese text-to-speech model that clones a speaker's voice from a short audio clip and generates natural-sounding Vietnamese speech in that voice from any text you provide.
ViZipVoice is a Vietnamese text-to-speech model built by fine-tuning ZipVoice, an existing open-source speech synthesis system. Its main capability is zero-shot voice cloning: you give it a short audio recording of a speaker together with a transcript of that recording, and it generates new Vietnamese speech in the same voice from any text you provide.
The model was trained on roughly 7,000 hours of audio: about 6,500 hours of Vietnamese and 500 hours of English. It operates at a 24 kHz sample rate and uses a character-level tokenizer with 244 tokens covering Vietnamese characters (including all accented forms), digits, and punctuation. The system maps characters directly to tokens rather than converting text to phonemes first. A Vietnamese text normalization step runs automatically before synthesis to convert numbers, dates, abbreviations, and units into spoken form.
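The character-level mapping described above can be illustrated with a minimal sketch. The vocabulary below is built from two sample strings and is not the model's actual 244-token table; the names `build_vocab`, `encode`, and the `<unk>` entry are hypothetical. The Unicode NFC step is an assumption added here because accented Vietnamese letters can be stored in composed or decomposed form, and a character-level table only works if both forms resolve to the same entry.

```python
import unicodedata


def build_vocab(corpus):
    """Build a character-to-ID table from sample text (illustrative only)."""
    chars = sorted(
        {ch for line in corpus for ch in unicodedata.normalize("NFC", line)}
    )
    # Reserve ID 0 for characters that were not seen when the table was built.
    vocab = {"<unk>": 0}
    for ch in chars:
        vocab[ch] = len(vocab)
    return vocab


def encode(text, vocab):
    """Map each character to its ID, with no phoneme conversion in between."""
    text = unicodedata.normalize("NFC", text)
    return [vocab.get(ch, vocab["<unk>"]) for ch in text]


vocab = build_vocab(["xin chào việt nam", "0123456789 ,.?!"])
ids = encode("chào nam", vocab)
```

Each character, accented or not, becomes exactly one ID, so `ids` has one entry per character of the input.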
The model weights are hosted on Hugging Face and download automatically when you run the tool. The Hugging Face repository includes 30 sample reference audio files, each paired with a transcript file, which you can use directly as voice prompts without recording your own sample. Demo outputs generated with the model are also included.
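If you download the reference clips locally, pairing each clip with its transcript might look like the sketch below. The layout is an assumption: it supposes each clip `name.wav` sits next to a transcript `name.txt` in the same folder, which may not match how the repository is actually organized. The function name `pair_prompts` is hypothetical.

```python
from pathlib import Path


def pair_prompts(folder):
    """Return (audio_path, transcript_text) pairs for clips with a same-named .txt file.

    Assumes a name.wav / name.txt layout; adjust the suffixes if the
    repository uses a different convention.
    """
    pairs = []
    for wav in sorted(Path(folder).glob("*.wav")):
        txt = wav.with_suffix(".txt")
        if txt.exists():
            pairs.append((wav, txt.read_text(encoding="utf-8").strip()))
    return pairs
```

Clips without a transcript are skipped instead of raising an error, since the transcript is required for a usable voice prompt.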
You can use ViZipVoice through three interfaces. The command-line tool takes a prompt audio file, its transcript, and the target text, then writes an output WAV file. A Gradio web interface lets you select from the included reference speakers and type text to synthesize through a browser. A Python wrapper class is available for integrating the model into your own code.
The README covers installation from source using pip, quality tips for prompt audio (clean recording, correct transcript, one speaker, minimal background noise), and parameters for controlling synthesis speed, number of diffusion steps, and audio postprocessing such as crossfade and silence between segments.
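To make the crossfade and silence postprocessing options more concrete, here is a minimal sketch of what such steps typically do when joining synthesized segments. It is not the project's implementation: the function names, the linear fade curve, and the plain-list sample representation are all assumptions chosen for illustration. Only the 24 kHz sample rate comes from the description above.

```python
SAMPLE_RATE = 24_000  # the 24 kHz rate the model works at


def silence(seconds, sample_rate=SAMPLE_RATE):
    """Return a run of zero-valued samples lasting `seconds`."""
    return [0.0] * int(round(seconds * sample_rate))


def crossfade(a, b, overlap):
    """Join a and b, linearly blending the end of a into the start of b."""
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return a + b
    head, tail = a[:-overlap], a[-overlap:]
    mixed = []
    for i in range(overlap):
        # Weight of b rises from near 0 to near 1 across the overlap.
        w = (i + 1) / (overlap + 1)
        mixed.append(tail[i] * (1.0 - w) + b[i] * w)
    return head + mixed + b[overlap:]


seg_a = [1.0] * 2400  # 0.1 s placeholder segments
seg_b = [1.0] * 2400
joined = crossfade(seg_a, seg_b, overlap=240)  # 10 ms blend
padded = seg_a + silence(0.1) + seg_b          # 0.1 s pause instead
```

A crossfade shortens the result by the overlap length and hides clicks at segment boundaries, while inserted silence lengthens it and produces an audible pause between segments.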
Where it fits
- Generate Vietnamese voiceovers in a specific person's voice using a short reference audio clip.
- Build a Vietnamese audiobook reader by passing chapter text to the Python wrapper class.
- Use the Gradio web interface to demo Vietnamese TTS with one of the 30 included reference speakers.