Bert-VITS2
vits2 backbone with multilingual-bert
Bert-VITS2 is a Python text-to-speech system, no longer actively maintained, that combines VITS2 speech synthesis with a multilingual BERT model for more natural-sounding voices across languages. The team now recommends their newer Fish-Speech project instead.
Bert-VITS2 is a text-to-speech system that combines two components: VITS2, a neural network architecture for generating speech audio from text, and a multilingual BERT model, which provides deeper language understanding to improve how the generated voice sounds. The idea is that BERT can better interpret the meaning and context of text, helping produce more natural-sounding output compared to running VITS2 alone.
VITS is an end-to-end speech synthesis approach that generates audio directly from text input. VITS2 is a revised version of that architecture, and Bert-VITS2 extends it further by feeding BERT embeddings into the model, so the synthesizer receives context-aware features for the text in addition to the text itself. Because the BERT component is multilingual, the system can work across different languages without needing a completely separate model for each one.
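As a rough illustration of what "feeding BERT embeddings into the model" can look like, the sketch below takes one feature vector per text token, repeats it for every phoneme that token produces, and adds the result to the phoneme embeddings. This is a minimal sketch under stated assumptions, not the project's actual code: the function names `expand_to_phonemes` and `fuse`, the toy numbers, and the choice of element-wise addition are all hypothetical, and a real system would typically project the BERT features to the model's hidden size first.

```python
from typing import List, Sequence

Vector = List[float]


def expand_to_phonemes(
    token_features: Sequence[Vector], phones_per_token: Sequence[int]
) -> List[Vector]:
    """Repeat each token-level feature vector once per phoneme of that token."""
    if len(token_features) != len(phones_per_token):
        raise ValueError("need exactly one phoneme count per token")
    expanded: List[Vector] = []
    for feature, count in zip(token_features, phones_per_token):
        # Copy the vector so each phoneme position owns its own list.
        expanded.extend(list(feature) for _ in range(count))
    return expanded


def fuse(
    phoneme_embeddings: Sequence[Vector], context_features: Sequence[Vector]
) -> List[Vector]:
    """Add context features to phoneme embeddings position by position."""
    if len(phoneme_embeddings) != len(context_features):
        raise ValueError("sequences must have the same length")
    return [
        [p + c for p, c in zip(phoneme, context)]
        for phoneme, context in zip(phoneme_embeddings, context_features)
    ]


# Toy data (made-up numbers): two tokens with 2-dimensional "BERT" features.
token_features = [[0.5, -1.0], [2.0, 0.25]]
# Hypothetical alignment: the first token maps to 3 phonemes, the second to 1.
phones_per_token = [3, 1]
# One 2-dimensional embedding per phoneme (4 phonemes in total).
phoneme_embeddings = [[1.0, 1.0], [0.0, 2.0], [-1.0, 0.5], [4.0, 0.0]]

context = expand_to_phonemes(token_features, phones_per_token)
fused = fuse(phoneme_embeddings, context)
print(fused)
```

The point of the expansion step is that BERT works on text tokens while the synthesizer works on phonemes, so the two sequences have different lengths and need an explicit alignment before they can be combined.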
A web-based preprocessing interface is included (webui_preprocess.py) to help with preparing training data. Beyond pointing to that script, the README is brief and does not go into detailed usage instructions.
The README notes that this project is no longer actively maintained. It points to Fish-Speech, a newer project from the same team, as the recommended replacement, so users starting fresh are advised to begin there.
The project is written in Python. The README includes strict usage restrictions prohibiting use for any purpose that would violate Chinese law or for any political purpose.
Where it fits
- Generate natural-sounding speech audio from text in multiple languages using a pre-trained Bert-VITS2 model.
- Prepare your own voice training dataset using the included web preprocessing interface to create a custom voice.
- Study how combining BERT language embeddings with a VITS2 speech model improves prosody and naturalness.
- Use as a reference implementation before migrating to the team's actively maintained Fish-Speech replacement.