Confucius4-TTS

Python ★ 269 updated 3d ago

Confucius4-TTS: a Multilingual and Cross-Lingual Zero-Shot TTS Engine

Confucius4-TTS is a text-to-speech system from NetEase Youdao, a major Chinese technology company. It converts written text into spoken audio in 14 languages: Chinese, English, Japanese, Korean, German, French, Spanish, Indonesian, Italian, Thai, Portuguese, Russian, Malay, and Vietnamese.

The system's defining capability is accent-free cross-lingual voice cloning. Given a short audio sample of a speaker, it can generate speech in a different language while preserving that speaker's voice characteristics, with no additional training and no transcript of the reference audio. For example, you could provide a few seconds of a Chinese speaker and have the system read English text in that person's voice, sounding natural rather than accented.

The system also transfers emotional qualities from the reference audio, not just vocal tone. The architecture combines a speech encoder with a large language model, following a pattern that has become common in recent AI speech research.

Performance is documented in the README through benchmark tables comparing Confucius4-TTS against other systems, including F5-TTS, CosyVoice, Spark-TTS, and ElevenLabs, across multiple test sets. The metrics are word error rate (how often the generated speech deviates from the intended words; lower is better) and speaker similarity (how closely the voice matches the reference speaker; higher is better). Results vary by language pair and benchmark.
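To make the two metrics concrete, here is a minimal sketch of how they are commonly computed. It is not taken from the Confucius4-TTS README, which does not describe its evaluation pipeline in this summary. The function names `word_error_rate` and `cosine_similarity` are illustrative, and the sketch assumes that the generated audio has already been transcribed by a speech recognizer and that speaker embeddings have already been extracted by some speaker-verification model.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two speaker-embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)
```

Under these assumptions, a transcript with one wrong word out of six scores a word error rate of 1/6, and two identical embeddings score a similarity of 1.0. For languages written without spaces between words, such as Chinese, evaluations typically count errors per character rather than per word.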

As of the repository's last update, the source code and model weights have not yet been published; the README states they are under preparation. An online demo is available on the project's website, where users can try the system without setting anything up locally. The project is published under the Apache 2.0 license.
