Qwen3-TTS

Python ★ 12k updated 3mo ago

Qwen3-TTS is an open-source series of TTS models developed by the Qwen team at Alibaba Cloud, supporting stable, expressive, and streaming speech generation, free-form voice design, and vivid voice cloning.

Qwen3-TTS is a set of open-source AI text-to-speech models from Alibaba that convert text to natural speech in 10 languages, with voice cloning, text-described voice styles, and streaming output starting in under 100 milliseconds.

Python, PyTorch, vLLM | setup: hard | complexity 3/5

Qwen3-TTS is a collection of open-source text-to-speech models built by the Qwen team at Alibaba Cloud. The models take written text as input and produce spoken audio as output, covering ten languages: Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian. Several regional dialect voice profiles are also included.

The collection ships multiple model variants tuned for different tasks. One variant lets you describe a voice in plain text (age, gender, accent, emotion) and generates audio in that style. Another variant clones an existing voice from a three-second audio sample, so you can reproduce a specific speaker's sound. A third variant offers nine pre-built premium voices with controllable style. All variants support streaming output, meaning audio can start playing almost immediately instead of waiting for the full clip to render.

The README highlights a latency figure of 97 milliseconds from the moment text arrives to the moment the first audio packet is sent out. The underlying architecture avoids the common two-stage design (a language model feeding a separate diffusion model) in favor of a single end-to-end approach, which the team says reduces the errors that can accumulate when two separate systems are chained together.
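To make the first-packet latency idea concrete, here is a minimal sketch of how one might measure time to first audio chunk for any streaming source. It does not call Qwen3-TTS: `fake_stream` is a hypothetical stand-in that yields dummy audio bytes after a fixed delay, and `time_to_first_chunk` is an illustrative helper name, not part of any library. With a real streaming generator in place of the stand-in, the same timing logic would apply.

```python
import time
from typing import Iterable, Iterator, Tuple


def fake_stream(n_chunks: int = 5, delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS call (NOT the Qwen3-TTS API).

    Yields n_chunks blocks of silent 16-bit PCM, sleeping before each one
    to imitate generation time.
    """
    for _ in range(n_chunks):
        time.sleep(delay_s)
        yield b"\x00\x00" * 240  # 240 samples of 16-bit silence


def time_to_first_chunk(stream: Iterable[bytes]) -> Tuple[float, float, int]:
    """Consume a stream and return (first_chunk_s, total_s, total_bytes)."""
    start = time.perf_counter()
    first_chunk_s = None
    total_bytes = 0
    for chunk in stream:
        if first_chunk_s is None:
            # Latency a listener experiences before playback can begin.
            first_chunk_s = time.perf_counter() - start
        total_bytes += len(chunk)
    total_s = time.perf_counter() - start
    if first_chunk_s is None:
        raise ValueError("stream produced no audio chunks")
    return first_chunk_s, total_s, total_bytes


if __name__ == "__main__":
    first, total, size = time_to_first_chunk(fake_stream())
    print(f"first chunk after {first * 1000:.1f} ms, "
          f"full clip after {total * 1000:.1f} ms, {size} bytes")
```

The gap between the two timings is the benefit of streaming: playback can begin at the first figure while generation continues until the second.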

Two model sizes are available: 0.6B and 1.7B parameters. The smaller models run faster and need less hardware; the larger models generally produce higher-quality or more controllable output. The models can be loaded through the qwen-tts Python package or through vLLM, a popular high-throughput inference engine. Fine-tuning on custom data is also supported for teams that need a specialized voice style. A hosted API is available via Alibaba Cloud for those who do not want to run the models locally.
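As a rough illustration of what the two sizes mean for hardware, the sketch below estimates the memory needed just to hold the weights. The precisions (2 bytes per parameter for bf16/fp16, 4 for fp32) are assumptions for the sake of the arithmetic, not figures from the README, and the nominal parameter counts are taken at face value. Real usage will be higher once activations, caches, and any audio codec components are included.

```python
def weight_memory_gb(n_params: float, bytes_per_param: int) -> float:
    """Approximate memory for model weights alone, in gigabytes (1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


# Nominal sizes from the model names; actual parameter counts may differ.
MODEL_SIZES = {"0.6B": 0.6e9, "1.7B": 1.7e9}

for name, n_params in MODEL_SIZES.items():
    half = weight_memory_gb(n_params, 2)  # assumed bf16/fp16 weights
    full = weight_memory_gb(n_params, 4)  # assumed fp32 weights
    print(f"{name}: ~{half:.1f} GB at 16-bit, ~{full:.1f} GB at 32-bit")
```

Under these assumptions the 0.6B model needs on the order of 1.2 GB for 16-bit weights and the 1.7B model about 3.4 GB, which is a lower bound on what the device must provide.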

The repository includes a local web demo, code examples for each major use case, and links to model weights on Hugging Face and ModelScope.

Where it fits

Generate spoken audio from text in 10 languages for a voice assistant, podcast tool, or accessibility feature
Clone a specific speaker's voice from a 3-second audio sample to produce personalized speech output
Describe a voice in plain text (age, gender, accent, emotion) and generate matching speech without a pre-recorded sample
Stream real-time text-to-speech into an application with under 100 ms of latency to the first audio packet
