dots.tts

Python ★ 745 updated 5d ago

dots.tts is an open-source text-to-speech system from rednote-hilab that converts written text into spoken audio. It can clone a person's voice from a short audio sample, produce speech in 24 languages, and generate audio at 48 kHz, the sample rate standard in professional audio and video production (CD-quality music uses 44.1 kHz). The model has about 2 billion parameters and is released under the Apache 2.0 license, which permits use, modification, and redistribution, including commercial use, as long as the license's notice requirements are met.

Voice cloning works in two modes. In the first, you provide a reference audio clip and its exact transcript, and the model matches both the sound and the speaking rhythm of that voice. In the second, you provide only the audio clip, and the model copies the general timbre of the voice without matching its rhythm. Apart from cloning, a basic mode generates speech in a generic voice; it is mainly useful once you have fine-tuned the model on a specific speaker.
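The choice between these modes depends only on which reference inputs are supplied, which can be sketched as a small decision function. This is an illustration, not the dots.tts API: `CloneMode` and `pick_mode` are hypothetical names, and two behaviors are assumptions rather than facts from the project. The sketch treats the basic mode as "no reference clip given", and it rejects a transcript that arrives without audio.

```python
from enum import Enum
from typing import Optional


class CloneMode(Enum):
    """Hypothetical labels for the modes described above."""
    FULL = "audio + transcript"    # matches sound and speaking rhythm
    TIMBRE_ONLY = "audio only"     # matches general timbre, not rhythm
    BASIC = "no reference"         # generic voice (assumed: no clip given)


def pick_mode(ref_audio: Optional[str], ref_transcript: Optional[str]) -> CloneMode:
    """Pick a mode from which reference inputs are present."""
    if ref_audio and ref_transcript:
        return CloneMode.FULL
    if ref_audio:
        return CloneMode.TIMBRE_ONLY
    if ref_transcript:
        # Assumption: a transcript alone matches none of the described modes.
        raise ValueError("a reference transcript needs a reference audio clip")
    return CloneMode.BASIC
```

For example, `pick_mode("speaker.wav", None)` returns `CloneMode.TIMBRE_ONLY`, because only the clip is available.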

You can use the system through a command-line tool, a Python library, or a browser-based demo built with Gradio. The command-line tool takes a text string and an optional reference audio file and writes the result to a WAV file. The Python library provides the same functionality for use inside your own code. The Gradio demo launches a local web page where you can type text and hear the output without writing any code.
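To make the output format concrete, here is a minimal sketch of writing mono 48 kHz audio to a WAV file with Python's standard `wave` module. It does not call dots.tts: `write_wav` and `placeholder_tone` are hypothetical helpers, the sine tone stands in for model output, and the 16-bit mono layout is an assumption, since the summary above states only the sample rate.

```python
import math
import struct
import wave
from typing import List, Sequence

SAMPLE_RATE = 48_000  # Hz, the output rate stated in the summary


def placeholder_tone(seconds: float, freq: float = 440.0) -> List[float]:
    """Stand-in for synthesized speech: a quiet sine tone in [-1, 1]."""
    n = int(seconds * SAMPLE_RATE)
    return [0.3 * math.sin(2 * math.pi * freq * i / SAMPLE_RATE) for i in range(n)]


def write_wav(path: str, samples: Sequence[float], sample_rate: int = SAMPLE_RATE) -> int:
    """Write float samples as 16-bit mono PCM and return the frame count."""
    # Clamp each sample to [-1, 1], then scale to a signed 16-bit integer.
    pcm = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
    )
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)         # mono (assumed)
        wf.setsampwidth(2)         # 2 bytes per sample = 16-bit (assumed)
        wf.setframerate(sample_rate)
        wf.writeframes(pcm)
    return len(samples)
```

Calling `write_wav("out.wav", placeholder_tone(0.5))` writes half a second of audio, which at 48 kHz is 24,000 frames.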

Fine-tuning is supported, which means you can train the model further on your own audio data to specialize it for a particular voice or style. The repository includes configuration files and a script to prepare training data from a publicly available speech dataset, so new users have a working example to follow.

On standard benchmarks for speech quality and speaker similarity, dots.tts reports results that the authors describe as state-of-the-art among publicly available systems at the time of release in June 2026.
