VoxCPM

Python ★ 31k updated 11d ago

VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning

VoxCPM is an open-source text-to-speech system that generates natural-sounding speech in 30 languages and can clone voices from short audio clips or create new voices from text descriptions.

Python · PyTorch · Hugging Face · setup: hard · complexity 4/5

VoxCPM is a text-to-speech system, meaning software that converts written text into spoken audio. Its main technical distinction is that it skips the usual step of breaking speech into discrete sound tokens. Instead, it generates speech directly as continuous audio representations, using an architecture that combines diffusion models with autoregressive generation. The project claims this approach produces more natural and expressive speech than tokenization-based systems.
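To make the tokenization trade-off concrete, here is a toy sketch of what discretizing a continuous representation costs. It is not VoxCPM's code or architecture: the random vectors stand in for audio feature frames, and the codebooks, dimensions, and function names are all assumptions chosen for illustration. Each frame is snapped to its nearest codebook entry, and the distance lost in that snap is the quantization error that a tokenizer-free model avoids by keeping the continuous values.

```python
import math
import random

def nearest_code(frame, codebook):
    """Return the codebook vector closest to `frame` (Euclidean distance)."""
    return min(codebook, key=lambda code: math.dist(frame, code))

def mean_quantization_error(frames, codebook):
    """Average distance between each frame and its nearest codebook entry."""
    total = sum(math.dist(f, nearest_code(f, codebook)) for f in frames)
    return total / len(frames)

def random_vectors(rng, count, dim):
    """Hypothetical stand-in for continuous audio feature frames."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(count)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
DIM = 4

frames = random_vectors(rng, 200, DIM)

# A small codebook, and a larger one that contains it as a subset.
small_codebook = random_vectors(rng, 8, DIM)
large_codebook = small_codebook + random_vectors(rng, 248, DIM)

err_small = mean_quantization_error(frames, small_codebook)
err_large = mean_quantization_error(frames, large_codebook)

print(f"mean error with   8 codes: {err_small:.3f}")
print(f"mean error with 256 codes: {err_large:.3f}")
```

A larger codebook reduces the error but never removes it, which is the information loss a continuous representation sidesteps. Whether that translates into more natural speech is the project's claim, not something this toy demonstrates.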

The current version, VoxCPM2, is a 2-billion-parameter model trained on over 2 million hours of multilingual audio data across 30 languages. Beyond standard text-to-speech, it supports three additional capabilities: Voice Design (describing a voice in plain text and having the model create it without any reference recording), Controllable Voice Cloning (copying someone's voice from a short audio clip while optionally adjusting the style), and Ultimate Cloning (reproducing every detail of a voice by providing both the reference audio and its transcript). Output is 48 kHz audio.
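As a rough illustration of what the 48 kHz output rate means when saving generated speech, the sketch below writes one second of audio to a WAV file using only the Python standard library. It does not call VoxCPM: a sine tone stands in for model output, and the mono 16-bit PCM format, the helper name `write_wav_48k`, and the file name are assumptions made for this example.

```python
import math
import os
import struct
import tempfile
import wave

SAMPLE_RATE = 48_000  # Hz, the output rate stated for VoxCPM2

def write_wav_48k(path, samples):
    """Write mono float samples in [-1.0, 1.0] as 16-bit PCM at 48 kHz.

    Returns the clip duration in seconds.
    """
    pcm = bytearray()
    for sample in samples:
        clipped = max(-1.0, min(1.0, sample))  # guard against overshoot
        pcm += struct.pack("<h", int(clipped * 32767))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)       # mono (assumption)
        wav_file.setsampwidth(2)       # 2 bytes = 16-bit PCM (assumption)
        wav_file.setframerate(SAMPLE_RATE)
        wav_file.writeframes(bytes(pcm))
    return len(samples) / SAMPLE_RATE

# Stand-in for generated speech: one second of a 440 Hz tone.
tone = [
    0.3 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)
    for n in range(SAMPLE_RATE)
]

out_path = os.path.join(tempfile.gettempdir(), "tone_48k.wav")
duration = write_wav_48k(out_path, tone)
print(f"wrote {duration:.1f} s of audio to {out_path}")
```

At this rate, one second of mono 16-bit audio is 48,000 samples, or 96,000 bytes of sample data, which is worth keeping in mind when generating long narration.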

Installation is via pip, and the model weights are available on Hugging Face. A Python API, command-line interface, and web demo are all provided. The model can run in real-time streaming mode and is released under the Apache 2.0 license, permitting commercial use.

Where it fits

Convert a written article into spoken audio in 30 languages without any voice recording setup.
Clone a speaker's voice from a short audio sample to narrate new content in their style.
Build a multilingual voice assistant or podcast tool using a single open-source model.
Design a custom synthetic voice by describing its characteristics in text, no reference audio needed.
