VoiceCraft

Jupyter Notebook ★ 8.5k updated 21d ago

Zero-Shot Speech Editing and Text-to-Speech in the Wild

VoiceCraft is an AI model that clones a person's voice from a short audio sample, then generates new spoken words or edits existing recordings to sound like that person, working on real-world audio like podcasts and audiobooks.

Python, PyTorch, CUDA, Jupyter Notebook, Docker, Gradio | setup: hard | complexity 4/5

VoiceCraft is an AI system that can edit existing speech recordings or generate new speech from text, using only a short sample of a person's voice as a reference. If you give it a few seconds of audio, it can produce new spoken words that sound like the same person, or it can modify what was already said in a recording. The project describes this as working on real-world audio sources like audiobooks, YouTube videos, and podcasts, not just controlled studio recordings.

The underlying model is a neural codec language model trained by token infilling: it learns to predict missing pieces of audio from the surrounding context, similar to how some text AI models fill in blanks within a sentence. Two model sizes are available on HuggingFace, a 330 million parameter version and a larger 830 million parameter version, and enhanced variants of both were released in April 2024.
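The fill-in-the-blank idea can be illustrated with a toy sketch. It is not VoiceCraft's code: the real model works on audio codec tokens and has its own rearrangement scheme, while the function name `rearrange_for_infilling` and the `<M1>` placeholder below are hypothetical and exist only for illustration. The sketch shows how a span can be cut out of a sequence and moved to the end, so that a left-to-right model sees the context on both sides of the gap before it has to predict the missing piece.

```python
from typing import List, Tuple

MASK = "<M1>"  # hypothetical placeholder marking where the span was removed


def rearrange_for_infilling(
    tokens: List[str], start: int, end: int
) -> Tuple[List[str], List[str]]:
    """Split a sequence into (context, target) for a fill-in-the-blank task.

    The span tokens[start:end] is replaced by MASK in the context and
    becomes the target, prefixed by the same MASK so the two can be paired.
    """
    if not 0 <= start < end <= len(tokens):
        raise ValueError("span must be non-empty and inside the sequence")
    context = tokens[:start] + [MASK] + tokens[end:]
    target = [MASK] + tokens[start:end]
    return context, target


if __name__ == "__main__":
    context, target = rearrange_for_infilling(["a", "b", "c", "d", "e"], 1, 3)
    print(context)  # the sequence with the span replaced by the placeholder
    print(target)   # the placeholder followed by the removed span
```

A model trained on `context + target` sequences of this shape reads both sides of the gap first, which is what lets it regenerate a region in the middle of a recording and not only continue from the end.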

There are several ways to try it. The easiest is a Google Colab notebook that runs in a browser without any local installation. A Docker-based option is also available for those comfortable with containers. For local installation, setup requires Python, conda, and a CUDA-capable NVIDIA graphics card. The setup process installs a number of audio processing libraries and a forced-alignment tool called Montreal Forced Aligner, which helps the model match text to the timing of audio. A Gradio web interface can be run locally or accessed through HuggingFace Spaces.
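Before attempting a local install, it can help to check which of the tools mentioned above are already on the machine. The following is a minimal sketch using only the Python standard library. The executable names are assumptions: `conda`, `docker`, and `nvidia-smi` are the usual command names, and `mfa` is assumed to be the command-line name of Montreal Forced Aligner. Finding `nvidia-smi` only suggests that an NVIDIA driver is installed; it does not prove that PyTorch can use CUDA.

```python
import shutil
import sys
from typing import Dict, Iterable

# Assumed executable names; adjust them if your installation differs.
TOOLS = ("conda", "nvidia-smi", "docker", "mfa")


def check_prerequisites(tools: Iterable[str] = TOOLS) -> Dict[str, bool]:
    """Report which of the given executables can be found on PATH."""
    return {name: shutil.which(name) is not None for name in tools}


if __name__ == "__main__":
    print(f"Python {sys.version_info.major}.{sys.version_info.minor}")
    for name, found in check_prerequisites().items():
        print(f"{name:12s} {'found' if found else 'missing'}")
```

If most of these are missing, the Colab notebook or the HuggingFace Spaces demo is likely the quicker way to try the model.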

The repository includes Jupyter notebooks for both text-to-speech inference and speech editing, plus command-line scripts for integrating the model into other projects. Training and finetuning instructions are also included for those who want to adapt the model to different voices or datasets.

This is a research project backed by a published academic paper. It is primarily aimed at researchers and developers working in audio, though the Colab and HuggingFace demos make it accessible to anyone curious about AI voice generation.

Where it fits

Clone a speaker's voice from a podcast clip and generate new sentences that sound like that person
Edit an audiobook recording to fix a mispronounced word without re-recording the whole passage
Build a voice-over tool that generates spoken audio in a custom voice from a written text script
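For the audiobook use case, an editing tool first needs to know which stretch of audio to regenerate. The sketch below assumes word-level timestamps are already available, for example from a forced aligner, as `(word, start_sec, end_sec)` tuples. The function `edit_span` and this data layout are hypothetical and are not part of VoiceCraft; the sketch only shows how a corrected transcript can be compared with the original to find the time range that needs to change.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

Word = Tuple[str, float, float]  # (word, start_sec, end_sec)


def edit_span(aligned: List[Word], new_text: str) -> Optional[Tuple[float, float]]:
    """Return the (start, end) time range covering every changed word.

    Returns None when the new transcript matches the original, or when
    there is no alignment to work from.
    """
    if not aligned:
        return None
    old_words = [word.lower() for word, _, _ in aligned]
    new_words = new_text.lower().split()
    matcher = SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    changes = [op for op in matcher.get_opcodes() if op[0] != "equal"]
    if not changes:
        return None
    first = changes[0][1]   # index of the first changed original word
    last = changes[-1][2]   # index one past the last changed original word
    if first == last:
        # Pure insertion: no original word is replaced, so return a
        # zero-length span at the boundary where the new words go.
        t = aligned[first][1] if first < len(aligned) else aligned[-1][2]
        return (t, t)
    return (aligned[first][1], aligned[last - 1][2])


if __name__ == "__main__":
    words = [("the", 0.0, 0.2), ("quick", 0.2, 0.5),
             ("brwn", 0.5, 0.9), ("fox", 0.9, 1.3)]
    print(edit_span(words, "the quick brown fox"))
```

In practice an editor would probably pad the returned range slightly on both sides so the regenerated audio blends with its neighbours; that margin is left out here.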
