EmotiVoice

Python ★ 8.5k updated 1y ago

EmotiVoice 😊: a Multi-Voice and Prompt-Controlled TTS Engine

EmotiVoice is an open-source text-to-speech system that generates expressive speech in English and Chinese, with control over emotion (such as happy, sad, or angry) across more than 2,000 voices.

Python, Docker, CUDA, conda | setup: hard | complexity 3/5

EmotiVoice is an open-source text-to-speech system that can generate spoken audio from text in both English and Chinese, with control over the emotional tone of the output. When you give it text, you also specify an emotion such as happy, sad, angry, or excited, and the system generates speech that reflects that emotion rather than neutral, flat delivery. It offers more than 2,000 distinct voices to choose from.

What most sets it apart from basic text-to-speech tools is the prompt-controlled emotion feature. Instead of just converting words to audio, you tell the system how you want the speaker to sound, and it adjusts pitch, speed, and energy accordingly. This makes it useful for content creators, game developers, or anyone building applications that need expressive rather than robotic-sounding speech.

There are several ways to use it. The quickest is through a Docker container: you pull a pre-built image, run it, and access a web interface in your browser. A full local installation uses Python with conda and pip. The system also exposes an API that is compatible with the OpenAI text-to-speech API format, meaning software already built to use OpenAI's speech service could switch to EmotiVoice with minimal changes. A Mac desktop app was also released as a download.
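To illustrate what "compatible with the OpenAI text-to-speech API format" could mean in practice, here is a minimal Python sketch that builds an OpenAI-style speech request and points it at a locally running server. The request shape (a POST to /v1/audio/speech with a JSON body containing model, input, voice, and response_format) follows OpenAI's speech API. The host, port, model name, and voice ID below are assumptions for illustration, not values taken from the EmotiVoice documentation, and the sketch does not show how an emotion prompt would be passed.

```python
import json
import urllib.request

# Assumed values for illustration only; check the project's own docs.
BASE_URL = "http://localhost:8000"  # hypothetical host and port
MODEL = "emoti-voice"               # hypothetical model name
VOICE = "8051"                      # hypothetical voice ID


def build_speech_request(text, voice=VOICE, base_url=BASE_URL):
    """Build (but do not send) an OpenAI-style POST /v1/audio/speech request."""
    payload = {
        "model": MODEL,
        "input": text,
        "voice": voice,
        "response_format": "mp3",
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text, out_path="speech.mp3"):
    """Send the request and write the returned audio bytes to a file."""
    with urllib.request.urlopen(build_speech_request(text)) as resp:
        audio = resp.read()
    with open(out_path, "wb") as f:
        f.write(audio)
    return out_path
```

Calling synthesize("Hello there") would write the returned audio to speech.mp3, provided a compatible server is listening at the assumed address. Because the request shape matches OpenAI's, an existing client would mainly need its base URL changed.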

Voice cloning is supported: users can fine-tune the system on their own audio recordings to produce speech in a custom voice. Inference requires an NVIDIA GPU, whether running locally or via Docker. The project was created by NetEase Youdao and is released under the Apache 2.0 license.

Where it fits