fish-speech

Python ★ 31k updated 12d ago

SOTA Open Source TTS

Fish Speech is an open-source text-to-speech system that converts written text into realistic spoken audio across 80+ languages, with fine-grained emotional control using inline tags like [whisper] or [excited] inserted directly into the text.

Python · setup: moderate · complexity: 3/5

Fish Speech is an open-source text-to-speech (TTS) system: software that converts written text into spoken audio. It aims to produce speech that sounds natural and expressive rather than robotic, in more than 80 languages.

The system is built on a model called S2 Pro, which has a two-stage architecture. A larger component (described as the "slow" part) reads the text and determines the overall meaning and timing of what is being said. A smaller, faster component then fills in the fine acoustic details that make the voice sound realistic. Together they produce audio that scores highly on benchmarks measuring how close AI-generated speech sounds to a real human speaker.

A key feature is fine-grained emotional control: you can insert short tags such as [whisper], [excited], or [laughing] directly into the text at any point, and the model adjusts its delivery accordingly. This makes it suitable for applications like audiobook narration, voice assistants, or interactive storytelling, where tone and emotion matter.
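To make the inline-tag idea concrete, here is a minimal Python sketch that splits a tagged script into (tag, text) pairs so you can review which tag precedes each passage before sending the script for synthesis. It is only a pre-processing aid and says nothing about how the model handles tags internally. The tag syntax (a lowercase word in square brackets), the "neutral" label for untagged text, and the function name split_by_tags are assumptions made for illustration.

```python
import re
from typing import List, Tuple

# Assumption: a tag is a lowercase word in square brackets, e.g. [whisper].
TAG_PATTERN = re.compile(r"\[([a-z]+)\]")


def split_by_tags(text: str) -> List[Tuple[str, str]]:
    """Split tagged text into (tag, passage) pairs.

    Text before the first tag is labeled "neutral" (a label invented
    for this sketch). Each passage is paired with the tag that
    immediately precedes it in the script.
    """
    segments: List[Tuple[str, str]] = []
    current_tag = "neutral"
    position = 0
    for match in TAG_PATTERN.finditer(text):
        passage = text[position:match.start()].strip()
        if passage:
            segments.append((current_tag, passage))
        current_tag = match.group(1)
        position = match.end()
    # Collect whatever follows the last tag.
    tail = text[position:].strip()
    if tail:
        segments.append((current_tag, tail))
    return segments


if __name__ == "__main__":
    script = "Hello there. [whisper] Keep this quiet. [excited] We won!"
    for tag, passage in split_by_tags(script):
        print(f"{tag:>8}: {passage}")
```

A check like this can catch misspelled or misplaced tags in a long narration script before any audio is generated.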

You would use this if you need to generate realistic spoken audio from text programmatically, for example to build a voice interface, generate audio content at scale, or experiment with voice cloning. It can be run from the command line, through a web interface, or via a server API. The project is written in Python, and the model weights are published on HuggingFace. The license restricts usage to non-commercial purposes; check the terms before using it in a product.
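As a rough illustration of the server-API route, the sketch below builds a JSON POST request with Python's standard library and writes the response bytes to a file. The URL, the "text" field name, and both function names are placeholders invented for this example, not Fish Speech's documented API; take the real route and request schema from the project's own server documentation.

```python
import json
import urllib.request

# Placeholder only: NOT a documented Fish Speech endpoint.
PLACEHOLDER_URL = "http://localhost:8080/tts"


def build_tts_request(text: str, url: str = PLACEHOLDER_URL) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request carrying the text."""
    # The "text" field name is an assumption made for this sketch.
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize_to_file(text: str, out_path: str, url: str = PLACEHOLDER_URL) -> None:
    """Send the request and save the raw response bytes to out_path."""
    request = build_tts_request(text, url)
    with urllib.request.urlopen(request, timeout=60) as response:
        audio_bytes = response.read()
    with open(out_path, "wb") as handle:
        handle.write(audio_bytes)
```

Keeping request construction separate from sending makes it possible to inspect or log each payload when generating audio in bulk, without needing a running server.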

Where it fits

Generate realistic voiceover audio for an audiobook or narration project with emotional tone control built into the script.
Build a voice interface or assistant that speaks back to users with natural-sounding, expressive speech in multiple languages.
Automate audio content creation at scale by sending text to the server API and receiving spoken audio files in return.
Experiment with voice generation and emotional expression by inserting tags like [laughing] or [whisper] into sample text via the web interface.
