silero-models


Silero Models: pre-trained text-to-speech models made embarrassingly simple

Silero Models is a collection of pre-trained text-to-speech models that convert written text into spoken audio. You give the library a string of text and it returns audio of a natural-sounding voice reading it aloud. The project emphasizes minimal setup: in most cases, loading a model and generating speech takes only a few lines of Python code.

The models are built with a particular focus on Russian and other languages from the post-Soviet region, and support has expanded to include Azerbaijani, Armenian, Bashkir, Belarusian, Georgian, Kazakh, Kyrgyz, Tajik, Ukrainian, Uzbek, and several Indic languages. For Russian specifically, the models place stress marks and resolve homographs automatically: when one spelling has several pronunciations, the system picks the right one from context.

Several generations of models are available (V3, V4, V5), with the V5 series being the most current. Each version supports multiple named voices and can output audio at different sample rates to suit different quality needs. The newer models also support SSML, a markup language that lets you control pacing, emphasis, and pronunciation in the generated speech.
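To make the SSML point concrete, here is a minimal sketch that assembles an SSML string using only the Python standard library. The helper names (`pause`, `prosody`, `speak`) are hypothetical and not part of Silero; `<speak>`, `<break>`, and `<prosody>` are standard SSML elements, but which tags and attribute values a given Silero model honors is an assumption to check against the project's documentation.

```python
from xml.sax.saxutils import escape, quoteattr


def pause(ms: int) -> str:
    """Return an SSML break of the given length in milliseconds."""
    return f'<break time="{int(ms)}ms"/>'


def prosody(text: str, rate: str = "medium") -> str:
    """Wrap text in a prosody element that sets the speaking rate."""
    return f"<prosody rate={quoteattr(rate)}>{escape(text)}</prosody>"


def speak(*parts: str) -> str:
    """Wrap already-escaped parts in the SSML root element."""
    return "<speak>" + "".join(parts) + "</speak>"


# Plain text must be escaped so characters like "&" stay valid XML.
ssml = speak(
    escape("Hello & welcome."),
    pause(500),
    prosody("This part is slow.", rate="slow"),
)
print(ssml)
```

Building the markup through small helpers keeps escaping in one place, so user-supplied text cannot break the XML structure.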

The models can be loaded through PyTorch Hub or installed as a Python package via pip. They run on both CPU and GPU and are designed to be fast enough for practical use without requiring specialized hardware. The license for the main Russian models is Creative Commons Attribution-NonCommercial 4.0, meaning non-commercial use is free but commercial applications require a separate arrangement. Some of the CIS regional language models are available under the more permissive MIT license.
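The PyTorch Hub route can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's definitive API: the hub entry point `silero_tts`, the `apply_tts` method, the model id `v4_ru`, the voice name `xenia`, and the sample rates 8000, 24000, and 48000 are recalled from the project's README and may differ for the V5 series, so verify them before relying on this. The wrapper functions themselves are hypothetical.

```python
# Sample rates assumed to be supported; check the model card for your version.
ASSUMED_SAMPLE_RATES = (8000, 24000, 48000)


def load_tts(language: str = "ru", model_id: str = "v4_ru", device: str = "cpu"):
    """Fetch a Silero TTS model through PyTorch Hub (downloads on first call)."""
    import torch  # imported lazily so the helpers below work without torch

    model, example_text = torch.hub.load(
        repo_or_dir="snakers4/silero-models",
        model="silero_tts",      # assumed hub entry point name
        language=language,
        speaker=model_id,        # the hub argument for the model id is named "speaker"
    )
    model.to(device)
    return model, example_text


def synthesize(model, text: str, speaker: str = "xenia", sample_rate: int = 48000):
    """Validate the sample rate, then ask the model for audio."""
    if sample_rate not in ASSUMED_SAMPLE_RATES:
        raise ValueError(f"unsupported sample rate: {sample_rate}")
    return model.apply_tts(text=text, speaker=speaker, sample_rate=sample_rate)


def demo():
    """End-to-end example; needs torch installed and network access."""
    model, example_text = load_tts()
    return synthesize(model, example_text)
```

Calling `demo()` downloads the model on first use and returns the generated audio. Keeping the sample-rate check outside the model call means an invalid setting fails fast, before any download or inference happens.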
