speech-to-speech

Python ★ 4.9k updated 1d ago

Build local voice agents with open-source models

This project lets you build a voice agent that runs entirely on your own computer using open-source AI models. You speak to it, it understands you, thinks of a response, and speaks back. No paid API is required, though you can optionally connect to one for the language model step.

The pipeline has four stages that pass data from one to the next. First, voice activity detection listens to the microphone and detects when you are actually speaking. Second, a speech-to-text model transcribes your words. Third, a language model reads the transcription and generates a text reply. Fourth, a text-to-speech model turns that reply into audio you hear. Each stage is swappable: you can pick from a list of supported models for each one depending on your hardware and preference.
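The four-stage flow described above can be sketched in a few lines of Python. This is an illustration only, not the project's actual code: the `VoicePipeline` class, the stage signatures, and the stand-in stage functions (`energy_vad`, `fake_stt`, `echo_llm`, `fake_tts`) are all hypothetical names chosen for this example. A real setup would put an actual model behind each stage.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Optional

# Hypothetical stage signatures; the real project's interfaces may differ.
Audio = List[float]
VAD = Callable[[Audio], Optional[Audio]]  # returns the speech segment, or None for silence
STT = Callable[[Audio], str]              # speech segment -> transcription
LLM = Callable[[str], str]                # transcription -> text reply
TTS = Callable[[str], Audio]              # text reply -> audio samples


@dataclass
class VoicePipeline:
    """Chains the four stages; each field can be swapped independently."""
    vad: VAD
    stt: STT
    llm: LLM
    tts: TTS

    def process(self, chunk: Audio) -> Optional[Audio]:
        segment = self.vad(chunk)
        if segment is None:          # no speech detected, so skip the heavy models
            return None
        text = self.stt(segment)
        reply = self.llm(text)
        return self.tts(reply)


# Stand-in stages so the sketch runs without any models installed.
def energy_vad(chunk: Audio, threshold: float = 0.1) -> Optional[Audio]:
    energy = sum(x * x for x in chunk) / max(len(chunk), 1)
    return chunk if energy > threshold else None


def fake_stt(segment: Audio) -> str:
    return "hello"


def echo_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> Audio:
    return [0.0] * len(text)         # one silent sample per character


pipeline = VoicePipeline(energy_vad, fake_stt, echo_llm, fake_tts)
silent = pipeline.process([0.0] * 10)   # below the energy threshold
spoken = pipeline.process([0.5] * 10)   # above the energy threshold

# Swapping one stage leaves the other three untouched.
shouting = replace(pipeline, llm=str.upper)
```

Because each stage is just a callable with a fixed input and output type, replacing the language model (or any other stage) does not require changes to the rest of the chain, which is the property the "swappable" description refers to.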

You can run the pipeline in several modes. The local mode runs everything on one machine. The server and client mode splits the heavy models onto a server while a lightweight client handles audio. There is also a WebSocket mode and a mode that exposes a real-time API compatible with other apps. On Apple Silicon machines, several of the models have optimized versions that run much faster.
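To make the server and client split concrete, here is a minimal sketch of a thin client sending audio bytes to a process that runs the models and returns audio bytes. It is not the project's wire protocol: the length-prefixed framing, the helper names (`send_msg`, `recv_msg`, `serve_once`), and the `fake_models` stand-in are assumptions made for this example, and a local socket pair stands in for a real network connection.

```python
import socket
import struct
import threading


def send_msg(sock: socket.socket, payload: bytes) -> None:
    """Send one message with a 4-byte big-endian length prefix (assumed framing)."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, since recv may return fewer than requested."""
    buf = bytearray()
    while len(buf) < n:
        part = sock.recv(n - len(buf))
        if not part:
            raise ConnectionError("peer closed the connection")
        buf.extend(part)
    return bytes(buf)


def recv_msg(sock: socket.socket) -> bytes:
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)


def serve_once(sock: socket.socket, run_models) -> None:
    """Server side: receive audio, run the heavy models, send audio back."""
    audio_in = recv_msg(sock)
    send_msg(sock, run_models(audio_in))


def fake_models(audio: bytes) -> bytes:
    # Stand-in for the speech-to-text, language model, and text-to-speech chain.
    return audio[::-1]


# A connected socket pair plays the role of the client/server link.
client_sock, server_sock = socket.socketpair()
server = threading.Thread(target=serve_once, args=(server_sock, fake_models))
server.start()

send_msg(client_sock, b"\x01\x02\x03")   # client: ship captured audio
reply = recv_msg(client_sock)            # client: receive audio to play back

server.join()
client_sock.close()
server_sock.close()
```

The client side only needs to capture, send, receive, and play audio, so it can run on modest hardware while the machine on the other end carries the model load.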

Installation is through a standard Python package manager. The base install covers the most common voice-agent path, and optional extras let you add specific backends for faster transcription, voice cloning, or other features. The project comes from Hugging Face and defaults to models available on their model hub.
