WhisperLiveKit

Python | ★ 10k | updated 8d ago

Simultaneous speech-to-text models

A self-hosted server that converts spoken audio to text in real time with very low delay, supports around 200 languages, identifies who is speaking, and works on NVIDIA GPUs, Apple Silicon, and standard CPUs.

Python, WebSocket, REST API, CUDA, MLX, Whisper, Voxtral Mini | setup: moderate | complexity 3/5

WhisperLiveKit is a self-hosted speech-to-text server designed to transcribe spoken audio in real time, with very low delay between when someone speaks and when the text appears. A basic transcription setup waits for a full pause before processing. This tool instead uses streaming algorithms drawn from recent research that process audio incrementally, so text appears while the speaker is still talking rather than only after a sentence ends.
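To make "incremental output" concrete, here is a minimal sketch of one common streaming policy: commit a word only once two consecutive hypotheses agree on it. The source does not say which policy WhisperLiveKit uses, so the class and function names below are hypothetical and this is an illustration of the general idea, not the project's implementation.

```python
def common_prefix(a: list[str], b: list[str]) -> list[str]:
    """Return the longest run of leading words shared by a and b."""
    prefix = []
    for x, y in zip(a, b):
        if x != y:
            break
        prefix.append(x)
    return prefix


class IncrementalTranscript:
    """Hypothetical helper: commits words once two hypotheses in a row agree."""

    def __init__(self) -> None:
        self.committed: list[str] = []
        self._previous: list[str] = []

    def update(self, hypothesis: str) -> list[str]:
        """Feed the latest full hypothesis; return the newly committed words."""
        words = hypothesis.split()
        stable = common_prefix(self._previous, words)
        self._previous = words
        # Only words beyond what was already committed count as new output.
        new_words = stable[len(self.committed):]
        self.committed.extend(new_words)
        return new_words


transcript = IncrementalTranscript()
for partial in ["the quick", "the quick brown", "the quick brown fox"]:
    print(transcript.update(partial))
```

Each call re-decodes a growing audio buffer in a real system; the point of the sketch is that text can be released before the sentence is finished, at the cost of holding back the last, still-unstable words.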

The project supports speaker identification, meaning it can label who is talking when multiple people are in a conversation. It handles translation between roughly 200 languages through a separate translation component. Voice Activity Detection is built in so the server does not waste processing time when no one is speaking.
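The idea behind Voice Activity Detection can be sketched with a simple energy threshold: frames whose loudness falls below a cutoff are skipped and never reach the transcription model. The source does not say which detector the project uses, and production systems typically rely on a trained model, so the frame size, threshold, and function names here are assumptions chosen for illustration only.

```python
import math


def rms(frame: list[float]) -> float:
    """Root-mean-square energy of one frame of samples in [-1.0, 1.0]."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def speech_frames(samples: list[float], frame_size: int = 160,
                  threshold: float = 0.02):
    """Yield (start_index, frame) for frames loud enough to count as speech.

    frame_size=160 corresponds to 10 ms at a 16 kHz sample rate (assumed).
    """
    for start in range(0, len(samples), frame_size):
        frame = samples[start:start + frame_size]
        if rms(frame) >= threshold:
            yield start, frame


# Two silent frames, one loud frame, one silent frame.
audio = [0.0] * 320 + [0.5] * 160 + [0.0] * 160
voiced_starts = [start for start, _ in speech_frames(audio)]
print(voiced_starts)
```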

Installation is a single pip command. Once running, the server exposes three API styles: a REST endpoint that matches the OpenAI audio transcription format (so existing code written against OpenAI can point at it instead), a WebSocket endpoint compatible with the Deepgram SDK, and a native WebSocket endpoint for real-time streaming. A Chrome browser extension is included for capturing audio directly from web pages.
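As a rough sketch of what "matches the OpenAI audio transcription format" means for a client, the snippet below builds a multipart POST to `/v1/audio/transcriptions`, the path OpenAI's API uses. The host and port (`http://localhost:8000`), the model name, and the assumption that the local server mounts the endpoint at that exact path are not confirmed by the source; check the project's documentation before relying on them.

```python
import urllib.request
import uuid


def build_transcription_request(
    audio: bytes,
    filename: str,
    base_url: str = "http://localhost:8000",  # assumed local address
    model: str = "whisper-1",                 # assumed model identifier
) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style transcription request."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/transcriptions",
        data=head + audio + tail,
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )


request = build_transcription_request(b"RIFF", "meeting.wav")
print(request.get_method(), request.full_url)
```

Sending it is a matter of passing `request` to `urllib.request.urlopen` once a server is actually listening. Code that already uses an OpenAI client library would normally just change its base URL instead of building requests by hand.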

The tool also works offline for file transcription without starting a server at all. You can feed it an audio or video file and get a plain text transcript or an SRT subtitle file. A model management sub-command lets you download, list, and delete transcription models.
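For readers unfamiliar with SRT, the format is simple enough to sketch: a running index, a `start --> end` timestamp line with comma-separated milliseconds, the caption text, and a blank line. The example below converts timed segments into that layout. The `(start, end, text)` tuple shape and the function names are assumptions for illustration and do not reflect the tool's internal data structures.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm as used by SRT subtitle files."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[tuple[float, float, str]]) -> str:
    """Render (start, end, text) segments as the contents of an SRT file."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
            f"{text.strip()}\n"
        )
    # A blank line separates consecutive subtitle blocks.
    return "\n".join(blocks)


print(to_srt([(0.0, 1.5, "Hello"), (1.5, 3.0, "world")]))
```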

Hardware support covers NVIDIA GPUs with CUDA, Apple Silicon via the MLX framework, and standard CPUs. A second model backend called Voxtral Mini (a 4-billion-parameter model from Mistral AI) is offered as an alternative to Whisper, with better per-chunk language detection across 100-plus languages. The code is Apache 2.0 licensed.

Where it fits

Build a live captioning tool for video calls or live streams that shows words as people speak.
Create SRT subtitle files from audio or video recordings without sending data to a cloud service.
Replace OpenAI's transcription API with a local, private alternative using the same REST format.
Add multi-speaker labeling to a meeting recorder so the transcript shows who said what.
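The last use case, showing who said what, usually comes down to merging two timelines: transcript segments and speaker turns. A minimal sketch, assuming both are available as timed tuples, assigns each segment to the speaker whose turn overlaps it the most. The tuple shapes, the `"unknown"` fallback, and the function names are hypothetical. The source does not describe how the project combines the two.

```python
def overlap(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segments(
    segments: list[tuple[float, float, str]],
    turns: list[tuple[float, float, str]],
) -> list[tuple[str, str]]:
    """Pair each (start, end, text) segment with its best-overlapping speaker."""
    labeled = []
    for start, end, text in segments:
        best_speaker, best_overlap = "unknown", 0.0
        for turn_start, turn_end, speaker in turns:
            shared = overlap(start, end, turn_start, turn_end)
            if shared > best_overlap:
                best_speaker, best_overlap = speaker, shared
        labeled.append((best_speaker, text))
    return labeled


segments = [(0.0, 2.0, "Hi there"), (2.0, 4.5, "Hello")]
turns = [(0.0, 2.2, "Speaker 1"), (2.2, 5.0, "Speaker 2")]
for speaker, text in label_segments(segments, turns):
    print(f"{speaker}: {text}")
```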
