whisper
Robust Speech Recognition via Large-Scale Weak Supervision
Whisper is OpenAI's open-source speech recognition model. It converts audio files to text, supports 99+ languages, can translate speech directly into English, and installs with a single pip command.
Whisper is OpenAI's speech recognition system. The README describes it as a general-purpose speech recognition model trained on a large dataset of diverse audio, and as a multitasking model that can handle multilingual speech recognition, speech translation, and language identification. In everyday terms, you hand it a sound file and it gives you back the text of what was said — either as a transcript in the original language, or translated into English.
Under the hood it is a Transformer sequence-to-sequence model: the audio is converted to a log-Mel spectrogram, an encoder processes it, and a decoder predicts the corresponding text. Several speech tasks are represented jointly as one sequence of tokens, with special control tokens marking which task is being requested; this is how one model can replace what used to be several separate stages of a speech pipeline. When transcribing, the audio is processed with a sliding 30-second window.
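To make the 30-second window concrete, here is a minimal sketch that only computes window boundaries for a recording of a given length. It is a simplification and not Whisper's own code: the function name window_bounds is hypothetical, and the stride is fixed at 30 seconds, whereas the real transcription loop advances according to the timestamps the model predicts and pads a short final window out to 30 seconds.

```python
SAMPLE_RATE = 16_000   # Whisper works on audio resampled to 16 kHz
WINDOW_SECONDS = 30    # length of each window fed to the model


def window_bounds(duration_s: float, window_s: float = WINDOW_SECONDS):
    """Return (start, end) times in seconds for consecutive fixed windows.

    Simplifying assumption: every window starts exactly where the
    previous one ended.
    """
    bounds = []
    start = 0.0
    while start < duration_s:
        end = min(start + window_s, duration_s)
        bounds.append((start, end))
        start += window_s
    return bounds


if __name__ == "__main__":
    # A 75-second recording is covered by three windows.
    for start, end in window_bounds(75.0):
        print(f"{start:6.1f}s -> {end:6.1f}s")
```

At 16 kHz, one full window holds 30 × 16,000 = 480,000 audio samples.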
Whisper comes in six model sizes (tiny, base, small, medium, large, and turbo), ranging from 39 million parameters (tiny) to 1.55 billion (large) and from roughly 1 GB to roughly 10 GB of required VRAM, with corresponding tradeoffs between speed and accuracy. Four of the sizes (tiny, base, small, and medium) also have English-only versions, named with a .en suffix, that tend to perform better on English, especially at the tiny and base sizes. The turbo model is described as an optimized version of large-v3: faster than large, but not trained for translation.
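As a rough illustration of the size tradeoff, the sketch below picks the largest model whose stated VRAM requirement fits a given budget. The helper largest_model_that_fits is hypothetical. The figures for base, small, medium, and turbo do not appear in the text above; they are recalled from the table in the Whisper README and should be treated as assumptions to verify against the current README.

```python
# (name, parameters in millions, approximate required VRAM in GB)
# Assumed values; check the Whisper README for the current table.
MODELS = [
    ("tiny", 39, 1),
    ("base", 74, 1),
    ("small", 244, 2),
    ("medium", 769, 5),
    ("turbo", 809, 6),
    ("large", 1550, 10),
]


def largest_model_that_fits(vram_gb: float):
    """Return the name of the largest model that fits, or None if none do."""
    fitting = [m for m in MODELS if m[2] <= vram_gb]
    if not fitting:
        return None
    # "Largest" here means the highest parameter count.
    return max(fitting, key=lambda m: m[1])[0]


if __name__ == "__main__":
    for budget in (1, 4, 8, 12):
        print(budget, "GB ->", largest_model_that_fits(budget))
```

Parameter count is only one criterion. Since turbo is not trained for translation, a translation workload would need one of the other multilingual models even when turbo fits.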
You install it as a Python package with pip install -U openai-whisper. It depends on PyTorch and on the ffmpeg command-line tool, which it uses to read audio files and which must be installed separately on your system. There is a whisper command for transcribing from the shell and a small Python library for using the model from code.
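A minimal sketch of the Python library in use, assuming the openai-whisper package and ffmpeg are already installed. whisper.load_model and model.transcribe are the library's documented entry points; the wrapper names ffmpeg_available and transcribe_file and the path audio.mp3 are placeholders chosen for this example.

```python
import shutil


def ffmpeg_available() -> bool:
    """Return True if an ffmpeg executable is found on PATH."""
    return shutil.which("ffmpeg") is not None


def transcribe_file(path: str, model_name: str = "turbo") -> str:
    """Transcribe one audio file and return the recognized text."""
    if not ffmpeg_available():
        raise RuntimeError("ffmpeg not found on PATH; Whisper needs it to read audio")

    # Imported here so the module can be loaded without the package installed.
    import whisper

    model = whisper.load_model(model_name)  # downloads the weights on first use
    result = model.transcribe(path)         # dict with "text", "segments", "language"
    return result["text"]


if __name__ == "__main__":
    print(transcribe_file("audio.mp3"))  # placeholder path
```

The equivalent from the shell would be along the lines of whisper audio.mp3 --model turbo.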
Where it fits
- Transcribe a recorded meeting, podcast, or lecture audio file to text without a paid cloud transcription service.
- Add subtitles to a video by feeding the audio to Whisper and formatting the timestamped output as SRT.
- Build a multilingual speech-to-text feature in your app that handles dozens of languages with a single model.
- Translate spoken foreign-language audio directly into English text without a separate translation step.
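The subtitle use case can be sketched as follows. The code assumes each segment is a dict with "start", "end", and "text" keys, which is the shape of the entries in the "segments" list returned by model.transcribe. The function names srt_timestamp and segments_to_srt are hypothetical, and the sample data is hand-written, not real model output.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments) -> str:
    """Build SRT text from dicts with "start", "end", and "text" keys."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    # A blank line separates consecutive subtitle entries.
    return "\n".join(blocks)


if __name__ == "__main__":
    # Hand-written sample in the shape of result["segments"].
    sample = [
        {"start": 0.0, "end": 2.5, "text": " Hello and welcome."},
        {"start": 2.5, "end": 6.0, "text": " Today we talk about speech."},
    ]
    print(segments_to_srt(sample))
```

The whisper command can also write SRT files itself via its --output_format option, so a hand-rolled formatter like this is mainly useful when you need to post-process segments first. For the translation use case, passing task="translate" to model.transcribe yields English segments that can be formatted the same way; because turbo is not trained for translation, that call needs one of the other multilingual models.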