Whisper

C++ ★ 11k updated 28d ago

High-performance GPGPU inference of OpenAI's Whisper automatic speech recognition (ASR) model

A Windows desktop app and developer library that transcribes audio or video to text using your GPU. It runs OpenAI's Whisper model roughly twice as fast as the original, with no Python installation required.

C++ · DirectCompute · Direct3D 11 · C# · PowerShell · NuGet · setup: moderate · complexity 3/5

OpenAI's Whisper is a speech recognition system that converts spoken audio into text. This project brings that capability to Windows by running it entirely on the GPU, making it much faster than the original Python-based version. On a mid-range graphics card, it can transcribe a three-and-a-half-minute audio clip in about 19 seconds, compared with about 45 seconds for the original Python version.
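To put the quoted timings in perspective, the arithmetic can be sketched as follows. This is only an illustration: the function names are hypothetical, and the clip length is taken as 210 seconds, an approximation of "three and a half minutes".

```cpp
#include <cassert>
#include <cstdio>

// Real-time factor: seconds of compute per second of audio. Lower is faster.
double realTimeFactor(double audioSeconds, double processingSeconds)
{
    assert(audioSeconds > 0.0);
    return processingSeconds / audioSeconds;
}

// How many times faster the new run is than the baseline run.
double speedup(double baselineSeconds, double newSeconds)
{
    assert(newSeconds > 0.0);
    return baselineSeconds / newSeconds;
}

// Prints the comparison for the figures quoted in the text.
void printQuotedComparison()
{
    const double clip = 210.0;     // assumed: 3.5 minutes of audio
    const double gpuRun = 19.0;    // seconds, this project
    const double pythonRun = 45.0; // seconds, original Python version
    std::printf("real-time factor (GPU):    %.3f\n", realTimeFactor(clip, gpuRun));
    std::printf("real-time factor (Python): %.3f\n", realTimeFactor(clip, pythonRun));
    std::printf("speedup:                   %.2fx\n", speedup(pythonRun, gpuRun));
}
```

With these inputs the speedup is 45 / 19, a little under 2.4, which is consistent with the "roughly twice as fast" summary.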

The project ships a ready-to-use desktop application called WhisperDesktop. You download a model file (around 1.4 gigabytes), point it at an audio or video file, and it produces a transcript. There is also a live capture mode that listens to a microphone and transcribes speech in real time, with a detection system that ignores silence and only processes actual speech.
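The idea of skipping silence can be illustrated with a simple energy gate. The summary does not say how the project's detector works, so the sketch below is a generic stand-in rather than its actual method; the function names, frame size, and threshold are all assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Root-mean-square level of one frame of PCM samples in the range [-1, 1].
float frameRms(const float* samples, std::size_t count)
{
    assert(count > 0);
    double sumSquares = 0.0;
    for (std::size_t i = 0; i < count; ++i)
        sumSquares += static_cast<double>(samples[i]) * samples[i];
    return static_cast<float>(std::sqrt(sumSquares / count));
}

// Returns one flag per complete frame: true when the frame is loud enough
// to be treated as speech. A trailing partial frame is ignored.
std::vector<bool> detectSpeechFrames(const std::vector<float>& pcm,
                                     std::size_t frameSize,
                                     float rmsThreshold)
{
    assert(frameSize > 0);
    std::vector<bool> flags;
    for (std::size_t start = 0; start + frameSize <= pcm.size(); start += frameSize)
        flags.push_back(frameRms(pcm.data() + start, frameSize) >= rmsThreshold);
    return flags;
}
```

A capture loop built on this idea would pass only the frames flagged true to the transcriber. Real detectors usually add smoothing across frames so that short pauses inside a sentence are not cut out.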

Under the hood, the project uses Windows graphics infrastructure (DirectCompute, part of Direct3D 11) to run the AI model on your GPU rather than your processor. This is vendor-agnostic: it works with graphics cards from Nvidia, AMD, and Intel, as long as the card was made after roughly 2012. The entire runtime fits in a 431-kilobyte DLL, compared to nearly 10 gigabytes of dependencies required by the original Python version.

For developers who want to build this into their own software, there is a COM-style programming interface that works with C++ or C#. A pre-built C# wrapper is available through NuGet, and there is also scripting support for PowerShell. The source code compiles with the free Community edition of Visual Studio 2022.

The project only runs on 64-bit Windows (Windows 8.1 or later). It requires a CPU with AVX1 instruction support, which covers most desktop and laptop processors from 2011 onward. Performance varies by GPU, with the author noting that cards with faster memory tend to produce the best results.
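An application that embeds the library may want to check for AVX before loading it. The sketch below shows one common way to do that; the function name is hypothetical and this is not taken from the project's source. It uses the `__cpuid` and `_xgetbv` intrinsics on MSVC and `__builtin_cpu_supports` on GCC or Clang, and reports false on any other compiler or architecture.

```cpp
#include <cassert>

#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>
#include <immintrin.h>
#endif

// Returns true when both the CPU and the operating system support AVX.
bool cpuSupportsAvx()
{
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    int regs[4] = {0, 0, 0, 0};
    __cpuid(regs, 1);                                    // leaf 1: feature flags
    const bool osxsave = (regs[2] & (1 << 27)) != 0;     // ECX bit 27
    const bool avx = (regs[2] & (1 << 28)) != 0;         // ECX bit 28
    if (!osxsave || !avx)
        return false;
    // The OS must also save XMM and YMM state (XCR0 bits 1 and 2).
    const unsigned long long xcr0 = _xgetbv(0);
    return (xcr0 & 0x6) == 0x6;
#elif (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    return __builtin_cpu_supports("avx") != 0;
#else
    return false; // unknown compiler or non-x86 target: assume no AVX
#endif
}
```

Calling `cpuSupportsAvx()` once at startup and showing a clear message when it returns false is friendlier than letting the process crash on an unsupported instruction.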

Where it fits

Transcribe audio or video files to text on Windows in seconds using your GPU, without installing Python.
Add real-time microphone transcription to a Windows application via the COM API or C# NuGet wrapper.
Build a desktop tool that captions video files automatically by integrating the provided C# library.
Transcribe long recordings fully offline, with no cloud service or internet connection required.
