stt

Python ★ 4.6k updated 5mo ago

Voice Recognition to Text Tool / An offline, locally running tool that converts audio and video to subtitles, with output as JSON, SRT subtitles, or plain text

This is an offline, locally running tool that converts spoken audio or video into text. You give it a video or audio file, choose the language and which AI model to use, and it returns the transcribed text. The output can be saved as a plain text file, a JSON file, or an SRT subtitle file with timestamps, a widely supported format for adding captions to videos.
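To make the three output formats concrete, here is a minimal sketch of how timestamped segments could be rendered as SRT, plain text, and JSON. The `Segment` class and its field names are assumptions for illustration, not the tool's actual schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Segment:
    """One transcribed span. Hypothetical structure, not the tool's own."""
    start: float  # seconds from the beginning of the media
    end: float    # seconds from the beginning of the media
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Render numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        timing = f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}"
        cues.append(f"{index}\n{timing}\n{seg.text.strip()}\n")
    return "\n".join(cues)


def to_text(segments) -> str:
    """Render the transcript as plain text, one segment per line."""
    return "\n".join(seg.text.strip() for seg in segments)


def to_json(segments) -> str:
    """Render the segments as a JSON array, keeping non-ASCII text readable."""
    return json.dumps([asdict(seg) for seg in segments], ensure_ascii=False, indent=2)
```

Calling `to_srt([Segment(0.0, 1.5, "Hello")])` yields a single cue numbered 1 with the timing line `00:00:00,000 --> 00:00:01,500`.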

The tool is built on top of faster-whisper (written as fast-whisper in the project README), an open-source reimplementation of OpenAI's Whisper speech recognition model. Models come in several sizes: tiny, base, small, medium, and large-v3. Smaller models run faster and need less computing power, while larger models produce more accurate transcriptions. You download whichever model size fits your hardware and place it in the models folder.
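As a rough illustration of what loading one of these models looks like, the sketch below uses the faster-whisper library directly. It is not the tool's own code: the `models` folder name comes from the text above, while the helper names and the assumption that the folder uses the layout faster-whisper expects for `download_root` are mine.

```python
from pathlib import Path

# Model sizes named in the description above.
MODEL_SIZES = ("tiny", "base", "small", "medium", "large-v3")


def load_model(size: str, models_dir: str = "models"):
    """Load a locally stored model by size (hypothetical helper)."""
    if size not in MODEL_SIZES:
        raise ValueError(f"unknown model size {size!r}; expected one of {MODEL_SIZES}")
    # Imported lazily so the size check works without the library installed.
    from faster_whisper import WhisperModel

    return WhisperModel(
        size,
        device="auto",                        # let the library choose GPU or CPU
        download_root=str(Path(models_dir)),  # assumed location of the model files
        local_files_only=True,                # never reach out to the network
    )


def transcribe(model, media_path: str, language: str):
    """Return (start, end, text) tuples for each recognized segment."""
    segments, _info = model.transcribe(media_path, language=language)
    return [(seg.start, seg.end, seg.text) for seg in segments]
```

With the `tiny` model in place, `transcribe(load_model("tiny"), "talk.mp4", "en")` would return the timed segments of a hypothetical file `talk.mp4`.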

The README is primarily in Chinese, but the tool itself supports over a dozen languages, including Chinese, English, French, German, Japanese, Korean, Russian, Spanish, and others. If your machine has an NVIDIA graphics card and the CUDA software installed, the tool uses the GPU automatically to speed up processing.
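One way automatic GPU selection can work is sketched below. This is a heuristic of my own choosing, not the tool's actual detection logic: it treats a working `nvidia-smi` command as a sign that an NVIDIA GPU and driver are present, which does not by itself prove that the CUDA libraries are installed.

```python
import shutil
import subprocess


def cuda_probably_available() -> bool:
    """Heuristic: True if `nvidia-smi` is on PATH and lists GPUs successfully."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    try:
        result = subprocess.run([exe, "-L"], capture_output=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def pick_device(force_cpu: bool = False) -> str:
    """Return "cuda" when a GPU looks usable, otherwise "cpu"."""
    if not force_cpu and cuda_probably_available():
        return "cuda"
    return "cpu"
```

The returned string matches the device names that faster-whisper accepts, so it could be passed as the `device` argument when loading a model.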

There are two ways to run it. Windows users can download a pre-compiled package that starts with a double-click and opens a browser interface for uploading files. Users on Linux, Mac, or Windows who prefer to run from source need Python between versions 3.9 and 3.11, and must also install ffmpeg, a standard tool for working with audio and video files.
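Before running from source, the two prerequisites named above can be checked with a short preflight script. This is a sketch with hypothetical function names. It only confirms that the interpreter version is in range and that an `ffmpeg` executable is on the PATH.

```python
import shutil
import sys


def python_version_ok(version=None) -> bool:
    """True if the (major, minor) version lies between 3.9 and 3.11 inclusive."""
    if version is None:
        version = sys.version_info
    major_minor = tuple(version[:2])
    return (3, 9) <= major_minor <= (3, 11)


def ffmpeg_on_path() -> bool:
    """True if an `ffmpeg` executable can be found on PATH."""
    return shutil.which("ffmpeg") is not None


def preflight() -> list:
    """Return a list of human-readable problems; empty means ready to run."""
    problems = []
    if not python_version_ok():
        found = ".".join(str(part) for part in sys.version_info[:3])
        problems.append(f"Python 3.9 to 3.11 is required, found {found}")
    if not ffmpeg_on_path():
        problems.append("ffmpeg was not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("Problem:", problem)
```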

Beyond the browser interface, the tool also exposes an API endpoint that follows the request format of OpenAI's speech-to-text service. This means software that was built to call OpenAI's API can be pointed at this local server instead, with no internet connection required. The project acknowledges fast-whisper, Flask, ffmpeg, and the Layui front-end library as its main dependencies.
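To show what pointing a client at the local server might involve, the sketch below builds an OpenAI-style transcription request using only the standard library. The path `/audio/transcriptions` and the form fields `file`, `model`, and `response_format` follow OpenAI's API. The base URL, including the port, is an assumption and should be checked against the project's README.

```python
import urllib.request
import uuid

# Assumption: address of the local server; confirm it in the project's README.
BASE_URL = "http://127.0.0.1:9977/v1"


def build_transcription_request(audio_bytes: bytes, filename: str,
                                model: str = "tiny",
                                response_format: str = "json",
                                base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a multipart/form-data POST in the shape of OpenAI's transcription call."""
    boundary = uuid.uuid4().hex
    parts = []
    # Plain form fields.
    for name, value in (("model", model), ("response_format", response_format)):
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode()
        )
    # The media file itself.
    parts.append(
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n".encode()
        + audio_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return urllib.request.Request(
        f"{base_url}/audio/transcriptions",
        data=b"".join(parts),
        method="POST",
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
```

With the server running, the request could be sent with `urllib.request.urlopen(req)` and the response body read as the transcription. Clients built on OpenAI's official SDK would instead only need their base URL changed to the local address.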
