FunASR

Python ★ 18k updated 3h ago

Industrial-grade speech recognition toolkit: 170x realtime, 50+ languages, speaker diarization, emotion detection, streaming, and OpenAI-compatible API.

A Python speech recognition toolkit that transcribes audio to text and adds punctuation, speaker labels, and emotion detection, with pretrained models available from ModelScope and Hugging Face.

Python, PyTorch, CUDA; setup: hard; complexity: 4/5

FunASR is a toolkit for turning recorded or live audio into text, and for the surrounding jobs that make that text useful. The README describes it as a bridge between academic research and industrial applications in speech recognition, aiming to make it easier for researchers and developers to train, fine-tune, and deploy speech models. The name is a play on ASR, the short name for automatic speech recognition.

Around the core transcription feature, the toolkit bundles related tasks. Voice activity detection finds where speech actually occurs in an audio file so that silent stretches are skipped. Punctuation restoration adds commas and full stops to raw transcripts. Speaker verification checks whether a voice belongs to a given speaker, and speaker diarization works out who is talking and when the speaker changes. Multi-talker ASR handles overlapping voices, and the toolkit also covers keyword spotting and emotion recognition.

The project ships pretrained models that can be pulled from ModelScope and Hugging Face. One headline model is Paraformer-large, a non-autoregressive end-to-end model tuned for accuracy and efficient deployment. The toolkit also integrates third-party models such as Whisper-large-v3 and the audio-text Qwen-Audio family. FunASR provides runtime packages for offline file transcription and real-time transcription, including CPU and GPU variants and a Windows SDK.
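To make the voice activity detection step concrete, here is a minimal sketch of the general idea using a plain energy threshold. This is not FunASR's detector, which is a trained model; the function names, the 20 ms frame size, and the threshold value are all illustrative assumptions.

```python
import math


def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energies.append(sum(x * x for x in frame) / frame_len)
    return energies


def detect_speech(samples, sample_rate, frame_ms=20, threshold=0.01):
    """Return (start_sec, end_sec) spans whose frame energy exceeds threshold.

    Toy energy-based detector (an assumption for illustration only).
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    energies = frame_energies(samples, frame_len)
    segments = []
    seg_start = None  # index of the first active frame in the open segment
    for i, energy in enumerate(energies):
        active = energy > threshold
        if active and seg_start is None:
            seg_start = i
        elif not active and seg_start is not None:
            segments.append((seg_start * frame_len / sample_rate,
                             i * frame_len / sample_rate))
            seg_start = None
    if seg_start is not None:  # signal ended while still active
        segments.append((seg_start * frame_len / sample_rate,
                         len(energies) * frame_len / sample_rate))
    return segments


# Synthetic input: 1 s of silence, 1 s of a 440 Hz tone, 1 s of silence.
rate = 16000
silence = [0.0] * rate
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
segments = detect_speech(silence + tone + silence, rate)
print(segments)  # [(1.0, 2.0)]
```

Only the detected span would then be passed on to the recognizer, which is how skipping silence saves work on long recordings.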

Someone would reach for FunASR when they need to add transcription or voice analytics to a product, or when they want a starting point for research that involves fine-tuning a strong baseline. The project is written in Python, and its topics list and changelog show that it builds on the PyTorch ecosystem.

Where it fits

Transcribe audio or video files to text with automatic punctuation using a pretrained Paraformer model.
Run speaker diarization on a meeting recording to separate and label who said what and when.
Fine-tune a speech recognition model on your own audio dataset using the PyTorch training pipeline.
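The first use case above might look roughly like the sketch below. It assumes the `AutoModel` interface and the model aliases `paraformer-zh`, `fsmn-vad`, and `ct-punc` as documented in the project's README; none of these appear in this summary, so check them against the installed version. The helper names `pipeline_kwargs` and `transcribe` are hypothetical.

```python
import sys


def pipeline_kwargs(asr="paraformer-zh", vad="fsmn-vad", punc="ct-punc"):
    """Keyword arguments for funasr.AutoModel.

    The default model aliases are assumptions taken from the project's
    README, not from this summary.
    """
    return {"model": asr, "vad_model": vad, "punc_model": punc}


def transcribe(audio_path, **overrides):
    """Transcribe one file with VAD and punctuation restoration enabled."""
    # Imported lazily so this module still loads where funasr is not installed.
    from funasr import AutoModel

    model = AutoModel(**pipeline_kwargs(**overrides))
    results = model.generate(input=audio_path)
    # Assumed result shape: a list of dicts, each with a "text" field.
    return " ".join(item["text"] for item in results)


if __name__ == "__main__" and len(sys.argv) > 1:
    print(transcribe(sys.argv[1]))
```

The first call downloads the pretrained weights, so it needs network access and may take a while.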
