Mega-ASR

Python · ★ 978 · updated 9d ago

First foundation ASR built for the real world - 7 atomic acoustic conditions, 54 compound scenarios, 2.6M samples, and up to ~30% gains over SOTA where every other model falls apart. **You'll come back to MEGA-ASR, after the rest fail in the wild. ⭐**

Tsinghua speech recognition foundation model tuned for noisy, far-field, in-the-wild audio, claiming up to 30 percent lower WER than Whisper and Qwen3-ASR on hard clips.

Python · PyTorch · Hugging Face · setup: hard · complexity 4/5

MEGA-ASR is a speech recognition model from a group at Tsinghua University. It is aimed at transcribing audio captured in messy real-world conditions rather than the clean studio recordings that most speech models are tested on. The README frames it as a foundation model for what the authors call in-the-wild speech recognition: audio with background noise, far-field microphones, obstructions, echoes and reverberation, recording artifacts, electronic distortion, and transmission dropouts. The training set is described as 2.6 million samples covering 7 atomic acoustic conditions and 54 compound scenarios in which those conditions stack on top of each other. The authors report gains of up to roughly 30 percent over leading open-source and closed-source models on these harder cases.

Two training techniques are named in the README: A2S-SFT for supervised fine-tuning, and a reinforcement learning step called DG-WGPO. The README does not explain what those acronyms stand for or how they work in detail, so a non-technical reader can treat them simply as labels for the training recipes.

Most of the README is a side-by-side comparison table in which short audio clips are transcribed by MEGA-ASR and by other systems, including Qwen3-ASR, Gemini-3-Pro, Seed-ASR, and Whisper. Each row shows the ground-truth text, each model's transcription, and a Word Error Rate (WER) score. In the examples shown, MEGA-ASR produces lower error rates on the hard clips, while the other systems often return empty output, hallucinate unrelated text, or drop large portions of the sentence.

The project links out to a technical report on arXiv, the Voices-in-the-Wild-2M training dataset on Hugging Face, the model weights on Hugging Face, a separate benchmark repository called Voices-in-the-Wild-Bench, and a project page. The README in this repository is mostly a marketing-style introduction plus the comparison samples.
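For readers unfamiliar with the Word Error Rate score shown in the comparison rows, the sketch below computes the standard definition: the word-level edit distance (substitutions, deletions, insertions) between the ground truth and the transcription, divided by the number of ground-truth words. It is an illustration only, not the project's scoring script. The function names are hypothetical, and the lowercasing and whitespace tokenization are assumptions; the normalization actually used for the README's scores is not described here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level edit distance / number of reference words.

    Assumption: text is lowercased and split on whitespace. Real benchmarks
    usually apply heavier normalization (punctuation, numerals, and so on).
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (rolling-row Levenshtein).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, candidate_wer: float) -> float:
    """Fraction by which candidate_wer improves on baseline_wer (hypothetical helper)."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - candidate_wer) / baseline_wer


if __name__ == "__main__":
    truth = "the cat sat on the mat"
    print(word_error_rate(truth, "the cat sat on mat"))   # one dropped word
    print(word_error_rate(truth, ""))                     # empty output
    print(word_error_rate(truth, "thanks for watching please subscribe to my channel"))
```

The failure modes described above map onto this metric directly: an empty transcription scores a WER of 1.0 because every reference word counts as a deletion, and a hallucinated transcription can score above 1.0 when it is longer than the ground truth. The second helper shows one way a "30 percent" figure could be read, as a relative reduction in WER; whether the README's number is relative or absolute is not stated in this summary.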

Where it fits

Transcribe noisy real-world audio with strong background interference
Benchmark in-the-wild speech recognition against Whisper and Qwen3-ASR
Fine-tune an ASR foundation model on custom acoustic conditions
Research A2S-SFT and DG-WGPO training recipes
