Voices-in-the-Wild-Bench

Python ★ 24 updated 21d ago

Bilingual Chinese and English benchmark of 5000 noisy real and synthetic audio clips with a Python toolkit to score ASR models like Whisper and Canary.

Python · Whisper · NeMo · Transformers · setup: moderate · complexity 3/5

Voices-in-the-Wild-Bench is a benchmark dataset and a small evaluation toolkit for speech recognition systems. The goal is to measure how well speech and voice assistant systems hold up on audio that is messy in ways common in everyday life, rather than on the clean studio recordings that most academic benchmarks use. It covers both Chinese and English.

The benchmark contains 5,000 audio examples. Of these, 3,500 are synthetic speech with controlled perturbations and 1,500 are real recordings from sixteen human speakers. The split between Mandarin Chinese and English is even, with 2,500 samples each. Every clip is tagged with one of eight acoustic conditions: noise, far field, obstructed speech, distortion, recording artifacts, echo, dropout, and a mixed category that combines several conditions in one clip. The repository ships eight short example audio files, one per category, so that you can smoke-test your evaluation pipeline before downloading the full set from Hugging Face.

Each sample is stored as a JSONL record with an index, an audio path, an instruction, a reference answer, a subset label that encodes the source type, language, and acoustic condition, and an empty prediction field that the evaluated model fills in.
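As a rough illustration of that record flow, the sketch below reads a JSONL file, fills in the prediction field, and writes the result back out. The key names "audio_path" and "prediction" are assumptions made for this example, as is the transcribe callable; check the example records in the repository for the real field names.

```python
import json
from pathlib import Path


def load_records(path):
    """Read one JSON object per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def fill_predictions(records, transcribe):
    """Return copies of the records with the prediction field filled in.

    `transcribe` is a hypothetical callable that maps an audio path to text.
    The keys "audio_path" and "prediction" are assumed names, not confirmed
    by the repository.
    """
    filled = []
    for rec in records:
        out = dict(rec)  # leave the input record untouched
        out["prediction"] = transcribe(rec["audio_path"])
        filled.append(out)
    return filled


def save_records(records, path):
    """Write records as JSONL, keeping Chinese text readable (no \\u escapes)."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```

Copying each record rather than mutating it keeps the original file contents available for a second model run.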

The README documents how to score predictions and how to run models. Chinese audio is scored with character error rate (CER) and English audio with word error rate (WER). The evaluate_predictions.py script reports an overall score, a language-wise breakdown, and a real versus synthetic breakdown for each acoustic category. A second script, run_inference.py, runs the included model wrappers. The first public wrappers are Whisper-Large-v3 from OpenAI through the Transformers pipeline, Mega-ASR (described as the public name for the authors' own merged_v2 model), and Canary-1b-v2 through NVIDIA NeMo.
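For readers new to the two metrics, here is a minimal sketch of character error rate and word error rate built on plain Levenshtein distance. It is not the code in evaluate_predictions.py. The function names are made up for this example, and any text normalization the toolkit applies (casing, punctuation, number formatting) is not modeled here, so scores from this sketch may differ from the official ones.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_tokens, hyp_tokens):
    """Edit distance divided by the reference length."""
    if not ref_tokens:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_tokens, hyp_tokens) / len(ref_tokens)


def wer(reference, hypothesis):
    """Word error rate: tokens are whitespace-separated words."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference, hypothesis):
    """Character error rate: tokens are characters, spaces ignored."""
    return error_rate(list(reference.replace(" ", "")),
                      list(hypothesis.replace(" ", "")))


print(wer("the cat sat", "the bat sat"))  # one substitution out of three words
print(cer("你好世界", "你好世"))            # one deletion out of four characters
```

Scoring Chinese at the character level avoids depending on a word segmenter, which is the usual reason CER is preferred over WER for Mandarin.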

The repository links out to a leaderboard site, a paper, the dataset on Hugging Face, and an issues page for submitting new results. The release notes at the top of the README show that the initial skeleton went up on 2026-05-16, and reproducible evaluation utilities, example records, and the first two model wrappers were added two days later.

Where it fits

Benchmark a new ASR model against noisy real-world Chinese and English audio
Reproduce CER and WER scores for Whisper-Large-v3 and Canary-1b-v2
Submit results to the public leaderboard for speech recognition robustness
Add a custom model wrapper to evaluate it across eight acoustic conditions
