ASRT_SpeechRecognition

Python · ★ 8.4k · updated 2mo ago

A Deep-Learning-Based Chinese Speech Recognition System

A research-grade Chinese speech recognition system that converts spoken Mandarin audio into written text using deep learning, runnable as an API server so other apps can send audio and receive transcriptions.

Python · TensorFlow · Docker · HTTP · gRPC · setup: hard · complexity 5/5

ASRT is a Chinese speech recognition system built with deep learning in Python and TensorFlow. It takes a short audio clip of spoken Mandarin Chinese and converts it to written text. The project is research-focused: it is intended for people who want to study or build on Chinese automatic speech recognition, not for drop-in commercial use.

The system works in two stages. The first stage is an acoustic model that takes an audio file and produces a sequence of Chinese phonetic symbols (pinyin). This model uses deep convolutional neural networks combined with connectionist temporal classification (CTC), a technique that handles the fact that audio frames and output symbols do not line up one to one. The second stage is a language model that takes the pinyin sequence and converts it to actual Chinese characters using a statistical, probability-based approach.
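The two stages can be illustrated with a toy sketch. This is not ASRT's code: the vocabulary, the candidate characters, and every probability below are made up for illustration, and the functions `ctc_greedy_decode` and `pinyin_to_text` are hypothetical names. The first function shows the standard greedy CTC rule (take the best symbol per frame, merge repeats, drop blanks); the second picks the most probable character sequence from per-pinyin candidates and bigram scores, which is one simple way to realize a statistical pinyin-to-character step.

```python
BLANK = 0  # index reserved for the CTC blank symbol (assumed convention)

# Hypothetical pinyin vocabulary; index 0 is the blank.
VOCAB = ["_", "ni3", "hao3"]

# Hypothetical candidate characters per pinyin, with made-up probabilities.
CANDIDATES = {
    "ni3": {"你": 0.9, "拟": 0.1},
    "hao3": {"好": 0.95, "郝": 0.05},
}
# Hypothetical bigram probabilities; unseen pairs fall back to a floor value.
BIGRAMS = {("你", "好"): 0.8}


def ctc_greedy_decode(frame_probs):
    """Stage 1 (toy): best symbol per frame, merge repeats, drop blanks."""
    best = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    decoded = []
    prev = None
    for idx in best:
        if idx != prev and idx != BLANK:
            decoded.append(VOCAB[idx])
        prev = idx
    return decoded


def pinyin_to_text(pinyins, floor=0.01):
    """Stage 2 (toy): Viterbi-style search over candidate characters."""
    # paths maps the last character to (probability, text so far).
    paths = {"": (1.0, "")}
    for py in pinyins:
        new_paths = {}
        for ch, p_ch in CANDIDATES[py].items():
            new_paths[ch] = max(
                (
                    prob * p_ch * (BIGRAMS.get((last, ch), floor) if last else 1.0),
                    text + ch,
                )
                for last, (prob, text) in paths.items()
            )
        paths = new_paths
    return max(paths.values())[1]


# Six audio frames, each a distribution over [blank, ni3, hao3].
frames = [
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],
    [0.90, 0.05, 0.05],
    [0.10, 0.10, 0.80],
    [0.20, 0.10, 0.70],
    [0.90, 0.05, 0.05],
]
pinyin = ctc_greedy_decode(frames)  # ["ni3", "hao3"]
print(pinyin_to_text(pinyin))       # prints 你好
```

Six frames collapse to two pinyin symbols, which is the alignment problem CTC is meant to absorb. A real system would use beam search and a far larger vocabulary and language model.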

To train the system from scratch you need a reasonably powerful machine: at least a 4-core CPU, 16 GB of RAM, and an NVIDIA GPU with 11 GB or more of graphics memory. Training uses publicly available Chinese speech datasets, and the project lists six datasets totaling over a thousand hours of audio. The best-performing version of the acoustic model achieves around 85% accuracy on phonetic recognition when tested against held-out data.
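The summary does not say how the accuracy figure is computed. A common convention for this kind of metric, assumed here, is one minus the edit distance between the predicted and reference pinyin sequences, divided by the reference length. The sketch below shows that calculation; `pinyin_accuracy` is a hypothetical helper, not a function from the project.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two symbol sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def pinyin_accuracy(ref, hyp):
    """Accuracy as 1 - (edit distance / reference length); an assumed metric."""
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return 1.0 - edit_distance(ref, hyp) / len(ref)


reference = ["ni3", "hao3", "shi4", "jie4"]
predicted = ["ni3", "hao3", "si4", "jie4"]  # one substitution error
print(pinyin_accuracy(reference, predicted))  # prints 0.75
```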

Once trained, the system can be run as an API server over HTTP or gRPC, so other software can send audio data and receive transcribed text. The project provides separate client SDKs for Windows, Python, Go, and Java to make calling the server straightforward. If you do not want to train anything yourself, pre-trained model files are included in the downloadable release packages.
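As a rough sketch of what calling the HTTP server from Python could look like without the SDK: the address, port, path, and JSON field names below are assumptions for illustration and should be checked against the project's own API documentation. The audio is sent as base64-encoded raw PCM in a JSON body, using only the standard library.

```python
import base64
import json
import urllib.request

# Assumed server address and path; not confirmed by this summary.
SERVER_URL = "http://127.0.0.1:20001/all"


def build_request(pcm_bytes, sample_rate=16000, channels=1, byte_width=2,
                  url=SERVER_URL):
    """Build a JSON POST request carrying base64-encoded PCM audio.

    The field names are assumed; adjust them to match the real API.
    """
    payload = {
        "channels": channels,
        "sample_rate": sample_rate,
        "byte_width": byte_width,
        "samples": base64.b64encode(pcm_bytes).decode("ascii"),
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def transcribe(pcm_bytes, timeout=10.0):
    """Send the audio to the server and return the parsed JSON response."""
    request = build_request(pcm_bytes)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `transcribe(pcm_bytes)` requires a running server. `build_request` is kept separate so the payload can be inspected or tested offline.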

A Docker image is available for running the API server without manual setup, though training still requires a suitable GPU environment. The project is licensed under GPL v3.0.

Where it fits

Study how a two-stage acoustic-plus-language-model pipeline for Mandarin Chinese speech recognition is built and trained.
Run the pre-trained model as an API server and send audio files to it from your own app to get back Chinese text transcriptions.
Train the acoustic model from scratch using publicly available Chinese speech datasets totaling over a thousand hours of audio.
Build a Mandarin transcription feature into your own project using the provided client SDKs for Python, Go, Java, or Windows.
