Chinese-BERT-wwm

Python ★ 10k updated 2mo ago

Pre-Training with Whole Word Masking for Chinese BERT (Chinese BERT-wwm model series)

Pre-trained Chinese BERT language models using Whole Word Masking, ready to load with two lines of Python via HuggingFace Transformers for tasks like text classification, named entity recognition, and question answering.

Python · PyTorch · HuggingFace Transformers · BERT · setup: moderate · complexity 3/5

This repository provides a set of pre-trained Chinese language models based on BERT, a Transformer-based model widely used for natural language processing. The core contribution is applying a training technique called Whole Word Masking (wwm) to Chinese text. Standard BERT pre-training hides randomly chosen tokens and trains the model to predict them; for Chinese, each token is a single character. Because Chinese is written without spaces between words, this can hide only part of a multi-character word. Whole Word Masking instead hides all characters of a word together, which pushes the model to learn word-level meaning rather than only character-level patterns.
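
The difference between the two masking schemes can be sketched in a few lines of Python. This is a simplified illustration, not the repository's training code: it assumes the sentence has already been split into words by some external segmenter, masks each unit with a fixed probability, and skips the other details of BERT's masking procedure. The function names and the sample sentence are illustrative.

```python
import random

MASK = "[MASK]"


def char_level_mask(words, mask_prob, rng):
    """Mask each character independently, so a word may be only partly hidden."""
    chars = [ch for word in words for ch in word]
    return [MASK if rng.random() < mask_prob else ch for ch in chars]


def whole_word_mask(words, mask_prob, rng):
    """Choose words at random and mask every character of each chosen word."""
    out = []
    for word in words:
        if rng.random() < mask_prob:
            out.extend([MASK] * len(word))
        else:
            out.extend(word)  # a str extends the list one character at a time
    return out


# Illustrative pre-segmented sentence, roughly:
# "use / language / model / to / predict / next / word"
words = ["使用", "语言", "模型", "来", "预测", "下一个", "词"]

rng = random.Random(0)
print(char_level_mask(words, 0.3, rng))
print(whole_word_mask(words, 0.3, rng))
```

In the second output every word is either fully intact or fully replaced by `[MASK]` tokens, whereas the first can leave half of a two-character word visible.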

The repository distributes several model variants: the base BERT-wwm model trained on Chinese Wikipedia; an extended version trained on a much larger corpus of 5.4 billion words drawn from Wikipedia, news, and question-answer sources; a larger RoBERTa-based version that uses the same masking technique with additional training improvements; and several smaller 3-layer and 4-layer versions for situations where a lighter model is needed.

All models are available to download from HuggingFace (where they can be loaded with two lines of Python using the Transformers library) or from Chinese cloud storage for users in mainland China. The download files include model weights, a configuration file, and a vocabulary list.
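
As a rough sketch of the loading step, the snippet below wraps the two `from_pretrained` calls in a helper. The model id `hfl/chinese-bert-wwm-ext` is an assumption about where the extended variant is published on the HuggingFace Hub; check the repository's model table for the id of the variant you want. The helper names `load_model` and `encode` are illustrative.

```python
# Assumed Hub id for the extended BERT-wwm variant; other variants use other ids.
MODEL_ID = "hfl/chinese-bert-wwm-ext"


def load_model(model_id=MODEL_ID):
    """Fetch the tokenizer and encoder for model_id (downloads on first use)."""
    # Imported here so the snippet can be read and imported without
    # transformers installed; a real script would import at the top.
    from transformers import BertModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(model_id)
    model = BertModel.from_pretrained(model_id)
    return tokenizer, model


def encode(text, tokenizer, model):
    """Return per-token hidden states for one sentence."""
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model(**inputs)
    return outputs.last_hidden_state  # shape: (1, sequence length, hidden size)
```

Calling `tokenizer, model = load_model()` and then `encode("今天天气很好", tokenizer, model)` should yield one hidden vector per token. The `Bert*` classes are used deliberately: as far as the upstream README indicates, the RoBERTa-based variants are also meant to be loaded with them.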

These models are intended as starting points for downstream tasks such as text classification, named entity recognition, question answering, and sentence similarity. A researcher or developer building a Chinese language application would load one of these models and then train it further on their own labeled data, rather than training a language model from scratch. The README also includes benchmark results across several standard Chinese NLP evaluation sets to show how each variant compares.
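
The load-then-fine-tune workflow might look like the following single training step for a two-class classifier. It is a minimal sketch under several assumptions: the Hub id `hfl/chinese-bert-wwm-ext`, a hypothetical positive/negative label set, and an untuned learning rate. A real training loop would build the model and optimizer once, iterate over many batches, and evaluate on held-out data.

```python
LABEL_IDS = {"negative": 0, "positive": 1}  # hypothetical label set


def encode_labels(names):
    """Map label names to the integer ids the classification head expects."""
    return [LABEL_IDS[name] for name in names]


def fine_tune_step(texts, label_names, model_id="hfl/chinese-bert-wwm-ext", lr=2e-5):
    """Run one gradient step of sequence classification on a small batch."""
    import torch
    from transformers import BertForSequenceClassification, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(model_id)
    # The encoder weights are pre-trained; the classification head on top
    # is newly initialised and is learned from the labeled data.
    model = BertForSequenceClassification.from_pretrained(
        model_id, num_labels=len(LABEL_IDS)
    )
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    labels = torch.tensor(encode_labels(label_names))

    model.train()
    outputs = model(**batch, labels=labels)  # loss is computed when labels are given
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```

For example, `fine_tune_step(["质量很好", "太差了"], ["positive", "negative"])` would download the weights, run one update, and return the batch loss.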

Where it fits

Fine-tune a Chinese text classifier for customer review sentiment analysis using the pre-trained weights.
Build a Chinese named entity recognition pipeline for news articles by fine-tuning on labeled data.
Set up a Chinese question-answering system for a search or support product with HuggingFace Transformers.
Use the lightweight 3-layer model variant for fast Chinese sentence similarity scoring in production.
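
For the sentence similarity use case, one common approach (an assumption here, not something the repository prescribes) is to average a model's per-token vectors into one sentence vector and compare sentences by cosine similarity. The sketch below shows only that scoring step on plain Python lists; in practice the token vectors would come from the encoder's last hidden state, and the helper names are illustrative.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real (non-padding) tokens into one sentence vector."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an all-zero vector as having no similarity
    return dot / (norm_a * norm_b)
```

Embeddings from a model that has not been fine-tuned for similarity may rank sentence pairs poorly, so this scoring is usually paired with task-specific fine-tuning.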
