ChineseNLPCorpus

Python ★ 4.6k updated 2y ago

Chinese natural language processing datasets, material for everyday experiments. Additions via pull request are welcome.

A curated index of datasets for Chinese natural language processing research, covering reading comprehension, dialogue, text classification, sentiment analysis, named entity recognition, and more, with download links for each.

Python | setup: easy | complexity 1/5

This repository is a curated collection of datasets for Chinese natural language processing (NLP) research and experimentation. NLP is the field of computer science that teaches machines to read, understand, and work with human language. Because Chinese has its own grammar, writing system, and linguistic quirks, researchers need datasets built specifically for Chinese text rather than relying on English-language ones.

The collection is organized into several categories. One section covers reading comprehension, where a model reads a passage and answers questions about it. Datasets here include DuReader from Baidu, which contains 300,000 questions paired with 1.4 million documents, and CMRC 2018 from Harbin Institute of Technology. Another section covers task-oriented dialogue, meaning conversations where a user wants to accomplish something specific, like booking a car or getting a medical diagnosis. Examples include a medical diagnosis dataset from Fudan University built from real online doctor-patient exchanges, and several datasets from the annual SMP and NLPCC evaluation competitions.

There are also datasets for text classification, such as a Toutiao news dataset with 380,000 labeled short articles across 15 categories, and a Tsinghua news corpus covering topics like sports, finance, technology, and entertainment. Sentiment analysis datasets appear as well, covering hotel reviews, food delivery reviews, online shopping reviews across 10 product types, and labeled Weibo posts.

The project also indexes datasets for named entity recognition (identifying people, places, and organizations in text), text similarity, question answering, and knowledge graph tasks. Most entries include the dataset size, the institution that created it, links to the original papers, and download addresses.

This is a reference and index resource, not a software tool. A researcher or developer working on Chinese text AI would browse the tables, find the dataset that fits their task, and download it from the linked source. The README is the main artifact here, and the repository is open to pull requests that add new datasets to the index.

Where it fits

Find the right Chinese text dataset for an NLP task such as sentiment analysis, question answering, or entity recognition
Download labeled Chinese news articles, product reviews, or doctor-patient dialogues to train or fine-tune a language model
Start a Chinese text classification project using the Toutiao news dataset of 380,000 labeled short articles across 15 categories
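
As a rough illustration of the last use case, the sketch below parses a few labeled news headlines and counts the examples per category, which is a common first check before training a classifier. The file layout is an assumption, not something stated in this summary: it supposes one record per line with fields joined by a `_!_` separator in the order id, category code, category name, title, keywords. The sample lines, `parse_line`, and `load_records` are hypothetical names made up for this example; check the dataset's own documentation for the real format.

```python
from collections import Counter

# Hypothetical sample records. The "_!_" separator and the field order
# (id, category code, category name, title, keywords) are assumptions.
SAMPLE = """\
1001_!_101_!_news_culture_!_故宫博物院举办新展览_!_故宫,展览
1002_!_103_!_news_sports_!_国足公布最新集训名单_!_国足,集训
1003_!_103_!_news_sports_!_马拉松比赛周末开跑_!_马拉松
"""


def parse_line(line, sep="_!_"):
    """Split one record into a dict, or return None if it is malformed."""
    parts = line.rstrip("\n").split(sep)
    if len(parts) < 4:
        return None
    return {"id": parts[0], "label": parts[2], "title": parts[3]}


def load_records(text):
    """Parse every well-formed line of the given text into a record."""
    records = []
    for line in text.splitlines():
        rec = parse_line(line)
        if rec is not None:
            records.append(rec)
    return records


records = load_records(SAMPLE)

# Count examples per category to spot class imbalance early.
label_counts = Counter(rec["label"] for rec in records)
print(label_counts.most_common())
```

With a real download, the same functions could be applied to the file contents read with `open(path, encoding="utf-8")`; the titles and labels would then feed whatever tokenizer and classifier the project uses.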
