PaddleNLP

Python ★ 13k updated 29d ago

Easy-to-use and powerful LLM and SLM library with awesome model zoo.

A Python library from Baidu for training, fine-tuning, and deploying large language models like LLaMA, Qwen, and DeepSeek, with support for Chinese-made AI chips alongside Nvidia GPUs.

Python · PaddlePaddle · CUDA · Nvidia GPU · Ascend NPU · Kunlun XPU · setup: hard · complexity 5/5

PaddleNLP is a Python library for building, training, and running large language models. It comes from Baidu's PaddlePaddle AI team and aims to make modern language models practical for real-world applications. The README is written primarily in Chinese, reflecting its origin and primary user community.

The library covers the full workflow: pre-training a model from scratch, fine-tuning an existing model on your own data, compressing a model so it runs faster or on smaller hardware, and deploying it for production use. It supports popular open model families including LLaMA, Qwen, DeepSeek, Mistral, Baichuan, ChatGLM, Gemma, and others. Recent updates added support for Qwen3 and DeepSeek-R1, including quantized DeepSeek-R1 inference that can reach over 2,100 tokens per second on a single machine.

One notable feature is multi-hardware support: the library works across Nvidia GPUs as well as several Chinese-made chips (Kunlun XPU, Ascend NPU, Hygon DCU, and others), with a consistent interface that lets you switch hardware without rewriting your code. This is particularly relevant for teams in China who may not have access to Nvidia hardware or may not want to depend on it.

For fine-tuning, it includes an efficient training pipeline with FlashMask, a custom attention operator that reduces wasted computation on padded sequences. Checkpoints can be saved and restored quickly, with a compression feature that cuts storage space by about 78 percent. There is also a model merging tool called MergeKit to combine weights from multiple fine-tuned versions.
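To give a sense of why padding matters, here is a rough back-of-the-envelope sketch. It does not show how FlashMask works internally. It only assumes plain dense self-attention, where each sequence padded to the batch maximum costs a full `max_len x max_len` score matrix. The function name `attention_padding_waste` and the example lengths are hypothetical, chosen for illustration.

```python
def attention_padding_waste(lengths):
    """Fraction of attention-score computations that involve padding when
    every sequence is padded to the longest length in the batch.

    Assumption: dense self-attention, cost proportional to length squared.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    max_len = max(lengths)
    if max_len <= 0:
        raise ValueError("at least one sequence must be non-empty")
    # Padded batch: one max_len x max_len score matrix per sequence.
    total = len(lengths) * max_len * max_len
    # Useful work: only the n x n block of real tokens in each sequence.
    useful = sum(n * n for n in lengths)
    return 1 - useful / total


# Hypothetical batch with one long and two short sequences.
waste = attention_padding_waste([128, 512, 2048])
print(f"{waste:.1%} of attention scores involve padding")  # 64.5%
```

Under these assumptions, a batch with widely varying lengths spends most of its attention compute on padding, which is the kind of waste an attention operator that skips padded positions is meant to avoid.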


Where it fits

Fine-tune a LLaMA, Qwen, or DeepSeek model on your own dataset using PaddleNLP's efficient training pipeline.
Run quantized DeepSeek-R1 inference at over 2,100 tokens per second on a single machine.
Train and deploy models on Chinese-made AI chips like Kunlun XPU or Ascend NPU using the same code as Nvidia GPUs.
Merge weights from multiple fine-tuned model versions into a single model using the built-in MergeKit tool.
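As an illustration of what merging weights can mean in its simplest form, the sketch below takes a weighted average of matching parameters across checkpoints. This is not the MergeKit API, and the summary does not say which merge methods the tool implements. The function `merge_weights`, the parameter name `linear.weight`, and the use of plain Python lists in place of tensors are all assumptions made for the example.

```python
def merge_weights(checkpoints, coefficients=None):
    """Linearly combine checkpoints that share the same parameter names.

    checkpoints: list of dicts mapping parameter name -> list of floats
                 (a stand-in for real tensors).
    coefficients: one weight per checkpoint; defaults to a uniform average.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    if coefficients is None:
        coefficients = [1.0 / len(checkpoints)] * len(checkpoints)
    if len(coefficients) != len(checkpoints):
        raise ValueError("one coefficient per checkpoint is required")

    names = checkpoints[0].keys()
    for ckpt in checkpoints[1:]:
        if ckpt.keys() != names:
            raise ValueError("checkpoints must have identical parameter names")

    merged = {}
    for name in names:
        tensors = [ckpt[name] for ckpt in checkpoints]
        size = len(tensors[0])
        if any(len(t) != size for t in tensors):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Element-wise weighted sum across all checkpoints.
        merged[name] = [
            sum(c * t[i] for c, t in zip(coefficients, tensors))
            for i in range(size)
        ]
    return merged


# Two hypothetical fine-tuned variants of the same tiny model.
model_a = {"linear.weight": [1.0, 2.0]}
model_b = {"linear.weight": [3.0, 6.0]}
print(merge_weights([model_a, model_b]))  # {'linear.weight': [2.0, 4.0]}
```

Passing explicit coefficients, for example `merge_weights([model_a, model_b], [0.25, 0.75])`, biases the result toward one variant. Real merging tools typically offer more elaborate strategies than this plain average.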
