PIPO

Python · ★ 31 · updated 9d ago

Implementation of an efficient LLM architecture: the Pair-In / Pair-Out Model (PIPO)

A research project that speeds up AI text generation by compressing two input tokens into one and predicting an extra output token per step, cutting inference time without sacrificing accuracy.

Python · SGLang · ms-swift · Hugging Face · PyTorch · setup: hard · complexity 5/5

PIPO, short for Pair-In, Pair-Out, is a research project proposing a new approach to making large language model inference faster without giving up accuracy. It was developed jointly by Renmin University of China, Xiaohongshu, and other institutions, and the results are described in an arXiv paper.

The key idea combines two techniques that are normally developed separately: input compression and multi-token output. On the input side, the model compresses pairs of text tokens into a single internal representation, so it processes fewer units per step. On the output side, a secondary component predicts an extra token alongside each main prediction, so the model produces more text per forward pass. A small confidence module then decides whether each extra predicted token is reliable enough to keep, which removes the separate, expensive verification step that other approaches require.
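The two sides of the idea can be illustrated with a toy sketch in plain Python. It is not the repository's code: mean pooling stands in for the learned pair compressor, the confidence score is passed in as a number rather than computed by a module, and the names `compress_pairs`, `accept_tokens`, `decode`, and the `threshold` value are all hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def compress_pairs(embeddings: Sequence[Vector]) -> List[Vector]:
    """Pair-in: merge adjacent token embeddings two at a time.

    Assumption: mean pooling replaces the model's learned compressor.
    """
    merged: List[Vector] = []
    for i in range(0, len(embeddings), 2):
        pair = embeddings[i:i + 2]  # the final "pair" may hold one token
        dim = len(pair[0])
        merged.append([sum(v[d] for v in pair) / len(pair) for d in range(dim)])
    return merged


def accept_tokens(main_token: int, extra_token: int,
                  confidence: float, threshold: float = 0.9) -> List[int]:
    """Pair-out: always keep the main token; keep the extra token only
    when the confidence score clears the (hypothetical) threshold."""
    if confidence >= threshold:
        return [main_token, extra_token]
    return [main_token]


def decode(steps: Sequence[Tuple[int, int, float]],
           threshold: float = 0.9) -> List[int]:
    """Collect the tokens kept across a sequence of forward passes.

    Each step is (main_token, extra_token, confidence).
    """
    output: List[int] = []
    for main_token, extra_token, confidence in steps:
        output.extend(accept_tokens(main_token, extra_token, confidence, threshold))
    return output
```

In this sketch an odd-length input leaves the last token uncompressed, and a rejected extra token is simply dropped. A real decoder would predict that position again on the next step.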

Training follows two stages: a standard fine-tuning step using a teacher model's outputs, followed by a distillation phase in which a larger 9B-parameter model guides a smaller 4B model on math and coding problems. The code builds on two existing open-source projects: SGLang for inference and ms-swift for training.
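For readers unfamiliar with distillation, the sketch below shows one common form of the objective: the KL divergence from the teacher's next-token distribution to the student's. Whether PIPO uses exactly this loss is not stated here, so treat it as an assumption. The function names and the `temperature` parameter are hypothetical, and the logits are plain Python lists rather than tensors.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Convert logits to probabilities, subtracting the max for stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_kl(teacher_logits: Sequence[float],
                    student_logits: Sequence[float],
                    temperature: float = 1.0) -> float:
    """KL(teacher || student) for one token position.

    Assumption: forward KL over the full vocabulary. The loss is zero when
    the student matches the teacher and grows as the two diverge.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
```

During training, a loss like this would be averaged over token positions in the math and coding samples and minimized with respect to the 4B student's parameters only.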

The repository includes scripts to download checkpoints and datasets from Hugging Face, merge trained model weights, run evaluations on several reasoning benchmarks, and reproduce the training runs. Experiments reported in the paper used Qwen 3.5 models at 4B and 9B parameter sizes. At the time of writing, only Qwen 3.5 backbones are supported. Known limitations concern memory requirements during training and certain inference cache optimizations that are currently disabled in the PIPO inference path.

Where it fits

Reproduce the PIPO paper's speed benchmarks on Qwen 3.5 models using the provided evaluation scripts.
Fine-tune a smaller 4B model with PIPO-style distillation from a 9B teacher on your own math or coding dataset.
Run PIPO inference with SGLang to get faster token generation on a supported Qwen 3.5 checkpoint.
Download pre-trained PIPO checkpoints from Hugging Face and merge weights for evaluation.
