ktransformers

Python · ★ 17k · updated 2d ago

A Flexible Framework for Experiencing Heterogeneous LLM Inference/Fine-tune Optimizations

Research toolkit for running very large AI language models on modest hardware by offloading expert layers to CPU, letting you run models like DeepSeek-V3 without a fleet of expensive high-end GPUs.

Python · CUDA · C++ · pip · setup: hard · complexity 5/5

KTransformers is a research project for running and fine-tuning large language models efficiently by splitting the work between CPU and GPU. The core idea is that modern LLMs, especially Mixture-of-Experts (MoE) models, are too big to fit comfortably in GPU memory, so KTransformers offloads the expert layers to the CPU while keeping the hot path on the GPU. This lets people run very large models on smaller, cheaper hardware.
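To make the CPU/GPU split concrete, here is a minimal pure-Python sketch of top-k expert routing with a device placement map. It is a toy illustration of the general MoE offloading idea, not KTransformers code: `PLACEMENT`, `route`, `moe_forward`, the expert count, and the toy experts are all hypothetical names and values chosen for this example.

```python
import math
import random

# Hypothetical placement map: which device "owns" each component.
# These names are illustrative only; they are not KTransformers identifiers.
NUM_EXPERTS = 8
TOP_K = 2
PLACEMENT = {"router": "gpu", **{f"expert_{i}": "cpu" for i in range(NUM_EXPERTS)}}


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_logits, top_k=TOP_K):
    """Pick the top_k experts for one token and renormalize their weights."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_forward(token, router_logits, experts):
    """Combine the selected experts' outputs, recording where each step ran."""
    output = [0.0] * len(token)
    devices_used = [PLACEMENT["router"]]
    for idx, weight in route(router_logits):
        devices_used.append(PLACEMENT[f"expert_{idx}"])
        expert_out = experts[idx](token)
        output = [o + weight * e for o, e in zip(output, expert_out)]
    return output, devices_used


# Toy experts: expert i scales the token by (i + 1).
experts = [lambda t, s=i + 1: [s * x for x in t] for i in range(NUM_EXPERTS)]

random.seed(0)
token = [1.0, -2.0, 0.5]
logits = [random.gauss(0, 1) for _ in range(NUM_EXPERTS)]
out, devices = moe_forward(token, logits, experts)
active = len(devices) - 1
print(f"{active}/{NUM_EXPERTS} experts ran for this token, on: {devices[1:]}")
```

Only `TOP_K` of the `NUM_EXPERTS` experts run for any given token, which is why keeping the full set of expert weights in CPU memory can be workable while the always-active parts stay on the GPU.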

The project exposes two user-facing capabilities from its kt-kernel source tree: inference and SFT (supervised fine-tuning). On the inference side, kt-kernel provides CPU-optimized kernel operations using Intel AMX and AVX512/AVX2 instructions for INT4 and INT8 quantized models, NUMA-aware memory management for Mixture-of-Experts inference, and CPU-side quantized weights paired with GPU-side GPTQ support. It exposes a Python API for integration with SGLang and other serving frameworks. On the fine-tuning side, KTransformers integrates with LLaMA-Factory so users can fine-tune very large MoE models, such as DeepSeek-V3 and DeepSeek-R1, on limited GPU memory.
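For readers unfamiliar with INT4 weights, the sketch below shows one generic scheme: symmetric, group-wise quantization with two 4-bit codes packed per byte. It is an assumption-laden illustration of the general technique only; the actual weight layout, group size, and kernels used by kt-kernel are not described here, and `quantize_int4`, `dequantize_int4`, and `pack_nibbles` are hypothetical helper names.

```python
def quantize_int4(weights, group_size=4):
    """Symmetric per-group quantization to integer codes in [-8, 7].

    Generic illustration; group_size=4 is an arbitrary choice for the demo.
    """
    groups = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # One floating-point scale per group; avoid dividing by zero.
        scale = max_abs / 7 if max_abs > 0 else 1.0
        codes = [max(-8, min(7, round(w / scale))) for w in group]
        groups.append((scale, codes))
    return groups


def dequantize_int4(groups):
    """Recover approximate weights from (scale, codes) pairs."""
    return [scale * c for scale, codes in groups for c in codes]


def pack_nibbles(codes):
    """Pack two 4-bit codes per byte, low nibble first (two's complement)."""
    unsigned = [c & 0xF for c in codes]
    if len(unsigned) % 2:
        unsigned.append(0)  # pad to an even number of nibbles
    return bytes(lo | (hi << 4) for lo, hi in zip(unsigned[::2], unsigned[1::2]))


weights = [0.12, -0.50, 0.33, 0.07, 1.20, -0.90, 0.05, 0.60]
groups = quantize_int4(weights)
restored = dequantize_int4(groups)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
packed = pack_nibbles([c for _, codes in groups for c in codes])
print(f"{len(weights)} weights -> {len(packed)} bytes, max abs error {max_err:.4f}")
```

Each weight costs 4 bits plus a share of its group's scale, and the rounding error per weight is bounded by half that group's scale, which is the trade-off that lets large expert weights fit in ordinary CPU memory.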

You would use KTransformers if you want to serve or fine-tune cutting-edge open models on consumer or modest data-center hardware without paying for a fleet of high-end GPUs. The README lists supported models including DeepSeek-V3 and R1, Kimi-K2, GLM-5, Qwen3, MiniMax, and others. The codebase is Python with native kernels underneath, and installation is via pip from the kt-kernel directory.

Where it fits

Run DeepSeek-V3 or Qwen3 on a consumer PC with a single GPU by offloading model experts to CPU.
Fine-tune a large Mixture-of-Experts model on limited GPU memory using the LLaMA-Factory integration.
Serve a quantized INT4 language model locally using CPU AMX or AVX512 kernels for fast inference.
Use KTransformers as a backend for SGLang to serve large models at lower hardware cost.
