airllm

Jupyter Notebook ★ 21k updated 2d ago

AirLLM 70B inference with single 4GB GPU

A Python package that runs 70-billion-parameter language models on a single small GPU by streaming model layers from disk one at a time, so you don't need a big multi-GPU server.

Python · PyTorch · Hugging Face · CUDA · setup: moderate · complexity 3/5

AirLLM is a Python package that lets you run very large language models on a modest GPU. A 70-billion-parameter model normally does not fit in the memory of a small graphics card, so people use big multi-GPU servers, pay for hosted APIs, or shrink the model with techniques such as quantization, which can hurt quality. AirLLM's pitch is that it can run a 70B model on a single 4GB GPU without quantization, distillation, or pruning, and the README adds that it can now run the 405B Llama 3.1 model on as little as 8GB of VRAM.

It works by reorganising how the model is held in memory. Instead of loading the whole model at once, AirLLM splits the model into its transformer layers, saves them to disk layer by layer, and then streams them through the GPU one at a time during inference, prefetching so that loading one layer overlaps with computing on the previous one. The 2.0 release added an optional block-wise quantization mode that compresses weights to 4-bit or 8-bit for up to a 3x speedup: because the bottleneck is loading layers from disk rather than arithmetic, smaller weights mean less data to read. Inference itself looks much like using an ordinary Hugging Face Transformers model: install the package with pip, call AutoModel.from_pretrained with a Hugging Face repo ID or a local path, tokenize an input, and call generate.
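The layer-streaming idea can be illustrated with a small, self-contained sketch. This is not AirLLM's code: the "layers" are toy affine maps on a single number, the on-disk format is JSON, and every name here (save_layers, load_layer, apply_layer, stream_forward) is hypothetical. It only shows the general pattern of keeping one layer per file and prefetching the next layer in a background thread while the current one is being computed.

```python
import json
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def save_layers(layers, directory):
    """Write each toy layer to its own file, mimicking a per-layer split."""
    paths = []
    for i, layer in enumerate(layers):
        path = Path(directory) / f"layer_{i:03d}.json"
        path.write_text(json.dumps(layer))
        paths.append(path)
    return paths


def load_layer(path):
    """Read one layer from disk (the slow step in a real system)."""
    return json.loads(Path(path).read_text())


def apply_layer(layer, x):
    """Toy stand-in for a transformer layer: an affine map on a scalar."""
    return layer["w"] * x + layer["b"]


def stream_forward(paths, x):
    """Run x through every layer while holding at most two layers in memory."""
    if not paths:
        return x
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(load_layer, paths[0])
        for i in range(len(paths)):
            layer = pending.result()  # wait for the current layer to arrive
            if i + 1 < len(paths):
                # Start loading the next layer before computing on this one.
                pending = pool.submit(load_layer, paths[i + 1])
            x = apply_layer(layer, x)  # runs while the prefetch is in flight
            del layer  # drop the reference so the layer can be freed
    return x


def reference_forward(layers, x):
    """Baseline: the same computation with all layers already in memory."""
    for layer in layers:
        x = apply_layer(layer, x)
    return x


toy_layers = [{"w": 2.0, "b": 1.0}, {"w": 0.5, "b": -3.0}, {"w": 4.0, "b": 0.0}]
with tempfile.TemporaryDirectory() as tmp:
    layer_paths = save_layers(toy_layers, tmp)
    streamed = stream_forward(layer_paths, 3.0)

print(streamed)  # 2.0
```

The streamed result matches the all-in-memory baseline, but peak memory is bounded by the current layer plus the one being prefetched, not by the whole model. The price is that every forward pass rereads every layer from disk, which is why disk throughput tends to dominate in this kind of design.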

You would reach for AirLLM when you want to experiment with a large open model locally but only have a single consumer GPU or a Mac with Apple Silicon. Examples include a 70B Llama variant or one of the other supported families, such as ChatGLM, Qwen, Baichuan, Mistral, or InternLM. The README also notes CPU inference and Mixtral support. The package is written in Python, distributed on PyPI as airllm, and licensed under Apache 2.0.

Where it fits

Run a 70B Llama model locally on a single consumer GPU with 4GB of VRAM without quantization or quality loss.
Experiment with large open-source models like Mistral or Qwen on a Mac with Apple Silicon.
Generate text from a 405B Llama 3.1 model on a machine with only 8GB of VRAM.
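
The memory figures above can be sanity-checked with rough arithmetic. The sketch below assumes 16-bit weights (2 bytes per parameter), 80 transformer layers for a 70B Llama-style model, and 126 layers for the 405B model. Those layer counts are assumptions about the model configurations, not numbers taken from the AirLLM README, and the helper names are hypothetical. The estimate also ignores embeddings, activations, and the KV cache, so real usage will be somewhat higher.

```python
BYTES_PER_FP16 = 2  # assumption: weights stored as 16-bit floats
GB = 1e9            # decimal gigabytes


def weight_gb(n_params, bytes_per_param=BYTES_PER_FP16):
    """Total size of the weights in GB."""
    return n_params * bytes_per_param / GB


def per_layer_gb(n_params, n_layers, bytes_per_param=BYTES_PER_FP16):
    """Approximate size of one layer if weights are spread evenly."""
    return weight_gb(n_params, bytes_per_param) / n_layers


total_70b = weight_gb(70e9)            # whole 70B model
layer_70b = per_layer_gb(70e9, 80)     # one of an assumed 80 layers
total_405b = weight_gb(405e9)          # whole 405B model
layer_405b = per_layer_gb(405e9, 126)  # one of an assumed 126 layers

print(f"70B:  {total_70b:.0f} GB total, {layer_70b:.2f} GB per layer")
print(f"405B: {total_405b:.0f} GB total, {layer_405b:.2f} GB per layer")
```

Under these assumptions the full 70B model needs about 140 GB for weights alone, far beyond a 4GB card, while a single layer is under 2 GB. The 405B case works out to roughly 6.4 GB per layer, which is consistent with the 8GB figure quoted for that model.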
