qlora
QLoRA: Efficient Finetuning of Quantized LLMs
QLoRA is a research technique and codebase that lets you fine-tune very large AI language models on a single consumer GPU by compressing the model to 4-bit precision and training only small adapter modules attached to the frozen model.
Developed at the University of Washington, QLoRA lets you customize (or "fine-tune") very large AI language models on hardware that would normally be far too small to handle them. Language models are software systems trained on huge amounts of text that can answer questions, summarize content, write code, and more. Fine-tuning means taking one of these already-trained models and training it further on a smaller dataset of your choosing so that it behaves differently.
The core problem QLoRA addresses is that large models require enormous amounts of GPU memory to train. A model with 65 billion parameters takes roughly 130 GB just to store its weights at 16-bit precision, so training it would normally need multiple high-end GPUs working together. QLoRA shrinks the model's memory footprint by compressing its stored numbers from 16-bit values down to 4-bit values, a process called quantization. The compressed weights are frozen and never updated. Instead, QLoRA attaches small trainable modules called Low-Rank Adapters (LoRA) to the compressed model and trains only those modules rather than the entire model. The result is that fine-tuning a 65B-parameter model fits on a single GPU with 48 gigabytes of memory, and the authors report that the fine-tuned model performs comparably to one produced by full 16-bit fine-tuning.
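The memory savings and the adapter idea can be illustrated with some rough arithmetic. The sketch below is not code from the repository: the layer width (8192), the rank (64), and all function names are assumptions chosen for illustration, and the memory figures cover weight storage only (no activations, gradients, optimizer state, or quantization constants).

```python
# Back-of-the-envelope sketch of weight memory and LoRA adapter size.
# Layer sizes, rank, and function names are illustrative assumptions.

def weight_memory_gb(num_params, bits_per_param):
    """Memory needed just to store the weights, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

def lora_param_count(d_in, d_out, rank):
    """Trainable parameters in one adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * d_in + d_out * rank

def matvec(matrix, vector):
    """Plain-Python matrix-vector product."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, scale=1.0):
    """y = W x + scale * B (A x). W stays frozen; only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(base, update)]

# Weight storage for a 65B-parameter model at two precisions.
params_65b = 65_000_000_000
fp16_gb = weight_memory_gb(params_65b, 16)
int4_gb = weight_memory_gb(params_65b, 4)

# One hypothetical 8192 x 8192 layer: full matrix vs. a rank-64 adapter.
full_layer = 8192 * 8192
adapter = lora_param_count(8192, 8192, rank=64)
adapter_fraction = adapter / full_layer

# Toy forward pass. With B initialised to zero (the usual LoRA starting point),
# the adapter leaves the frozen layer's output unchanged.
W = [[1, 0], [0, 1]]          # frozen 2 x 2 weight
A = [[1, 1]]                  # rank-1 down-projection (1 x 2)
B_zero = [[0], [0]]           # up-projection (2 x 1), zero-initialised
B_trained = [[0.5], [-1]]     # made-up values standing in for a trained B
x = [2, 3]
y_start = lora_forward(W, A, B_zero, x)
y_after = lora_forward(W, A, B_trained, x)

print(fp16_gb, int4_gb, adapter_fraction, y_start, y_after)
```

Under these assumptions the 4-bit weights take a quarter of the 16-bit storage, and the adapter for a single layer holds under 2% as many parameters as the layer itself, which is why only the adapters need gradients and optimizer state.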
The repository also includes Guanaco, a family of chatbot models that the authors produced using QLoRA on the OpenAssistant dataset. The README reports that Guanaco 65B reached 99.3% of ChatGPT's performance on the Vicuna benchmark after 24 hours of fine-tuning on one GPU. Those models are available separately on Hugging Face.
The code integrates with widely used tools from Hugging Face, a popular platform for AI model hosting and training utilities. Installation requires Python, PyTorch, and a few supporting libraries. The repository includes example scripts, Jupyter notebooks for running experiments in Google Colab, and configuration options for single-GPU and multi-GPU setups. The codebase is released under the MIT license, though the Guanaco models inherit restrictions from the underlying LLaMA models they were built on.
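As a rough idea of what the Hugging Face integration looks like, here is a minimal sketch that loads a base model in 4-bit and attaches adapters using the `transformers` and `peft` libraries (with `bitsandbytes` installed as the 4-bit backend). It is not the repository's own training script. The helper name `load_for_qlora`, the two settings dictionaries, and the specific values (NF4 quantization type, double quantization, rank 64, alpha 16, dropout 0.05) are assumptions for illustration and are not taken from the text above. The heavy imports sit inside the function so the settings can be inspected on a machine without a GPU.

```python
# Hypothetical sketch: 4-bit loading plus LoRA adapters via transformers + peft.
# Names and hyperparameter values here are illustrative assumptions.

QUANT_KWARGS = {
    "load_in_4bit": True,               # store base weights in 4-bit
    "bnb_4bit_quant_type": "nf4",       # assumed quantization data type
    "bnb_4bit_use_double_quant": True,  # also quantize the quantization constants
}

LORA_KWARGS = {
    "r": 64,                 # adapter rank (assumed)
    "lora_alpha": 16,        # adapter scaling factor (assumed)
    "lora_dropout": 0.05,    # assumed
    "bias": "none",
    "task_type": "CAUSAL_LM",
}

def load_for_qlora(model_name, target_modules=None):
    """Load `model_name` in 4-bit and wrap it with trainable LoRA adapters.

    `target_modules` names the layers that receive adapters; if None, peft
    falls back to its per-architecture defaults where those exist.
    """
    # Deferred imports: these need torch, transformers, peft, and bitsandbytes.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    quant_config = BitsAndBytesConfig(
        bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16, store in 4-bit
        **QUANT_KWARGS,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quant_config,
        device_map="auto",
    )
    # Prepares the quantized model for training (e.g. casts some layers to fp32).
    model = prepare_model_for_kbit_training(model)
    lora_config = LoraConfig(target_modules=target_modules, **LORA_KWARGS)
    # The returned model has frozen base weights and trainable adapters only.
    return get_peft_model(model, lora_config)
```

Calling `load_for_qlora` with a model identifier you have access to returns a model that can be passed to a standard training loop. The repository's example scripts remain the authoritative reference for the exact options the authors used.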
Where it fits
- Fine-tune a 65B-parameter language model on your own dataset using a single 48GB GPU
- Train a custom chatbot on domain-specific text without access to expensive multi-GPU clusters
- Run QLoRA fine-tuning experiments in Google Colab using the included Jupyter notebooks
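- Use the Guanaco models as a starting point for building a chat assistant (the README reports Guanaco 65B reaching 99.3% of ChatGPT's benchmark performance), subject to the license restrictions inherited from LLaMA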