gitmyhub

Flux2-klein-Lite

Python · ★ 18 · updated 26d ago

Int4-quantized FLUX.2-klein inference runtime (GemLite / fused Triton / eager backends)

A Python library that compresses the FLUX.2-klein image generation model using 4-bit quantization so it runs on GPUs with as little as 2.7 GB of memory instead of the usual 10 GB.

Python · PyTorch · CUDA · diffusers · Hugging Face · setup: moderate · complexity 3/5

Flux2-klein-Lite is a Python library that makes the FLUX.2-klein image generation model run on graphics cards with less memory than it would normally require. FLUX.2-klein is a 4-billion-parameter model that generates images from text descriptions. In its standard form it needs roughly 10 GB of GPU memory to run. This library stores the model's weights in a compressed 4-bit quantized form, reducing that requirement to about 2.7 GB and putting the model within reach of mid-range consumer GPUs.
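A rough way to see where the saving comes from is to count bytes for the weights alone. The sketch below is only an estimate: the group size of 64 and the 4 bytes of per-group metadata are illustrative assumptions, not values taken from Flux2-klein-Lite.

```python
# Back-of-the-envelope weight memory for a 4-billion-parameter model.
# GROUP_SIZE and META_BYTES are illustrative assumptions, not values
# taken from Flux2-klein-Lite.

PARAMS = 4_000_000_000   # parameter count stated in the description
GROUP_SIZE = 64          # assumed: weights that share one scale and zero point
META_BYTES = 4           # assumed: 16-bit scale + 16-bit zero point per group


def weight_bytes(params: int, bits: int, group_size: int = 0) -> float:
    """Bytes for the weights, plus per-group metadata when group_size > 0."""
    total = params * bits / 8
    if group_size:
        total += params / group_size * META_BYTES
    return total


fp16_gb = weight_bytes(PARAMS, 16) / 1e9
int4_gb = weight_bytes(PARAMS, 4, GROUP_SIZE) / 1e9
print(f"16-bit weights: {fp16_gb:.2f} GB")
print(f"4-bit weights:  {int4_gb:.2f} GB")
```

Under these assumptions the weights come to about 8 GB at 16 bits and about 2.25 GB at 4 bits. The gap to the reported 10 GB and 2.7 GB would plausibly be activations, any layers kept at higher precision, and runtime overhead, none of which this estimate models.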

The compression works by representing the model's learned values using 4 bits per number instead of the usual 16, packing roughly four times as many weights into the same memory space. The tradeoff is speed: computing with 4-bit weights is slower than computing with 16-bit weights, so this approach saves memory rather than time. The library is explicit about this: its purpose is running a large model on hardware that would otherwise be unable to load it, not making inference faster.
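To make the 4-bits-per-number idea concrete, here is a minimal pure-Python sketch of group-wise quantization with two 4-bit codes packed into each byte. It is a generic illustration of the technique, not code from Flux2-klein-Lite; the function names and the low-nibble-first layout are assumptions chosen for clarity.

```python
# Group-wise asymmetric 4-bit quantization with two codes packed per byte.
# A generic illustration; the names and layout are not from Flux2-klein-Lite.


def quantize_group(values):
    """Map floats to integer codes 0..15 plus a (scale, minimum) pair."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 for a constant group
    codes = [min(15, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def pack_nibbles(codes):
    """Pack an even-length list of 4-bit codes into bytes, low nibble first."""
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


def unpack_nibbles(packed):
    """Inverse of pack_nibbles."""
    codes = []
    for byte in packed:
        codes.extend((byte & 0x0F, byte >> 4))
    return codes


def dequantize_group(codes, scale, lo):
    """Recover approximate floats from the codes."""
    return [c * scale + lo for c in codes]


weights = [-0.8, -0.3, 0.0, 0.1, 0.25, 0.4, 0.7, 1.0]
codes, scale, lo = quantize_group(weights)
packed = pack_nibbles(codes)
restored = dequantize_group(unpack_nibbles(packed), scale, lo)

print(len(packed), "bytes for", len(weights), "weights")
print("max error:", max(abs(a - b) for a, b in zip(weights, restored)))
```

Eight weights occupy 4 bytes here, against 16 bytes at 16-bit precision, and each restored value lands within half a quantization step of the original. The unpack and rescale work is the extra cost a quantized runtime pays at compute time.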

Three different backends handle the compressed math. The default, gemlite, uses GPU kernel programs tuned for this kind of computation and is both the fastest and most memory-efficient of the three. A second backend called fused is included for environments where gemlite is not available. A third option called eager expands the weights back to 16-bit at load time, restoring full memory usage and serving as a speed baseline for comparison.
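The fallback between backends could be sketched as below. This is a hypothetical helper, not the library's API: `pick_backend` and `module_available` are invented names, and the mapping of gemlite to the `gemlite` module and fused to the `triton` module is an assumption based on the backend descriptions.

```python
# Hypothetical backend selection following the order described above.
# pick_backend and module_available are illustrative names; the real
# library may expose a different interface.
from importlib.util import find_spec


def module_available(name: str) -> bool:
    """True if a top-level module can be found in this environment."""
    return find_spec(name) is not None


def pick_backend(requested: str = "gemlite", available=module_available) -> str:
    """Resolve the requested backend, falling back from gemlite to fused."""
    if requested == "eager":
        # Explicit opt-in only: eager expands weights to 16-bit and
        # gives up the memory saving, so it is never chosen silently.
        return "eager"
    if requested not in ("gemlite", "fused"):
        raise ValueError(f"unknown backend: {requested!r}")
    # Assumption: gemlite needs the `gemlite` module, fused needs `triton`.
    if requested == "gemlite" and available("gemlite"):
        return "gemlite"
    if available("triton"):
        return "fused"
    raise RuntimeError("neither gemlite nor triton is available")


# Simulate an environment where only Triton is installed.
print(pick_backend(available=lambda name: name == "triton"))
```

Keeping eager out of the automatic fallback chain is a design choice in this sketch: silently switching to 16-bit weights could exhaust memory on exactly the GPUs the library targets.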

The repository includes an example script that generates images by plugging this library into the standard diffusers pipeline. With an additional option to also compress the text encoder (the part that reads your prompt), peak memory during image generation can drop to about 3.3 GB. Weights are downloaded automatically from Hugging Face if not provided locally.

The library is Python-based, requires a CUDA-capable GPU, and is licensed under MIT. A one-time tuning step at first load takes 60 to 90 seconds before inference begins.

Where it fits