optimum
🚀 Accelerate inference and training of 🤗 Transformers, Diffusers, TIMM and Sentence Transformers with easy to use hardware optimization tools
Optimum is a Python library from Hugging Face that makes AI models run faster and more efficiently on specific hardware. It extends the popular Transformers, Diffusers, and related libraries by adding tools to convert models into formats that specialized chips and runtimes can execute more quickly than they could with standard PyTorch alone.
The main use case is taking a model you have already trained (or downloaded from Hugging Face) and preparing it to run in production. Optimum can export models to ONNX, a widely used open format for exchanging AI models between software systems. Once in that format, the model can be run by ONNX Runtime, a fast execution engine that works on CPUs and GPUs. Optimum provides Python classes that handle the export and execution transparently, so you call the model the same way as before while ONNX Runtime, which is typically faster, does the work underneath.
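As a minimal sketch of the workflow described above, the function below exports a Hub checkpoint to ONNX and wraps it in a regular Transformers pipeline. It assumes the ONNX Runtime backend of Optimum and Transformers are installed. The helper name `load_onnx_classifier` is hypothetical, and text classification is only one of several supported tasks.

```python
def load_onnx_classifier(model_id: str):
    """Export a Hub checkpoint to ONNX and return a text-classification pipeline.

    Hypothetical helper for illustration. Imports are deferred so the function
    can be defined even when the ONNX Runtime backend is not installed.
    """
    from optimum.onnxruntime import ORTModelForSequenceClassification
    from transformers import AutoTokenizer, pipeline

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # export=True converts the PyTorch checkpoint to ONNX while loading it.
    model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)
    # The ONNX-backed model plugs into the usual Transformers pipeline API.
    return pipeline("text-classification", model=model, tokenizer=tokenizer)
```

Calling `load_onnx_classifier("distilbert-base-uncased-finetuned-sst-2-english")` should return an object that is used like any other Transformers pipeline. The first call downloads and converts the checkpoint, so it needs network access.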
Beyond ONNX, Optimum connects Hugging Face models to several specialized hardware backends. For Intel processors and accelerator cards, it integrates with OpenVINO. For Amazon Web Services cloud instances that use custom AI chips (Inferentia and Trainium), it provides equivalent support through the AWS Neuron SDK. Intel Gaudi accelerators (purpose-built AI training and inference cards) are also supported. Each backend has its own installation step, since the underlying hardware drivers and toolkits differ.
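Because each backend is installed separately, it can help to check which runtimes are present before choosing a code path. The sketch below does that with the standard library only. The import names in `BACKEND_MODULES` are assumptions about each toolkit's top-level Python module, not something Optimum defines, and should be checked against each backend's installation guide.

```python
from importlib.util import find_spec

# Assumed top-level module names for each backend's runtime (illustrative only).
BACKEND_MODULES = {
    "ONNX Runtime": "onnxruntime",
    "OpenVINO": "openvino",
    "AWS Neuron": "torch_neuronx",
    "Intel Gaudi": "habana_frameworks",
}


def is_installed(module_name: str) -> bool:
    """Return True if the module can be located without importing it."""
    try:
        return find_spec(module_name) is not None
    except (ImportError, ValueError):
        return False


def available_backends(modules=BACKEND_MODULES):
    """List the backend labels whose runtime module is installed."""
    return [label for label, module in modules.items() if is_installed(module)]


print("Installed backends:", available_backends())
```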
Optimum also supports quantization, which is the process of reducing the numerical precision of a model's weights. A quantized model uses less memory and usually runs faster, at the cost of a small reduction in accuracy. Several quantization backends are available, including Quanto, which is a PyTorch-native option.
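To make the trade-off concrete, here is a conceptual sketch of symmetric 8-bit quantization in plain Python. It is not how Optimum or Quanto implement quantization. The function names and sample weights are made up for illustration. Each weight is stored as an integer in [-127, 127] plus one shared scale factor, and the rounding error is the source of the accuracy loss.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One shared scale maps the largest magnitude onto 127.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [v * scale for v in quantized]


# Made-up sample weights for illustration.
weights = [0.12, -0.5, 0.33, 0.9, -0.07]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Rounding to the nearest step bounds the error by half a step (scale / 2).
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print("int8 values:", quantized)
print("max round-trip error:", max_error)
```

An 8-bit integer takes one byte where a 32-bit float takes four, which is where the memory saving comes from.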
This library is aimed at developers who have a working model and want to deploy it efficiently without rewriting their code from scratch for each hardware target.