AITemplate

Python ★ 4.7k updated 2mo ago

AITemplate is a Python framework that renders neural networks into high-performance CUDA/HIP C++ code. It is specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference.

A Python framework from Meta that compiles a trained AI model into highly optimized GPU code for fast inference, removing the runtime dependency on third-party libraries and runtimes such as cuBLAS or TensorRT.

Python · CUDA · ROCm · C++ · PyTorch · setup: hard · complexity 5/5

AITemplate (AIT) is a Python framework from Meta that takes a trained neural network model and compiles it into highly optimized C++ GPU code for fast inference. The idea is that rather than running a model through a general-purpose deep learning runtime, you generate a self-contained program specifically tuned for that model and the GPU hardware it will run on. The result is faster execution and no dependency on third-party libraries like cuBLAS or TensorRT at runtime.

It targets two GPU platforms: NVIDIA GPUs (via CUDA, with a focus on Ampere-generation cards and newer) and AMD GPUs (via ROCm/HIP, tested on the MI-210 and MI-250). The framework specializes in half-precision (FP16) arithmetic on the dedicated matrix-math units these GPUs provide: Tensor Cores on NVIDIA and Matrix Cores on AMD. The README reports performance results on models including ResNet, BERT, Vision Transformer, and Stable Diffusion.

A key part of the framework is operator fusion. Rather than executing each neural network operation one at a time, AIT merges sequences of operations into single GPU kernel calls, which reduces overhead and memory traffic. It supports horizontal fusion (merging parallel operations with different input sizes), vertical fusion (folding element-wise operations into matrix operations), and memory fusion (combining data rearrangement steps like splits and concatenations).
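To make vertical fusion concrete, here is a minimal conceptual sketch in plain Python. It is not AITemplate code: the function names are made up for illustration, and nested lists stand in for GPU tensors. The unfused version makes three passes over the data (matrix multiply, bias add, ReLU), while the fused version applies the element-wise steps as each output element is produced, which is roughly the idea behind folding element-wise operations into a matrix kernel.

```python
# Conceptual illustration only: plain Python lists stand in for GPU tensors,
# and none of these function names come from AITemplate.

def matmul(a, b):
    """Naive matrix product of a (m x k) and b (k x n)."""
    k, n = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(n)] for row in a]

def unfused(a, b, bias):
    """Three separate passes, each producing a full intermediate matrix."""
    y = matmul(a, b)                                             # pass 1: GEMM
    y = [[v + bias[j] for j, v in enumerate(row)] for row in y]  # pass 2: bias add
    return [[max(v, 0.0) for v in row] for row in y]             # pass 3: ReLU

def fused(a, b, bias):
    """One pass: bias add and ReLU run as each output element is computed."""
    k, n = len(b), len(b[0])
    out = []
    for row in a:
        out_row = []
        for j in range(n):
            acc = sum(row[p] * b[p][j] for p in range(k))
            out_row.append(max(acc + bias[j], 0.0))  # element-wise steps folded into the GEMM loop
        out.append(out_row)
    return out

a = [[1.0, -2.0], [3.0, 4.0]]
b = [[0.5, 1.0], [1.5, -1.0]]
bias = [0.25, -0.5]

# Both versions compute the same result; only the number of passes differs.
assert fused(a, b, bias) == unfused(a, b, bias)
```

On a GPU the benefit comes from skipping the intermediate matrices, which would otherwise be written to and read back from device memory between kernel launches.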

A companion tool called FX2AIT converts existing PyTorch models into AIT format. It handles partial conversion for models that include operations AIT does not yet support, keeping those unsupported parts running in PyTorch. The generated AIT runtime can accept PyTorch tensors directly as input without copying data.
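The partial-conversion idea can be sketched as splitting a model into runs of supported and unsupported operations. The snippet below is a simplified illustration under stated assumptions, not FX2AIT's actual implementation: the `SUPPORTED` set, the `partition` helper, and the `custom_nms` op name are all hypothetical, and a real model is a graph rather than a flat list of operation names.

```python
from itertools import groupby

# Hypothetical set of op names the compiled backend can handle; not AIT's real list.
SUPPORTED = {"linear", "relu", "layernorm", "softmax"}

def partition(ops, supported=SUPPORTED):
    """Group a flat op sequence into maximal runs, each assigned to one backend."""
    segments = []
    for is_supported, run in groupby(ops, key=lambda op: op in supported):
        backend = "ait" if is_supported else "pytorch"
        segments.append((backend, list(run)))
    return segments

# "custom_nms" is a made-up op standing in for something the compiler cannot handle.
model_ops = ["linear", "relu", "custom_nms", "linear", "softmax"]
for backend, run in partition(model_ops):
    print(backend, run)
```

Each "ait" segment would be compiled, while each "pytorch" segment keeps running in the original framework, with tensors handed across the boundaries between them.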

Installation requires Docker or a correctly matched CUDA or ROCm compiler. The project is under active development, with planned work on dynamic input shapes, int8 and fp8 quantization, and integration with PyTorch 2.

Where it fits

Compile a PyTorch Stable Diffusion model into an optimized GPU binary for much faster image generation on NVIDIA Ampere hardware.
Speed up BERT or Vision Transformer inference on AMD GPUs without writing custom CUDA kernels.
Convert an existing PyTorch model to AIT format using FX2AIT to benchmark the speed improvement before committing to a migration.
