x-transformers
A concise but complete full-attention transformer with a set of promising experimental features from various papers
x-transformers is a Python library that lets researchers and developers build transformer neural networks, the type of model behind tools like ChatGPT and many image recognition systems. The library is published on PyPI and installs with a single command (pip install x-transformers), so getting started requires no complex setup. It supports three common model shapes: a paired encoder and decoder (useful for tasks like translation), a decoder-only design similar to GPT, and an encoder-only design similar to BERT.
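The practical difference between the encoder-only and decoder-only shapes is which positions each token is allowed to attend to. The following is a minimal plain-Python sketch of that idea, not the library's own code; `attention_mask` is a hypothetical helper introduced only for illustration.

```python
def attention_mask(seq_len, causal):
    """Return a seq_len x seq_len grid; True means query i may attend to key j."""
    return [[(j <= i) if causal else True for j in range(seq_len)]
            for i in range(seq_len)]

# Encoder-only (BERT-like): every token sees every other token.
encoder_mask = attention_mask(4, causal=False)

# Decoder-only (GPT-like): token i sees only positions 0..i.
decoder_mask = attention_mask(4, causal=True)

for row in decoder_mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In an encoder-decoder model both patterns appear: the encoder attends bidirectionally over the source, while the decoder attends causally over its own output and additionally cross-attends to the encoder's result.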
Beyond the standard shapes, the library includes support for vision transformers, which process images by breaking them into small patches and feeding those patches through attention layers. This makes it possible to build image classifiers or image-to-caption systems within the same codebase. A pre-built configuration following the SimpleViT paper is included for image classification tasks.
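To make the patching step concrete, here is a small sketch under simplifying assumptions: a single-channel image stored as a list of rows, and a hypothetical `patchify` helper that is not part of x-transformers. A real vision transformer would also project each flattened patch to the model dimension and add positional information.

```python
def patchify(image, patch_size):
    """Split an H x W image (list of rows) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    assert height % patch_size == 0 and width % patch_size == 0
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Read the patch row by row and flatten it into one vector.
            patch = [image[top + dy][left + dx]
                     for dy in range(patch_size)
                     for dx in range(patch_size)]
            patches.append(patch)
    return patches

# A toy 4 x 4 "image" whose pixel values are 0..15.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
patches = patchify(image, patch_size=2)
print(len(patches), patches[0])  # 4 [0, 1, 4, 5]
```

Each flattened patch then plays the role that a token embedding plays in a text model, so the attention layers themselves need no changes.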
One of the most practically useful features is Flash Attention, a memory-saving technique for processing long sequences. Standard attention materializes a full score matrix, so its memory use grows quadratically with sequence length; Flash Attention computes the same result in tiles without ever storing that matrix, so memory grows only linearly with length while the computation also runs faster. Setting a single flag in the model configuration turns this on for anyone running PyTorch 2.0 or later.
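The tiling idea can be illustrated for a single query with a running ("online") softmax. This is a plain-Python sketch of the principle only, assuming unscaled dot-product scores and no masking; the real kernels are fused GPU code, and the function names here are hypothetical.

```python
import math

def naive_attention(query, keys, values):
    """Reference version: computes all scores at once, then one softmax."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]

def tiled_attention(query, keys, values, tile_size):
    """Same result, but only one tile of scores exists at any time."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile_size):
        tile_keys = keys[start:start + tile_size]
        tile_values = values[start:start + tile_size]
        scores = [sum(q * k for q, k in zip(query, key)) for key in tile_keys]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated so far to the new reference maximum.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, tile_values):
            w = math.exp(s - new_max)
            running_sum += w
            for d in range(len(acc)):
                acc[d] += w * v[d]
        running_max = new_max
    return [a / running_sum for a in acc]

query = [1.0, 0.5]
keys = [[0.1 * i, -0.2 * i] for i in range(7)]
values = [[float(i), float(i * i)] for i in range(7)]
reference = naive_attention(query, keys, values)
tiled = tiled_attention(query, keys, values, tile_size=3)
print(all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(reference, tiled)))
```

The tiled version keeps only a running maximum, a running normalizer, and an output accumulator per query, which is why the full score matrix never has to be held in memory.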
The library also experiments with persistent memory key-value pairs, a technique from research suggesting that a small set of learned memory slots added to each attention layer can take over the role of the standard feedforward network with comparable or better results on some tasks. Other optional features include several dropout configurations for regularizing training, and support for prepending image embeddings to text encoder inputs, which is the approach used by the PaLI vision-language model.
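Mechanically, persistent memory amounts to extra key-value pairs that are concatenated in front of the keys and values computed from the input, so every query can attend to them regardless of the sequence content. The sketch below shows this for one query in plain Python; the memory vectors are fixed constants here, whereas in a trained model they would be learned parameters, and both function names are hypothetical.

```python
import math

def attend(query, keys, values):
    """Single-query softmax attention; returns (weights, output)."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    output = [sum(w * v[d] for w, v in zip(weights, values))
              for d in range(len(values[0]))]
    return weights, output

def attend_with_memory(query, keys, values, mem_keys, mem_values):
    """Prepend input-independent memory slots, then attend as usual."""
    return attend(query, mem_keys + keys, mem_values + values)

# Assumed toy values: 2 memory slots and 3 input positions, dimension 2.
mem_keys = [[0.5, 0.5], [-0.5, 0.5]]
mem_values = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]

weights, output = attend_with_memory([1.0, 0.0], keys, values, mem_keys, mem_values)
print(len(weights))  # 5: two memory slots plus three input positions
```

Because the memory slots share the same softmax as the input positions, some of the attention mass is always routed to them.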
The project is aimed at AI researchers who want to test novel attention mechanisms and architectural ideas in a single, clean codebase rather than modifying large, complicated frameworks. The README is very code-heavy and assumes familiarity with PyTorch and neural network training.