gitmyhub

LT2

Python ★ 40 updated 23d ago

Official Codebase: LT2: Linear-Time Looped Transformers.

Research code for a new language model architecture that replaces slow attention with faster alternatives and reuses layers in loops, achieving roughly 2.7x faster decoding than a standard baseline.

Python · PyTorch · CUDA · SLURM · Mamba2 · FlashAttention · setup: hard · complexity 5/5

LT2 is the research codebase accompanying an academic paper about a new language model architecture called Linear-Time Looped Transformers. The problem it addresses is a well-known inefficiency in standard transformer models: the attention mechanism, which lets the model weigh how relevant each token is to every other token, has a compute cost that grows quadratically with sequence length, and the memory it needs during decoding grows with every token generated. LT2 proposes replacing that attention step with alternatives that scale more efficiently with sequence length.
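To make the scaling argument concrete, here is a minimal arithmetic sketch. It is not taken from the LT2 repository; the function names are hypothetical, and the "fixed-state" count assumes one constant-cost update per token, which ignores the per-update constant.

```python
# Hypothetical illustration; these helpers do not exist in the LT2 codebase.

def full_attention_scores(n):
    """Query-key scores computed by causal full attention over n tokens."""
    # Token t is compared with tokens 1..t, so the total is 1 + 2 + ... + n.
    return n * (n + 1) // 2


def fixed_state_updates(n):
    """Updates made by a model that keeps one fixed-size memory state."""
    # Assumption: one constant-cost update per token.
    return n


for n in (1_000, 2_000, 4_000):
    print(n, full_attention_scores(n), fixed_state_updates(n))
```

Doubling the sequence length roughly quadruples the first count and only doubles the second, which is the gap the architecture aims to close.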

The "looped" part of the name refers to parameter sharing. Instead of having many distinct layers, each with its own learned weights, LT2 reuses the same set of layers multiple times in sequence. A model with 20 physical layers run through 4 loops has the effective depth of an 80-layer model but stores only 20 layers' worth of parameters; the compute per token is still that of 80 layer applications. This is a known technique, and the contribution here is applying it specifically to the faster attention alternatives.
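The 20-layer, 4-loop example can be sketched in a few lines. This is a toy illustration under stated assumptions, not the repository's implementation: each "layer" is a hypothetical scalar function standing in for a transformer block, and `make_layer` and `looped_forward` are made-up names.

```python
# Hypothetical illustration of layer looping; not code from the LT2 repository.

def make_layer(scale, shift):
    """Return a toy 'layer': a function that owns its fixed parameters."""
    def layer(x):
        return scale * x + shift
    return layer


def looped_forward(x, layers, num_loops):
    """Apply the same stack of layers num_loops times in sequence."""
    applications = 0
    for _ in range(num_loops):
        for layer in layers:
            x = layer(x)
            applications += 1
    return x, applications


physical_layers = [make_layer(1.0, 0.5) for _ in range(20)]
y, effective_depth = looped_forward(0.0, physical_layers, num_loops=4)

print(len(physical_layers))  # 20 parameter sets are stored
print(effective_depth)       # 80 layer applications per forward pass
```

The parameter count depends only on the length of `physical_layers`, while the work done per input depends on `len(physical_layers) * num_loops`.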

Three variants are included. LT2-linear replaces attention with linear-attention methods such as Mamba2, DeltaNet, and RetNet, which process tokens using a small fixed-size memory state rather than comparing all tokens pairwise. LT2-sparse uses sliding-window attention, where each token attends only to nearby tokens rather than the whole sequence. LT2-hybrid mixes a small number of standard attention layers in with the faster linear-attention layers; according to the paper, this hybrid reaches better quality than a standard looped transformer while decoding about 2.7 times faster.
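The two mechanisms behind the linear and sparse variants can be sketched in plain Python. This is a simplified illustration, not the update rule of Mamba2, DeltaNet, or RetNet (each adds its own decay, gating, or delta-rule terms) and not code from the repository. It assumes the most basic causal, unnormalized linear attention, and all function names are hypothetical.

```python
# Hypothetical illustration; simplified, and not the LT2 kernels.

def linear_attention_recurrent(queries, keys, values):
    """Causal, unnormalized linear attention with a fixed-size state.

    The state is a d_k x d_v matrix whose size does not depend on how
    many tokens have been processed.
    """
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v in zip(queries, keys, values):
        # Write: add the outer product of k and v to the state.
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] += k[i] * v[j]
        # Read: multiply the query by the state.
        outputs.append(
            [sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs


def linear_attention_pairwise(queries, keys, values):
    """Same result, computed by comparing each token with all earlier ones."""
    d_v = len(values[0])
    outputs = []
    for t, q in enumerate(queries):
        out = [0] * d_v
        for s in range(t + 1):
            weight = sum(a * b for a, b in zip(q, keys[s]))
            for j in range(d_v):
                out[j] += weight * values[s][j]
        outputs.append(out)
    return outputs


def sliding_window_mask(n, window):
    """mask[t][s] is True when token t may attend to token s."""
    return [[0 <= t - s < window for s in range(n)] for t in range(n)]


q = [[1, 0], [0, 1], [1, 1]]
k = [[1, 2], [0, 1], [2, 0]]
v = [[1, 0], [0, 1], [1, 1]]
same = linear_attention_recurrent(q, k, v) == linear_attention_pairwise(q, k, v)
print(same)  # True: the fixed-size state reproduces the pairwise sum

mask = sliding_window_mask(4, window=2)
print(mask[3])  # [False, False, True, True]
```

The recurrent form does constant work per token regardless of position, while the sliding-window mask caps each row at `window` allowed positions; both keep per-token cost independent of total sequence length.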

The repository is built on Meta's Lingua pre-training framework and is structured for training on GPU clusters, either via SLURM job scheduling or torchrun for local multi-GPU setups. It includes configuration files for reproducing the paper's experiments at 600 million and 1.3 billion parameter scales, training on the FineWeb-Edu dataset. Custom GPU kernel code is included for the performance-critical parts.

This is a research-oriented project aimed at people studying language model architecture. Running it requires significant GPU resources and familiarity with distributed training tooling.

Where it fits