streaming-llm

Python ★ 7.2k updated 1y ago

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks

StreamingLLM lets AI language models run indefinitely in long chat sessions without crashing or resetting, by keeping only the first few anchor tokens and the most recent tokens in memory.

Python · PyTorch · HuggingFace Transformers · setup: hard · complexity 4/5

StreamingLLM is a research project from MIT Han Lab that addresses a practical limitation of AI language models in long-running applications. Language models such as Llama-2 are trained to handle text only up to a certain length, called a context window. In a long back-and-forth chat session, the conversation can grow beyond that limit, at which point the model either has to restart and forget what was said earlier, or spend significant compute time reprocessing the recent history. Both options are costly.

The key observation behind this project is called an attention sink. When a language model processes text, it assigns attention scores to tokens to decide which parts of the text to focus on. The researchers found that early tokens in a sequence receive very high attention scores regardless of how important they actually are, acting as a kind of anchor. Removing those early tokens causes the model's quality to drop noticeably, even if they contain little useful content.

StreamingLLM works by keeping two things in the model's key-value cache: the most recent tokens the model has seen, and the initial anchor tokens that serve as attention sinks. Everything in the middle gets discarded. This allows the model to keep running indefinitely without resetting its cache, and without the cost of recomputing past states. According to the paper, this approach achieves up to a 22.2x speedup over the sliding window with recomputation baseline, which rebuilds the cache from the recent tokens for every newly generated token.
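The eviction rule described above can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the project's code: the function name `evict_middle` and the parameters `num_sink` and `num_recent` are hypothetical, and the cache is modeled as a flat list of token ids, whereas a real implementation would slice per-layer key and value tensors along the sequence dimension.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def evict_middle(cache: Sequence[T], num_sink: int = 4, num_recent: int = 8) -> List[T]:
    """Keep the first `num_sink` and the last `num_recent` entries; drop the rest.

    Hypothetical helper for illustration. The default sizes are arbitrary.
    """
    if num_sink < 0 or num_recent < 0:
        raise ValueError("num_sink and num_recent must be non-negative")
    # Nothing to evict while the cache still fits in the budget.
    if len(cache) <= num_sink + num_recent:
        return list(cache)
    sinks = list(cache[:num_sink])
    # Slice from an explicit start index so num_recent == 0 yields an empty list.
    recent = list(cache[len(cache) - num_recent:])
    return sinks + recent


# Simulate streaming 100 tokens through a cache with a budget of 4 + 8 entries.
cache: List[int] = []
max_size_seen = 0
for token_id in range(100):
    cache.append(token_id)
    cache = evict_middle(cache, num_sink=4, num_recent=8)
    max_size_seen = max(max_size_seen, len(cache))

print(cache)          # [0, 1, 2, 3, 92, 93, 94, 95, 96, 97, 98, 99]
print(max_size_seen)  # 12
```

The cache never grows past its fixed budget, so memory use and per-token cost stay bounded however long the stream runs. Tokens 4 through 91 are gone for good in this run, so nothing that depended on them can be recovered later.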

It is important to understand what this does not do. The model's context window does not grow. The model cannot see or reason about the tokens that were discarded from the middle of a long conversation. Feeding an entire book into StreamingLLM and asking for a summary would only produce a summary of the final pages, because the model can only work with what is currently in its window. The project is designed for continuous dialogue and assistant-style applications where the model needs to keep running without crashing, not for tasks requiring full-document comprehension.

StreamingLLM has been integrated into HuggingFace Transformers, NVIDIA TensorRT-LLM, and Intel's extension for Transformers. It supports Llama-2, MPT, Falcon, and Pythia. The paper was accepted at ICLR 2024.

Where it fits

Build a chat assistant that runs non-stop for hours without resetting context or crashing.
Integrate efficient streaming inference into Llama-2, Falcon, or MPT models via HuggingFace.
Research how attention sink tokens enable infinite-length generation without full-context recomputation.
