LWM

Python ★ 7.4k updated 1y ago

Large World Model -- Modeling Text and Video with Millions Context

An AI model that processes up to one million tokens of text, images, and video at once, letting you ask questions about very long documents or hour-long video clips.

Python · JAX · PyTorch · TPU · CUDA · setup: hard · complexity 5/5

Large World Model (LWM) is an AI system that can read and understand very long pieces of text, images, and video all at once. Most AI models can only look at a limited amount of information at a time, similar to reading just the first few pages of a book before answering questions about it. LWM extends that window to one million tokens, which is roughly the equivalent of several full-length novels or about an hour of video, allowing it to answer questions about content that appears anywhere in a very long document or clip.

The project trains a 7-billion-parameter neural network on a large collection of books and diverse videos. The training process uses a technique called RingAttention, which distributes the work of processing very long sequences across many processors simultaneously. Without this, training on such long inputs would exceed the memory limits of any single piece of hardware. The team gradually increased the context size during training, starting at 4,000 tokens and working up to one million.
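To make the memory argument concrete: a naively materialized attention score matrix for one million tokens has 10^12 entries per head, roughly 2 TB in 16-bit floats. The sketch below illustrates the general idea behind ring-style attention in plain Python and is not the project's implementation. Each simulated device keeps its own queries and sees one key/value block at a time, combining the blocks with a running softmax so the full score matrix never exists in one place. The function names are hypothetical, and causal masking and real inter-device communication are left out for brevity.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def full_attention(q, keys, values):
    """Reference: softmax attention for one query over the whole sequence."""
    scores = [dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]


def blockwise_attention(q, kv_blocks):
    """Same result, computed one key/value block at a time.

    Only a running max, a running normalizer, and a running weighted sum
    are kept between blocks.
    """
    running_max = -math.inf
    normalizer = 0.0
    acc = [0.0] * len(kv_blocks[0][1][0])
    for keys, values in kv_blocks:
        scores = [dot(q, k) for k in keys]
        new_max = max(running_max, max(scores))
        # Rescale the state accumulated so far to the new reference maximum.
        # On the first block this factor is exp(-inf) == 0.0.
        rescale = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        normalizer = normalizer * rescale + sum(weights)
        acc = [a * rescale + sum(w * v[d] for w, v in zip(weights, values))
               for d, a in enumerate(acc)]
        running_max = new_max
    return [a / normalizer for a in acc]


def ring_attention(queries_per_device, kv_per_device):
    """Simulate a ring: each device starts with its own key/value block and
    receives its neighbours' blocks one step at a time."""
    n = len(queries_per_device)
    outputs = []
    for dev in range(n):
        arrival_order = [kv_per_device[(dev - step) % n] for step in range(n)]
        outputs.append([blockwise_attention(q, arrival_order)
                        for q in queries_per_device[dev]])
    return outputs
```

Because the running softmax is exact, every device ends up with the same output it would get from ordinary attention over the full sequence, while only ever holding one block of keys and values at a time.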

The released models come in several variants: some are text-only and others understand both text and video, and each kind is available as a base model or as a chat-tuned version you can have a conversation with. The vision-language models run on TPUs using the JAX framework, while the text-only versions also work with PyTorch on standard GPUs. The README includes setup instructions, a table listing each available model along with its context size and download links, and guidance on configuration parameters that control how computation is split across hardware.
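As a rough illustration of what "splitting computation across hardware" involves, the sketch below checks a parallelism layout before a run. It assumes four axes (data, fully sharded data, tensor, and sequence parallelism) written as a comma-separated string. The axis names, the string format, and both helper functions are assumptions made for this example; consult the README for the project's actual parameter syntax.

```python
from math import prod

# Assumed axis names, for illustration only.
AXES = ("dp", "fsdp", "tp", "sp")


def parse_mesh_dim(spec, num_devices):
    """Parse a layout such as "1,1,4,8" and check it uses every device."""
    sizes = [int(part) for part in spec.split(",")]
    if len(sizes) != len(AXES):
        raise ValueError(f"expected {len(AXES)} axes, got {len(sizes)}")
    if prod(sizes) != num_devices:
        raise ValueError(
            f"layout covers {prod(sizes)} devices, but {num_devices} are available"
        )
    return dict(zip(AXES, sizes))


def tokens_per_device(context_len, mesh):
    """Tokens each device holds when the sequence is split along the sp axis."""
    if context_len % mesh["sp"] != 0:
        raise ValueError("context length must divide evenly across the sp axis")
    return context_len // mesh["sp"]


mesh = parse_mesh_dim("1,1,4,8", num_devices=32)
print(mesh, tokens_per_device(1_000_000, mesh))
```

With 32 devices arranged this way, a one-million-token sequence is split into eight chunks of 125,000 tokens along the sequence axis, which is the kind of trade-off these settings control.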

Practical capabilities demonstrated in the README include retrieving specific facts buried inside a one-million-token document with high accuracy, answering questions about the content of a one-hour YouTube video, chatting about individual images, and generating images or short video clips from text prompts. The code is supported on Ubuntu; Windows and macOS have not been tested.

Where it fits

Ask questions about facts buried anywhere inside a very long document such as a full book or lengthy legal contract.
Query the content of an hour-long YouTube video by feeding it directly to the model.
Use the chat-tuned variant to have a conversation about a set of images without splitting them into smaller batches.
