AlignVid
AlignVid is an ICML 2026 research method that makes AI video and image models follow text prompts more faithfully by rebalancing internal attention at inference time, without retraining the base model.
AlignVid is a research project, accepted at ICML 2026, that addresses a specific problem with AI models that generate video or images from text instructions. The problem is called visual dominance: when you give these models an image and a text prompt asking for significant changes, the model often ignores the text and just reproduces the original image with minor modifications. AlignVid is a method for fixing that without retraining the model.
The fix works by adjusting how the model distributes its attention internally during the generation process, specifically rebalancing how much weight the text description gets versus the visual input. This happens entirely inside the model at inference time, with no changes to the model's weights and no additional training data. The two mechanisms involved are called Attention Scaling Modulation, which sharpens the attention signal toward the text, and Guidance Scheduling, which controls when and where in the network that sharpening is applied.
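The two mechanisms can be illustrated with a toy sketch. This is not the repository's implementation. It assumes that "sharpening" means multiplying the pre-softmax attention logits over text tokens by a factor greater than 1, and that the schedule is a simple window over denoising steps and layers. The function names, the scale of 1.5, and the window bounds are all hypothetical, chosen only for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; lower means a sharper distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def is_active(step, layer, steps, layers):
    """Hypothetical schedule: half-open (start, end) windows over steps and layers."""
    return steps[0] <= step < steps[1] and layers[0] <= layer < layers[1]

def text_attention(logits, scale, step, layer, steps, layers):
    """Attention weights of one query over text tokens.

    Inside the scheduled window the logits are multiplied by `scale`
    (assumed form of the modulation); outside it they are left untouched.
    """
    if is_active(step, layer, steps, layers):
        logits = [scale * x for x in logits]
    return softmax(logits)

# Illustrative logits for one query attending to four text tokens.
logits = [2.0, 1.0, 0.5, -1.0]

# Step 30 falls outside the hypothetical window, so nothing changes.
baseline = text_attention(logits, 1.5, step=30, layer=0, steps=(0, 10), layers=(0, 20))

# Step 3 falls inside the window, so the distribution is sharpened.
sharpened = text_attention(logits, 1.5, step=3, layer=0, steps=(0, 10), layers=(0, 20))

print("baseline entropy: ", round(entropy(baseline), 4))
print("sharpened entropy:", round(entropy(sharpened), 4))
```

In this sketch no weights are touched: the only change is a multiplication applied to logits at selected steps and layers, which is consistent with the small overhead the authors report.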
The same method works across four types of AI generation tasks: converting an image to video using a text prompt, generating video from text alone, generating images from text, and editing existing images. The authors tested it on several publicly available model families and found it improved how faithfully the outputs matched the text prompt, with less than 0.1 percent added computation time.
The code in this repository integrates AlignVid into two specific model families, FramePack and Wan2.1. Using it requires setting up one of those base models first, then enabling AlignVid through a command-line flag when running generation. The default setting uses a single scaling value, which the authors report transfers well across models without a per-model search.
The repository also includes a benchmark dataset called OmitI2V, which contains 367 human-annotated examples covering add, delete, and modify prompts, along with questions for evaluating how well a model followed the instruction. The dataset is hosted on Hugging Face and the evaluation code is included in the repository.
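As a rough picture of how question-based evaluation over add, delete, and modify prompts might be aggregated, the sketch below computes per-type accuracy from answered questions. It is not the repository's evaluation code. The record fields (`edit_type`, `expected`, `predicted`) and the sample records are hypothetical and do not reflect the actual OmitI2V schema or any real results.

```python
from collections import defaultdict

def accuracy_by_type(records):
    """Fraction of questions answered as expected, grouped by edit type."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        edit_type = record["edit_type"]
        totals[edit_type] += 1
        # A question counts as correct when the judged answer matches the annotation.
        correct[edit_type] += int(record["predicted"] == record["expected"])
    return {t: correct[t] / totals[t] for t in totals}

# Made-up records for illustration only (hypothetical schema, not real data).
records = [
    {"edit_type": "add", "expected": "yes", "predicted": "yes"},
    {"edit_type": "add", "expected": "yes", "predicted": "no"},
    {"edit_type": "delete", "expected": "no", "predicted": "no"},
    {"edit_type": "modify", "expected": "yes", "predicted": "no"},
]

scores = accuracy_by_type(records)
for edit_type, score in sorted(scores.items()):
    print(f"{edit_type}: {score:.2f}")
```

Reporting accuracy per edit type rather than as one pooled number makes it easier to see whether a model fails mainly on additions, deletions, or modifications.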
Where it fits
- Improve how faithfully FramePack or Wan2.1 follows text prompts when generating video from an image, without any retraining.
- Evaluate an image-to-video model's text alignment using the OmitI2V benchmark of 367 human-annotated examples.
- Apply AlignVid to an image-editing pipeline to reduce cases where the model reproduces the input image instead of following the instruction.