CogVideo
text and image to video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
Open-source AI models from Tsinghua University that generate short video clips from a text description or a starting image. The 2B model runs on older consumer GPUs and the 5B model fits on an RTX 3060.
CogVideo and CogVideoX are open-source AI models for generating videos from text descriptions or from images. You write a prompt describing what you want to see, and the model produces a short video clip matching that description. The project comes from researchers at Tsinghua University and ZhipuAI in China and spans two generations: the original CogVideo published at a major AI conference in 2023, and the newer CogVideoX series released in 2024.
The CogVideoX series comes in two sizes, 2 billion and 5 billion parameters, which refer to the scale of the underlying model. The smaller 2B model can run on older graphics cards like an NVIDIA GTX 1080 Ti, while the 5B model fits on a consumer desktop card like an RTX 3060. A larger CogVideoX1.5 variant supports longer videos of up to 10 seconds at higher resolution. The models support three tasks: generating a video purely from a text prompt, continuing an existing video, and generating a video starting from an image combined with a text prompt.
To use the models, you install the required Python packages and run inference scripts from the command line. The README notes that using a large language model like GPT-4 or GLM-4 to rewrite and expand your prompt before feeding it to CogVideoX significantly improves output quality, because the model was trained on long, detailed descriptions rather than short phrases.
Fine-tuning is also supported for users who want to adapt the model to specific visual styles or content types. A separate fine-tuning toolkit called CogKit was released in early 2025. The README documents two code paths for running the models: one using a framework called SAT, aimed at researchers who want to modify the model internals, and one using the Hugging Face Diffusers library, which is simpler and more familiar to practitioners. Online demos are available on Hugging Face Spaces and ModelScope for trying the 5B model without installing anything.
Where it fits
- Generate a 5-second video clip from a text description for social media content or product demos.
- Animate a still image by pairing it with a text prompt describing what should happen in the scene.
- Fine-tune the model on a custom dataset of short clips using CogKit to produce content in a specific visual style.
- Continue an existing video clip by providing it as input alongside a new text prompt.