JoyAI-Echo

Python ★ 1.6k updated 5d ago

JoyAI-Echo: Pushing the Frontier of Long Audio-Visual Generation

JoyAI-Echo is a research framework from JD.com that generates videos up to five minutes long from text descriptions. Unlike most video generation tools, which produce clips of a few seconds, it focuses on coherent multi-shot sequences in which characters look and sound consistent from one scene to the next. It produces synchronized audio alongside the video in a single pipeline rather than adding audio as a separate step afterward.

The central technical idea is a shared memory bank that stores the visual appearance of characters and the sound of their voices after each generated scene, then uses that stored information to condition each new scene that follows. This is what allows a five-minute video to maintain recognizable characters across many different shots. The system also uses a distillation technique to speed up the slow diffusion-based generation process by roughly 7.5 times compared to the original approach.
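The memory-bank loop described above can be pictured with a toy sketch. Everything in it is an assumption made for illustration: `MemoryBank`, `generate_shot`, and `generate_video` are hypothetical names, the string "references" stand in for whatever features the real model stores, and none of this is JoyAI-Echo's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryBank:
    """Hypothetical store of per-character appearance and voice references."""
    appearance: dict = field(default_factory=dict)
    voice: dict = field(default_factory=dict)

    def update(self, shot_result):
        # Assumption: the newest reference for a character replaces the older one.
        self.appearance.update(shot_result["appearance"])
        self.voice.update(shot_result["voice"])


def generate_shot(prompt, characters, memory):
    """Stand-in for the real generator.

    It only records which characters already had stored references,
    i.e. which ones the new shot could be conditioned on.
    """
    conditioned = sorted(c for c in characters if c in memory.appearance)
    return {
        "prompt": prompt,
        "conditioned_on": conditioned,
        "appearance": {c: f"look:{c}" for c in characters},
        "voice": {c: f"voice:{c}" for c in characters},
    }


def generate_video(shots):
    """Generate shots in order, updating the shared memory after each one."""
    memory = MemoryBank()
    results = []
    for prompt, characters in shots:
        result = generate_shot(prompt, characters, memory)
        memory.update(result)
        results.append(result)
    return results, memory


shots = [
    ("Ana enters the kitchen", ["Ana"]),
    ("Ana greets Ben", ["Ana", "Ben"]),
    ("Ben leaves", ["Ben"]),
]
results, memory = generate_video(shots)
for r in results:
    print(r["prompt"], "->", r["conditioned_on"])
```

The first shot has nothing to condition on. The second can reuse Ana's stored look and voice, and the third can reuse Ben's, which is the property that keeps characters recognizable across shots.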

The project is described as inference-only, meaning it includes pre-trained model weights and the code to run them, but not the code or data used to train the model from scratch. The model weights total around 70 gigabytes across the main model file and a text-understanding component from Google called Gemma. Running the system requires a modern NVIDIA GPU with CUDA support and substantial video memory.
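Before downloading the weights, a quick pre-flight check can save time. The sketch below uses only the Python standard library. The 70 GB figure is the approximate total quoted above, and `preflight` is a hypothetical helper, not part of the project. Finding `nvidia-smi` on the PATH only suggests that an NVIDIA driver is installed; it does not confirm that the GPU has enough video memory.

```python
import shutil

REQUIRED_WEIGHTS_GB = 70  # approximate total quoted in the summary above


def preflight(weights_dir=".", required_gb=REQUIRED_WEIGHTS_GB):
    """Report free disk space and whether nvidia-smi is on the PATH."""
    free_gb = shutil.disk_usage(weights_dir).free / 1e9
    return {
        "free_disk_gb": round(free_gb, 1),
        "enough_disk": free_gb >= required_gb,
        # Presence of the tool hints at an NVIDIA driver, nothing more.
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
    }


report = preflight()
print(report)
```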

Generating a video starts with a JSON file listing one or more shot descriptions. The README recommends running your initial idea through a provided prompt-enhancer prompt before writing the final shot descriptions, because bare short prompts produce noticeably weaker results. Each shot description should cover the roles and subjects in the scene, the environment, the action, audio elements, the camera angle, and the desired mood.
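As an illustration of a shot list that covers all six aspects, the sketch below assembles one shot and serializes it to JSON. The layout is an assumption: the `"prompt"` key, the aspect names in `SHOT_ASPECTS`, and the `build_shot` helper are hypothetical, and the real input schema should be taken from the project's README.

```python
import json

# The six aspects the summary says each shot description should cover.
SHOT_ASPECTS = ("subjects", "environment", "action", "audio", "camera", "mood")


def build_shot(**aspects):
    """Join the six aspects into one description, rejecting incomplete shots."""
    missing = [a for a in SHOT_ASPECTS if not aspects.get(a, "").strip()]
    if missing:
        raise ValueError(f"shot is missing: {', '.join(missing)}")
    # Hypothetical schema: one object per shot with a single "prompt" string.
    return {"prompt": " ".join(aspects[a].strip() for a in SHOT_ASPECTS)}


shots = [
    build_shot(
        subjects="A chef in a white apron stands at a steel counter.",
        environment="A small restaurant kitchen lit by warm overhead lamps.",
        action="She slices vegetables and slides them into a hot pan.",
        audio="Knife taps, oil sizzling, faint dining-room chatter.",
        camera="Medium shot at counter height, slowly pushing in.",
        mood="Calm and focused.",
    )
]

shots_json = json.dumps(shots, indent=2)
print(shots_json)
```

Validating that every aspect is present before writing the file is a cheap guard against the bare, short prompts that the README warns produce weaker results.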

The release comes from JD.com's open-source team and is accompanied by a research paper. Human evaluation results in the paper show it outperforming another JD model called HappyOyster on long-form video across visual quality, audio quality, and prompt following.
