ComfyUI_JoyAI_Echo

Python · ★ 45 · updated 2d ago

Pushing the Frontier of Long Video Generation. Standalone, inference-only release for minute-level multi-shot audio-video generation with a distilled DMD generator, paired cross-modal memory, and story-level consistency.

A ComfyUI plugin that adds a video-plus-audio generation model from JD, letting you produce multi-minute videos from text prompts on a consumer GPU with as little as 6 GB of video memory.

Python · ComfyUI · GGUF · setup: hard · complexity: 4/5

This is a plugin for ComfyUI, a visual workflow tool used to run AI image and video models on your own computer. The plugin adds support for JoyAI-Echo, a video generation system developed by JD (a large Chinese tech company) that can produce videos up to several minutes long, with synchronized audio, from a text description.

What makes it notable is the low hardware bar. According to the README, a graphics card with just 6 GB of video memory can generate a five-minute video at 848 by 512 pixels. That is unusually accessible for long-video AI work, which normally demands much more powerful hardware. The plugin achieves this partly by supporting compressed model files in the GGUF format, which trade a small amount of quality for much lower memory use.
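To make the memory trade-off concrete, here is a rough back-of-the-envelope sketch of how bits per weight translate into weight storage. The parameter count below is a hypothetical placeholder, not JoyAI-Echo's actual size, and the calculation covers weights only (activations, the text encoder, and the audio and video compression models add more). The bit widths for Q8_0 and Q4_0 follow from the GGUF block layout of 32 weights plus one 16-bit scale per block.

```python
# Approximate bits stored per weight for a few formats.
# GGUF Q8_0 and Q4_0 pack 32 weights per block plus one fp16 scale,
# which gives 34 and 18 bytes per block respectively.
BITS_PER_WEIGHT = {
    "fp16": 16.0,
    "q8_0": 34 * 8 / 32,  # 8.5 bits per weight
    "q4_0": 18 * 8 / 32,  # 4.5 bits per weight
}


def weight_memory_gib(n_params: int, fmt: str) -> float:
    """Return the approximate size of the weights alone, in GiB."""
    return n_params * BITS_PER_WEIGHT[fmt] / 8 / 2**30


if __name__ == "__main__":
    # Hypothetical parameter count, chosen only for illustration.
    hypothetical_params = 5_000_000_000
    for fmt in BITS_PER_WEIGHT:
        size = weight_memory_gib(hypothetical_params, fmt)
        print(f"{fmt}: {size:.2f} GiB")
```

Under these assumptions, a 4-bit file needs a little over a quarter of the space of the 16-bit original, which is the kind of saving that makes a 6 GB card plausible.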

To use it, you clone the plugin into ComfyUI's custom nodes folder, install the Python dependencies, and then download several large model files from Hugging Face. The file layout the README describes includes a video model, separate audio and video compression models, and a language model that processes your text prompts. Once those are in place, you connect the nodes in ComfyUI's visual editor and run inference.
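Because the setup involves several separate downloads, a small pre-flight check can help confirm everything is in place before opening ComfyUI. The sketch below is only an illustration: apart from the custom_nodes folder and the plugin's repository name, every path in EXPECTED is a made-up placeholder, so substitute the file names and folders the README actually lists.

```python
from pathlib import Path


def missing_paths(comfy_root, expected):
    """Return the relative paths from `expected` that do not exist under comfy_root."""
    root = Path(comfy_root)
    return [rel for rel in expected if not (root / rel).exists()]


# Placeholder layout: only the first entry is taken from the project name.
# The model paths are hypothetical and must be replaced with the README's.
EXPECTED = [
    "custom_nodes/ComfyUI_JoyAI_Echo",
    "models/PLACEHOLDER_video_model.gguf",
    "models/PLACEHOLDER_video_vae.safetensors",
    "models/PLACEHOLDER_audio_vae.safetensors",
    "models/PLACEHOLDER_text_encoder",
]

if __name__ == "__main__":
    # Assumes the script is run from the ComfyUI root directory.
    for rel in missing_paths(".", EXPECTED):
        print(f"missing: {rel}")
```

Running it from the ComfyUI root prints one line per missing item and nothing when every listed path exists.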

The current release is inference-only, meaning you can generate videos but not train or fine-tune the underlying model yourself. Text-to-video works with the provided checkpoints; image-to-video requires a version 1.5 model that was still in training when the README was written. The project includes one example workflow image to get started. The README has a mix of Chinese and English notes, with the technical instructions in English.

Where it fits

Generate a five-minute AI video with synchronized audio from a text description on a consumer GPU with only 6 GB of video memory.
Add JoyAI-Echo video generation nodes to an existing ComfyUI workflow to produce long-form video output alongside other AI image tools.
Use GGUF compressed model files to run video generation on hardware that would otherwise lack enough memory for full-precision models.
