ACE-Step
ACE-Step: A Step Towards Music Generation Foundation Model
ACE-Step is an open-source AI model for music generation. Its goal is to serve as a foundation model: a general-purpose base on which other music AI tools can be built, much as Stable Diffusion became for images. It is designed to overcome trade-offs common in existing music generation models: some align lyrics to melody well but are slow, while others generate quickly but lack long-range musical coherence.
The model combines diffusion-based generation (a technique that gradually refines audio from noise) with a Deep Compression AutoEncoder (DCAE) and a lightweight linear transformer. With this design it can generate up to 4 minutes of music in 20 seconds on an A100 GPU, which the authors describe as 15 times faster than comparable approaches, while maintaining coherence across melody, harmony, and rhythm. It supports 19 languages and a broad range of musical styles and genres.
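To put the speed figures in context, the short sketch below works through the arithmetic. It uses only the numbers quoted above (4 minutes of audio, 20 seconds of compute, a 15 times speedup). The function name real_time_factor and the variable names are illustrative and are not part of ACE-Step. The implied baseline time assumes the 15 times figure applies to the same 4-minute clip on the same hardware, which the summary does not state explicitly.

```python
def real_time_factor(audio_seconds: float, wall_clock_seconds: float) -> float:
    """Return seconds of audio produced per second of compute."""
    if wall_clock_seconds <= 0:
        raise ValueError("wall_clock_seconds must be positive")
    return audio_seconds / wall_clock_seconds


# Figures quoted in the text: up to 4 minutes of music in 20 seconds on an A100.
audio_seconds = 4 * 60
generation_seconds = 20

rtf = real_time_factor(audio_seconds, generation_seconds)
print(f"Real-time factor: {rtf:.0f}x")

# Assumption: "15 times faster" refers to generating the same clip on the same GPU.
claimed_speedup = 15
implied_baseline_seconds = generation_seconds * claimed_speedup
print(f"Implied baseline time: {implied_baseline_seconds} s")
```

Note that the two ratios measure different things: the first compares generation time with the duration of the audio (12 times faster than real time), while the second compares ACE-Step with other models. Under the stated assumption, a comparable approach would need about 300 seconds, or 5 minutes, for the same clip.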
Beyond basic text-to-music generation, ACE-Step supports several fine-grained control mechanisms: voice cloning, lyric editing (changing specific words in a song while preserving the rest), generating variations of existing audio, and track generation modes such as lyric2vocal and singing2accompaniment. These capabilities are available through LoRA fine-tuned variants — smaller, specialized models trained on top of the base. A ComfyUI integration is also available.
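As background on why LoRA variants are small, here is a minimal sketch of the general LoRA idea: the base weight matrix W stays frozen, and a low-rank product B·A, scaled by alpha / rank, is added to it. This illustrates the generic technique only. It is not ACE-Step's training code, and the helper names, matrix sizes, rank, and alpha value are all hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(W, A, B, alpha):
    """Return W + (alpha / rank) * (B @ A); the base weights W are left unmodified."""
    rank = len(A)  # A has shape (rank, d_in), B has shape (d_out, rank)
    scale = alpha / rank
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]


def lora_param_counts(d_out, d_in, rank):
    """Return (parameters in the full matrix, parameters in the LoRA adapter)."""
    return d_out * d_in, rank * (d_out + d_in)


# Toy example: a 2x3 base weight with a rank-1 adapter (values are arbitrary).
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
B = [[1.0],
     [2.0]]
A = [[0.5, 0.0, 1.0]]

W_adapted = lora_effective_weight(W, A, B, alpha=1.0)
print(W_adapted)

# Hypothetical layer size, chosen only to show the scale of the savings.
full_params, adapter_params = lora_param_counts(1024, 1024, rank=8)
print(full_params, adapter_params)
```

For the hypothetical 1024 by 1024 layer, the adapter stores 16,384 values instead of 1,048,576, which is why a specialized variant can be distributed as a small add-on to the base model rather than as a full copy.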
Peak memory use has been reduced to at most 8 GB of VRAM, which brings the model within reach of consumer-grade GPUs. The project is written in Python and released as open source.