Foley-Omni

Python ★ 22 updated 14d ago

Foley-Omni: a unified multimodal audio generation model for task-level synthesis and complete video soundtrack generation, producing speech, sound effects, and music conditioned on text and video.

AI research tool from Nanjing University that generates synchronized soundtracks (speech, sound effects, and music together) for silent video clips using a text prompt to describe what you want to hear.

Python, PyTorch, CUDA, FlashAttention, Hugging Face, YAML, JSON, MMAudio. Setup: hard. Complexity: 4/5.

Foley-Omni is a Python research project from Nanjing University that generates audio for silent or muted videos using an AI model. Given a video clip and a text description, the model produces a complete soundtrack containing speech, sound effects, and background music together, all synchronized with what is happening on screen. This kind of task, sometimes called video-to-soundtrack generation, is the main focus of the project.

The text prompt fed to the model uses a structured format with three optional blocks. A WORDS block gives the words to be spoken. An AUDIO_CAPTION block describes ambient sounds, sound events, and speaker characteristics. A MUSIC block specifies music style, mood, instruments, and tempo. Any combination of the three is allowed, so you can generate only sound effects, only music, only speech, or all three at once. The model also supports text-only generation without any video input.
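
As a rough illustration of the three-block prompt described above, the sketch below assembles a prompt from whichever blocks are supplied. The block names WORDS, AUDIO_CAPTION, and MUSIC come from the description; the "NAME: text" line layout and the helper name build_prompt are assumptions made for this example, not the project's confirmed syntax, so check the README for the exact delimiters.

```python
from typing import Optional


def build_prompt(
    words: Optional[str] = None,
    audio_caption: Optional[str] = None,
    music: Optional[str] = None,
) -> str:
    """Assemble a prompt from whichever of the three blocks are provided."""
    # Block names follow the description above. The "NAME: text" layout is
    # an assumption for illustration, not the project's confirmed format.
    blocks = [
        ("WORDS", words),
        ("AUDIO_CAPTION", audio_caption),
        ("MUSIC", music),
    ]
    # Keep only blocks that were supplied and are not blank.
    lines = [
        f"{name}: {text.strip()}"
        for name, text in blocks
        if text and text.strip()
    ]
    if not lines:
        raise ValueError("provide at least one of words, audio_caption, music")
    return "\n".join(lines)


if __name__ == "__main__":
    # Sound effects plus music, no speech.
    print(build_prompt(
        audio_caption="rain on a tin roof, distant thunder",
        music="slow solo piano, melancholic, 60 bpm",
    ))
```

Leaving a block out simply omits it from the prompt, which mirrors the "any combination" behavior: passing only music would request music alone.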

The current public checkpoint is designed for videos up to 10 seconds long. Running inference involves setting up a YAML config file that points to input videos and their prompt data, then running a Python inference script. The output is an MP4 file with the generated audio merged in. A batch mode accepts a JSON manifest listing multiple videos. Visual features can be pre-extracted to speed up repeated inference on the same footage.
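
The batch mode could be prepared with a small script like the one below, which builds a JSON manifest and rejects clips longer than the 10-second limit mentioned above. This is only a sketch: the field names "video" and "prompt", the helper names, and the overall manifest layout are hypothetical, since this summary does not give the real schema. Clip durations are passed in by the caller rather than probed from the files.

```python
import json
from pathlib import Path

MAX_SECONDS = 10.0  # limit stated for the current public checkpoint


def make_manifest(entries):
    """Build a manifest from (video_path, prompt, duration_seconds) tuples.

    The "video" and "prompt" keys are hypothetical placeholders; the real
    manifest schema may use different names.
    """
    manifest = []
    for video_path, prompt, duration in entries:
        if duration > MAX_SECONDS:
            raise ValueError(
                f"{video_path} is {duration:.1f}s, over the {MAX_SECONDS:.0f}s limit"
            )
        manifest.append({"video": str(video_path), "prompt": prompt})
    return manifest


def write_manifest(manifest, out_path):
    """Write the manifest to disk as indented JSON."""
    Path(out_path).write_text(json.dumps(manifest, indent=2), encoding="utf-8")


if __name__ == "__main__":
    items = [
        ("clips/street.mp4", "AUDIO_CAPTION: traffic, footsteps", 8.0),
        ("clips/cafe.mp4", "MUSIC: soft jazz trio", 9.5),
    ]
    write_manifest(make_manifest(items), "batch_manifest.json")
```

Validating duration before inference avoids spending a GPU run on a clip the checkpoint was not designed for.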

Installation requires Python 3.10, CUDA 12.4, PyTorch 2.6, and FlashAttention. Model weights are downloaded from Hugging Face and consist of several components: the Foley-Omni checkpoint itself, a text encoder from the Wan2.2 video model, and pre-trained audio components from MMAudio. The total download is substantial.

This is a research code release accompanying an arXiv paper. A benchmark dataset (V2ST-Bench) and a Hugging Face demo are listed as coming soon. No license is stated in the README.

Where it fits

Add a realistic soundtrack to a silent video clip by describing the sounds, speech, and music you want in a text prompt.
Generate background music and ambient sound effects for short video content without recording real audio.
Research and benchmark AI models that automatically sync audio to video for academic or experimental purposes.
Produce voiceover speech combined with background music for a video scene using a single AI inference run.
