stable-audio-tools

Python ★ 3.8k updated 1d ago

Generative models for conditional audio generation

Stable Audio Tools is a Python library from Stability AI that contains the training and inference code for its audio generation models. These models take a text description or other conditioning input and produce audio. The repository is the technical foundation behind Stable Audio, the company's publicly released audio generation product.

For someone who just wants to try out a pre-trained model without training anything themselves, the README describes a web-based interface built with a tool called Gradio. You run a single command pointing it at a model hosted on Hugging Face and get a local interface in your browser where you can type prompts and hear the generated audio. The pre-trained model it uses as an example is called stable-audio-open-1.0, which requires accepting a license agreement on Hugging Face before downloading.
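As a rough sketch of that single command, the Python snippet below assembles and prints it. The script name `run_gradio.py`, the `--pretrained-name` flag, and the `stabilityai/` prefix on the model ID are assumptions recalled from the repository's README and should be checked against the current version. `build_gradio_command` is a hypothetical helper, not part of the library.

```python
import shlex
import sys


def build_gradio_command(pretrained_name, script="run_gradio.py"):
    """Return the argv list for launching the Gradio demo.

    Both the script name and the flag name are assumptions, not verified API.
    """
    return [sys.executable, script, "--pretrained-name", pretrained_name]


# Assumed Hugging Face model ID for the example model named above.
command = build_gradio_command("stabilityai/stable-audio-open-1.0")
print(shlex.join(command))

# To actually launch the interface from the repository root, one could run:
#   import subprocess; subprocess.run(command, check=True)
```

Because the model is gated, the download would only succeed after the license has been accepted on Hugging Face and the local machine is logged in with a token that has access.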

For people who want to train their own models or fine-tune an existing one, the library uses PyTorch Lightning to handle multi-GPU and multi-node training. Training is configured through JSON files that define the model architecture, the audio format (sample rate, mono vs stereo, clip length), and the training dataset. Datasets can come from a local folder of audio files or from cloud storage on Amazon S3. Training progress is logged to Weights & Biases, a service for tracking machine learning experiments, so an account there is required.
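To make the JSON configuration more concrete, here is a minimal sketch that builds two such configs as Python dictionaries and serializes them. The key names (`model_type`, `sample_rate`, `sample_size`, `audio_channels`, `dataset_type`, `datasets`, `random_crop`) are assumptions recalled from the repository's documentation, the values are illustrative only, and the empty `model` and `training` sections stand in for details not covered here.

```python
import json

# Illustrative model config: audio format plus placeholders for the architecture.
model_config = {
    "model_type": "diffusion_cond",  # assumed name for a conditioned diffusion model
    "sample_rate": 44100,            # samples per second
    "sample_size": 2097152,          # clip length in samples
    "audio_channels": 2,             # 1 = mono, 2 = stereo
    "model": {},                     # architecture details omitted in this sketch
    "training": {},                  # optimizer and demo settings omitted
}

# Illustrative dataset config pointing at a local folder of audio files.
dataset_config = {
    "dataset_type": "audio_dir",
    "datasets": [{"id": "my_audio", "path": "/path/to/audio/"}],
    "random_crop": True,
}


def clip_seconds(config):
    """Clip length in seconds, derived from sample count and sample rate."""
    return config["sample_size"] / config["sample_rate"]


model_json = json.dumps(model_config, indent=2)
dataset_json = json.dumps(dataset_config, indent=2)
print(model_json)
print(f"clip length: {clip_seconds(model_config):.2f} s")
```

The clip length is not stored in seconds: in this sketch it follows from dividing the number of samples by the sample rate, which is why both values appear in the config.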

One practical detail the README explains is the difference between "wrapped" and "unwrapped" model checkpoints. Checkpoints saved during training are wrapped: alongside the model weights they include optimizer state and other training-only data that bloat the file size. The repository includes a script that strips all of that out and saves a smaller, unwrapped file suitable for inference or fine-tuning.
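The idea behind unwrapping can be sketched on a toy checkpoint. This is not the repository's script: it only illustrates dropping training-only entries from a PyTorch Lightning style checkpoint dictionary, using plain lists in place of tensors so it runs without PyTorch. The `model.` prefix, the `aux_loss_net` module, and the `strip_training_state` helper are hypothetical.

```python
def strip_training_state(checkpoint, prefix="model."):
    """Keep only the weights stored under `prefix`, dropping everything else.

    `checkpoint` is assumed to follow the Lightning layout, where weights
    live under "state_dict" next to optimizer and scheduler state.
    """
    weights = {
        key[len(prefix):]: value
        for key, value in checkpoint["state_dict"].items()
        if key.startswith(prefix)
    }
    return {"state_dict": weights}


# Toy "wrapped" checkpoint; lists stand in for tensors.
wrapped = {
    "epoch": 3,
    "global_step": 12000,
    "state_dict": {
        "model.layer.weight": [0.1, 0.2],
        "aux_loss_net.weight": [0.3],  # hypothetical training-only module
    },
    "optimizer_states": [{"state": {}, "param_groups": []}],
    "lr_schedulers": [],
}

unwrapped = strip_training_state(wrapped)
print(sorted(unwrapped))               # ['state_dict']
print(list(unwrapped["state_dict"]))   # ['layer.weight']
```

With real checkpoints the dictionary would be loaded and saved with `torch.load` and `torch.save`, and most of the size reduction would come from dropping the optimizer state.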

The library requires PyTorch 2.5 or later and is developed against Python 3.10.
