gitmyhub

Kimi-Audio

Python ★ 4.7k updated 1y ago

Kimi-Audio, an open-source audio foundation model excelling in audio understanding, generation, and conversation

Kimi-Audio is an open-source AI model from Moonshot AI that can listen to audio and respond to it, either in text, in spoken audio, or both. The model handles a wide range of audio tasks in a single system: transcribing speech, answering questions about what it heard, describing sounds, detecting emotions in speech, classifying sounds and acoustic scenes, and carrying on back-and-forth spoken conversations.

The model was trained on over 13 million hours of audio data covering speech, music, and general sounds, as well as text data. This large training base allows it to reason about what it hears and understand language at the same time. The architecture combines a component that converts audio into numerical representations, a large language model core (based on Qwen 2.5 7B) that processes those representations along with text, and a component that converts generated audio tokens back into audible speech with low latency.

Two model versions are available on Hugging Face: Kimi-Audio-7B (the base pretrained model) and Kimi-Audio-7B-Instruct (the version tuned to follow instructions and hold conversations). The instruct version is what most users would interact with. The repository provides Python code for running the model, including examples for audio transcription and multi-turn spoken conversation. Installation is done through pip, and model weights are downloaded from Hugging Face.

Fine-tuning examples are also included for developers who want to adapt the model to specific domains or tasks. A separate evaluation toolkit called Kimi-Audio-Evalkit is published to let researchers reproduce the benchmark numbers reported in the technical paper.

The technical report is available on arXiv. The project is intended for research use and for developers building audio-understanding or voice-conversation applications.