HoliTok
HoliTok: A Continuous Holistic Tokenization with Robust Dual Capabilities of Speech Generation and Understanding
A research library from Shanghai Jiao Tong University that compresses 48 kHz audio into compact latent representations and extracts semantic speech features, for training speech generation models or building speech understanding systems.
HoliTok is a research library for converting audio into compact numerical representations and back again. It is designed for speech processing tasks: given an audio file, it encodes the audio into a compressed format called latents, and can reconstruct audio from those latents or extract higher-level features that capture the meaning of what was said.
The system uses a VAE (variational autoencoder), a type of model that learns to compress data into a smaller representation space. HoliTok operates on audio sampled at 48 kHz, a higher sample rate than most speech models use. Two pre-trained model variants are available: HoliTok-Base and HoliTok-Unite. Both download their weights automatically from Hugging Face on first use.
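To make the VAE idea concrete, here is a minimal sketch of how a variational encoder can turn input frames into smaller latent vectors: it predicts a mean and a log-variance per frame, then samples a latent from that distribution. This is a generic NumPy illustration with random, untrained weights. The dimensions and the function names `encode` and `sample_latent` are assumptions chosen for the example and do not reflect HoliTok's actual architecture or API.

```python
import numpy as np

rng = np.random.default_rng(0)

FRAME_DIM = 512   # hypothetical size of one input frame
LATENT_DIM = 32   # hypothetical size of the compressed latent

# Untrained, random projection weights standing in for a learned encoder.
W_mu = rng.standard_normal((FRAME_DIM, LATENT_DIM)) * 0.01
W_logvar = rng.standard_normal((FRAME_DIM, LATENT_DIM)) * 0.01


def encode(frames: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Map frames of shape (T, FRAME_DIM) to a mean and log-variance per frame."""
    return frames @ W_mu, frames @ W_logvar


def sample_latent(mu: np.ndarray, logvar: np.ndarray,
                  rng: np.random.Generator) -> np.ndarray:
    """Reparameterization trick: z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps


frames = rng.standard_normal((100, FRAME_DIM))  # 100 stand-in frames
mu, logvar = encode(frames)
latents = sample_latent(mu, logvar, rng)        # shape (100, LATENT_DIM)
```

Each 512-value frame is reduced to 32 values here. A real model would learn the encoder weights jointly with a decoder that maps latents back to audio.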
The library has three main operations. Encoding converts a .wav file into a latents file. Semantic feature extraction takes those latents and produces a 1536-dimensional feature vector per time step, intended to capture the content of speech rather than its acoustic details. Reconstruction takes the latents and produces a new .wav file. All three operations are available as Python API calls, command-line commands, or environment-variable-driven shell scripts for batch jobs.
Practical uses include training speech generation models (where you work with compressed audio representations rather than raw waveforms), building speech understanding systems (where the semantic features serve as input to a classifier or language model), or researching audio compression and reconstruction quality.
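As a rough illustration of the speech understanding use case, the sketch below feeds per-time-step semantic features into a simple classifier by averaging them over time and applying a linear layer. The feature array is random stand-in data with the 1536-dimensional shape described above. The classifier weights, the number of classes, and the `classify` function are hypothetical and are not part of HoliTok.

```python
import numpy as np

FEATURE_DIM = 1536  # per-time-step semantic feature size
NUM_CLASSES = 4     # hypothetical number of target classes

rng = np.random.default_rng(0)

# Stand-in for the semantic features of one utterance: (time_steps, 1536).
features = rng.standard_normal((250, FEATURE_DIM)).astype(np.float32)

# Untrained linear classifier head; in practice these weights would be learned.
W = rng.standard_normal((FEATURE_DIM, NUM_CLASSES)).astype(np.float32) * 0.01
b = np.zeros(NUM_CLASSES, dtype=np.float32)


def classify(features: np.ndarray) -> np.ndarray:
    """Mean-pool over time, apply a linear layer, and return class probabilities."""
    pooled = features.mean(axis=0)   # (FEATURE_DIM,)
    logits = pooled @ W + b          # (NUM_CLASSES,)
    logits = logits - logits.max()   # stabilize the softmax
    exp = np.exp(logits)
    return exp / exp.sum()


probs = classify(features)
```

Mean pooling discards temporal order, which is often acceptable for utterance-level labels. Tasks that depend on word order would more likely pass the full feature sequence to a sequence model.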
The library requires Python 3.10 or newer and PyTorch 2.8 with CUDA. It is published alongside a research paper on arXiv from a team at Shanghai Jiao Tong University, covering the dual capabilities of the tokenization approach for both generating and understanding speech.
Where it fits
- Compress audio files into compact latent representations to use as training data for a speech generation model.
- Extract 1536-dimensional semantic speech features from a recording to feed into a text classifier or language model.
- Reconstruct audio from stored latents to measure the quality of a compressed speech representation.
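
One simple way to put a number on reconstruction quality is the signal-to-noise ratio between the original waveform and the reconstructed one. The sketch below is a generic NumPy example on a synthetic test tone. The `snr_db` helper is hypothetical and not part of HoliTok, and the "reconstruction" is a scaled copy standing in for real decoder output.

```python
import numpy as np


def snr_db(reference: np.ndarray, reconstruction: np.ndarray) -> float:
    """Signal-to-noise ratio in dB of a reconstruction against its reference."""
    n = min(len(reference), len(reconstruction))  # tolerate small length mismatches
    ref = reference[:n].astype(np.float64)
    err = ref - reconstruction[:n].astype(np.float64)
    noise_power = np.sum(err ** 2)
    if noise_power == 0.0:
        return float("inf")                       # identical signals
    return float(10.0 * np.log10(np.sum(ref ** 2) / noise_power))


SAMPLE_RATE = 48_000
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE          # one second of audio
original = np.sin(2 * np.pi * 440.0 * t)          # 440 Hz test tone
reconstructed = 0.9 * original                    # stand-in for decoded audio

score = snr_db(original, reconstructed)           # about 20 dB for a 10% amplitude error
```

Waveform SNR is sensitive to small time shifts and phase differences, so a reconstruction can sound faithful and still score poorly. It is best treated as a quick sanity check alongside perceptual or spectral measures.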