gitmyhub

MERIT

Python ★ 26 updated 16d ago

A Python tool that compares two audio files across melody, rhythm, and timbre separately, returning three independent similarity scores instead of a single blended number.

Python · PyTorch · HuggingFace · setup: moderate · complexity 3/5

MERIT is a Python tool for comparing audio files by three separate musical qualities: melody, rhythm, and timbre. Most music similarity systems return a single number that blends all these qualities together, making it impossible to tell why two songs scored as similar or different. MERIT separates them into three independent scores, so you can ask targeted questions like "does this cover share the same melody?" or "does this remix keep the same drum feel?" and get distinct answers for each.

The system runs audio through a shared backbone model called MERT, which was pre-trained on large amounts of music and converts raw audio into numerical representations. MERIT then passes those representations through three small trained modules (one each for melody, rhythm, and timbre), each producing a 128-dimensional embedding. When you compare two audio clips, cosine similarity is computed separately on each module's embeddings, giving one score per quality between -1 and 1 that indicates how closely that particular quality matches.
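The per-factor comparison can be sketched in a few lines of plain Python. This is only an illustration of the cosine-similarity step, not MERIT's own code: the function names `cosine_similarity` and `compare_clips`, the dictionary layout, and the toy 4-dimensional vectors are all assumptions made for the example (the real embeddings have 128 dimensions and come from the trained modules).

```python
import math

# Hypothetical layout: one embedding per factor, keyed by factor name.
FACTORS = ("melody", "rhythm", "timbre")


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def compare_clips(emb_a, emb_b):
    """Compare two clips factor by factor, returning three independent scores."""
    return {f: cosine_similarity(emb_a[f], emb_b[f]) for f in FACTORS}


if __name__ == "__main__":
    # Toy stand-ins for real embeddings: same melody, different rhythm and timbre.
    original = {
        "melody": [1.0, 0.0, 0.0, 0.0],
        "rhythm": [0.0, 1.0, 0.0, 0.0],
        "timbre": [0.0, 0.0, 1.0, 0.0],
    }
    cover = {
        "melody": [1.0, 0.0, 0.0, 0.0],
        "rhythm": [0.0, 0.0, 1.0, 0.0],
        "timbre": [0.0, 0.0, 0.0, 1.0],
    }
    for factor, score in compare_clips(original, cover).items():
        print(f"{factor}: {score:.2f}")
```

Keeping the three scores in separate entries, rather than averaging them, is what preserves the ability to say which quality two clips share.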

A practical example from the README: if a solo piano plays a rock song note-for-note, the melody score will be high because the notes match, but the rhythm and timbre scores will be low because the piano phrasing and sound color differ from the original band. MERIT makes that distinction computable rather than subjective.

The pre-trained model weights are freely available on HuggingFace and total about 33 MB. Setting it up requires Python with PyTorch and a few related libraries. You download the three small projection heads, load any audio file, and get back three embedding vectors that you can compare against other songs. The README includes ready-to-run Python code for this workflow.

The training dataset, also available on HuggingFace, contains roughly 296,000 audio triplets where only one musical factor varies at a time, which is how the system learned to separate the three qualities. That dataset is for non-commercial use only. The model code itself is MIT licensed.
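To give a sense of how triplets can teach a module to isolate one quality, here is a minimal sketch of a standard triplet margin loss on cosine distance. The summary above does not state which loss MERIT actually uses, so this is a common technique shown for illustration only. The roles assumed for the three clips (an anchor, a "positive" that shares the factor being trained, and a "negative" that differs in it), the function names, and the margin value of 0.2 are all assumptions.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet margin loss (illustrative, not MERIT's confirmed objective).

    The loss is zero once the anchor is closer to the positive than to the
    negative by at least `margin` (a hypothetical value); otherwise it grows
    with the size of the violation.
    """
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)
```

Under this assumed setup, a melody module would be penalized whenever two clips with the same melody land farther apart in its embedding space than two clips with different melodies, regardless of how their rhythm or timbre compare.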

Where it fits