gitmyhub

annoy

C++ ★ 14k updated 7mo ago

Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk

Annoy is a Spotify-built library for fast approximate nearest neighbor search that finds similar items in a large dataset by comparing feature vectors, with indexes saved to disk and shared across many processes.

C++PythonGoLuasetup: easycomplexity 3/5

Annoy is a library built at Spotify to answer one specific question quickly: given a large collection of items described as lists of numbers, which items in that collection are most similar to a particular item? This kind of search is called nearest neighbor search, and it shows up in music recommendations, product suggestions, image matching, and many similar problems.

The library takes an approximate approach, meaning it trades a small amount of accuracy for a large gain in speed. You build an index from your data, and then at query time you can ask for the closest matches. The key feature that sets Annoy apart from similar tools is that it saves indexes to disk as static files and loads them back using a technique called memory mapping, which lets many different processes read the same index file at once without duplicating it in memory. This makes it practical in environments where you want to share a pre-built index across many workers, such as a batch processing cluster.

Building the index and searching it are treated as separate steps. Once an index is built and saved, you cannot add more items to it. This is a deliberate trade-off: keeping the index immutable makes it possible to memory-map it safely. You control accuracy versus speed with two parameters: the number of trees built at index time (more trees means better results and a larger file) and the number of nodes inspected at search time (more nodes means better results but slower queries).

Python is the primary supported language for most users, with the library available via pip. The underlying code is written in C++ and can be used directly from C++ as well. Bindings also exist for Go and Lua. Supported distance metrics include Euclidean, Manhattan, cosine, Hamming, and dot product. The README notes it works best with fewer than a few hundred dimensions, though it can handle up to around a thousand.

Where it fits