gitmyhub

omlx

Python ★ 17k updated 6h ago

LLM inference server with continuous batching & SSD caching for Apple Silicon — managed from the macOS menu bar

oMLX runs AI language models locally on Apple Silicon Macs with a persistent KV cache that survives restarts, so long sessions stay fast without sending data to external servers.

PythonMLXsetup: moderatecomplexity 3/5

oMLX is a program for running large language models directly on Apple Silicon Macs, the M1 through M4 chips. A large language model is the kind of AI that powers chat assistants and coding tools. Instead of sending your text to a company's servers, oMLX runs the model on your own machine and answers requests locally. You manage it from the macOS menu bar or from a command line tool.

The main problem it tries to solve is reusing past work. When an AI model reads a long conversation, it builds up internal data called a KV cache. oMLX keeps this cache in two places: a hot tier in fast memory and a cold tier on the SSD. When memory fills up, older pieces move to disk and get restored later instead of being recalculated, even after the server restarts. The goal is to make local models practical for real coding sessions with tools such as Claude Code.

It can serve several kinds of models at once: text models, vision models that read images, OCR models that read text from pictures, embedding models, and rerankers. Any app that expects an OpenAI-style connection can point at the local address and start using it. There is also a built-in chat page in the browser for talking to a loaded model directly.

oMLX includes an admin dashboard in the browser for watching activity in real time, loading or unloading models, running benchmarks, and changing per-model settings. It can pin frequently used models in memory, drop the least recently used ones when space runs low, and set an idle timeout per model. Settings can be changed without restarting the server.

Installation options include a downloadable Mac app with one-click updates, a Homebrew package that can run as a background service, or building from source. It requires macOS 15 or later, Python 3.10 or later, and an Apple Silicon chip. The project is shared under the Apache 2.0 license.

Where it fits