omlx

Python ★ 17k updated 6h ago

LLM inference server with continuous batching & SSD caching for Apple Silicon — managed from the macOS menu bar

oMLX runs AI language models locally on Apple Silicon Macs with a persistent KV cache that survives restarts, so long sessions stay fast without sending data to external servers.

PythonMLXsetup: moderatecomplexity 3/5

oMLX is a program for running large language models directly on Apple Silicon Macs, the M1 through M4 chips. A large language model is the kind of AI that powers chat assistants and coding tools. Instead of sending your text to a company's servers, oMLX runs the model on your own machine and answers requests locally. You manage it from the macOS menu bar or from a command line tool.

The main problem it tries to solve is reusing past work. When an AI model reads a long conversation, it builds up internal data called a KV cache. oMLX keeps this cache in two places: a hot tier in fast memory and a cold tier on the SSD. When memory fills up, older pieces move to disk and get restored later instead of being recalculated, even after the server restarts. The goal is to make local models practical for real coding sessions with tools such as Claude Code.

It can serve several kinds of models at once: text models, vision models that read images, OCR models that read text from pictures, embedding models, and rerankers. Any app that expects an OpenAI-style connection can point at the local address and start using it. There is also a built-in chat page in the browser for talking to a loaded model directly.

oMLX includes an admin dashboard in the browser for watching activity in real time, loading or unloading models, running benchmarks, and changing per-model settings. It can pin frequently used models in memory, drop the least recently used ones when space runs low, and set an idle timeout per model. Settings can be changed without restarting the server.

Installation options include a downloadable Mac app with one-click updates, a Homebrew package that can run as a background service, or building from source. It requires macOS 15 or later, Python 3.10 or later, and an Apple Silicon chip. The project is shared under the Apache 2.0 license.

Where it fits

Run an AI coding assistant locally on a Mac without sending your code to cloud servers.
Host an OpenAI-compatible local API endpoint so tools like Cursor or Claude Code use your own models.
Serve vision, OCR, and embedding models simultaneously from a single local server managed via a browser dashboard.
Manage and monitor multiple AI models from an admin dashboard without restarting the server.

Open on GitHub → Full breakdown on explaingit →