ds4
DeepSeek 4 Flash and PRO local inference engine for Metal, CUDA and ROCm
A dedicated local inference engine for running the DeepSeek V4 Flash language model on high-end personal hardware, with a disk-based key-value cache that enables million-token context windows.
DwarfStar 4 is a self-contained inference engine built specifically for running DeepSeek V4 Flash, a large AI language model, on local hardware. Unlike general-purpose runtimes that handle many different models, this project does exactly one thing: run this one model as correctly and efficiently as possible. It was created by antirez, the creator of Redis.
The project targets high-end personal machines. On macOS, it uses Metal, the graphics API built into Apple hardware, and requires a MacBook or Mac Studio with at least 96GB of RAM. On Linux, it supports NVIDIA CUDA with particular attention to the DGX Spark. AMD ROCm support exists on a separate branch maintained by community contributors. A CPU-only build is available for diagnostics but not for regular use.
One of the key design ideas builds on the fact that DeepSeek V4 Flash has a compressed key-value cache, the part of memory an AI model uses to keep track of earlier conversation context. The compressed cache is small enough that the project stores it on disk rather than in RAM, allowing very long context windows (up to 1 million tokens) on machines that would otherwise not have enough memory.
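The idea of keeping the cache in a file instead of RAM can be illustrated with a minimal sketch. This is not the engine's actual code or on-disk layout: the `DiskKVCache` class, the fixed-size record format, and the `BYTES_PER_TOKEN` value are all hypothetical, chosen only to show how a memory-mapped file lets the operating system page cache entries in and out on demand.

```python
import mmap
import os
import tempfile

# Hypothetical record size per token; NOT the engine's real figure.
BYTES_PER_TOKEN = 64


class DiskKVCache:
    """Illustrative cache: one fixed-size record per token, backed by a file."""

    def __init__(self, path, max_tokens, bytes_per_token=BYTES_PER_TOKEN):
        self.bytes_per_token = bytes_per_token
        self.max_tokens = max_tokens
        size = max_tokens * bytes_per_token
        # Create the backing file and grow it to its full size up front.
        self.file = open(path, "w+b")
        self.file.truncate(size)
        # Map the file; pages are loaded from disk only when touched.
        self.map = mmap.mmap(self.file.fileno(), size)

    def _offset(self, pos):
        if not 0 <= pos < self.max_tokens:
            raise IndexError("token position out of range")
        return pos * self.bytes_per_token

    def write(self, pos, record):
        if len(record) != self.bytes_per_token:
            raise ValueError("record has the wrong size")
        start = self._offset(pos)
        self.map[start:start + self.bytes_per_token] = record

    def read(self, pos):
        start = self._offset(pos)
        return bytes(self.map[start:start + self.bytes_per_token])

    def close(self):
        self.map.flush()  # push dirty pages to the file
        self.map.close()
        self.file.close()


# Usage: store and fetch the record for the last token position.
path = os.path.join(tempfile.mkdtemp(), "kv.bin")
cache = DiskKVCache(path, max_tokens=1000)
record = bytes(range(BYTES_PER_TOKEN))
cache.write(999, record)
restored = cache.read(999)
untouched = cache.read(0)
cache.close()
file_size = os.path.getsize(path)
```

With this layout the file size grows linearly with the context length, so resident memory depends on which positions are actually touched rather than on the full window.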
The README lists several reasons the authors consider this model worth a dedicated engine: it is fast due to fewer active parameters, its thinking process scales in length with problem difficulty (short for simple questions, longer for complex ones), and it works well with aggressive 2-bit quantization without major quality loss.
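To make "2-bit quantization" concrete, here is a deliberately simplified sketch: each weight is mapped to one of four levels and four codes are packed into a single byte. The function names and the single shared minimum and scale are assumptions for illustration; the GGUF files used by this engine have their own format, which this sketch does not reproduce.

```python
def quantize_2bit(values):
    """Map floats to 2-bit codes (0..3) using one shared minimum and scale."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 3 or 1.0  # fall back to 1.0 for constant input
    codes = [min(3, max(0, round((v - lo) / scale))) for v in values]
    packed = bytearray()
    for i in range(0, len(codes), 4):
        byte = 0
        # Pack up to four 2-bit codes into one byte, lowest bits first.
        for slot, code in enumerate(codes[i:i + 4]):
            byte |= code << (2 * slot)
        packed.append(byte)
    return bytes(packed), lo, scale


def dequantize_2bit(packed, lo, scale, count):
    """Recover approximate floats from packed 2-bit codes."""
    out = []
    for i in range(count):
        code = (packed[i // 4] >> (2 * (i % 4))) & 0b11
        out.append(lo + code * scale)
    return out


# Usage: eight weights shrink to two bytes plus the shared minimum and scale.
weights = [0.0, 1.0, 2.0, 3.0, 3.0, 0.0, 1.5, 2.0]
packed, lo, scale = quantize_2bit(weights)
restored = dequantize_2bit(packed, lo, scale, len(weights))
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each restored value differs from the original by at most half a quantization step, which is the precision cost that the model is reported to tolerate well.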
The project is in alpha state and was built with significant assistance from GPT-5.5. It only works with GGUF files produced specifically for this engine.
Where it fits
- Run the DeepSeek V4 Flash language model locally on a Mac with 96GB RAM for private, offline AI without cloud API costs.
- Use a local AI model with very long conversation memory by offloading the key-value cache to disk instead of RAM.
- Run an AI reasoning model on a local NVIDIA GPU without depending on cloud APIs or rate limits.