gitmyhub

llama.cpp

C++ · ★ 116k · updated 2h ago

LLM inference in C/C++

Run large language models locally on your computer or server using optimized C++ code, with no heavy dependencies or external APIs required.

C · C++ · CUDA · Metal · Vulkan · ARM NEON · HIP · setup: moderate · complexity 3/5

llama.cpp is a tool for running large language models (LLMs, the kind of AI that powers chat assistants) on your own machine instead of calling a cloud service. The project's stated goal is to enable LLM inference (the step where the model actually produces answers) with minimal setup and strong performance across a wide range of hardware, both locally and in the cloud.

Technically, it is a plain C and C++ implementation with no external dependencies. The README highlights that Apple Silicon is treated as a first-class target, with optimizations through ARM NEON, Accelerate and Metal; that x86 chips are accelerated through the AVX, AVX2, AVX512 and AMX instruction sets; and that RISC-V chips are also supported. NVIDIA GPUs are supported through custom CUDA kernels, AMD GPUs through HIP, and there are Vulkan and SYCL backends as well.

To make models small enough to fit on consumer hardware, the project supports integer quantization from 1.5-bit through 8-bit precision, which shrinks a model's memory footprint and speeds up inference at some cost in accuracy. It can also split work between CPU and GPU, so that models larger than your GPU memory can still run, just more slowly. A long list of model families is supported, including LLaMA, Mistral, Mixtral, Gemma, Qwen, Phi, DeepSeek and many more.

You would use llama.cpp if you want to run an open-weights chat model on your laptop or server without sending data to an external API, if you want to embed local model inference into your own application through its libllama library, or if you want an OpenAI-compatible API server you control via the bundled llama-server. It can be installed via brew, nix or winget, run from prebuilt binaries, used through Docker, or built from source.
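The quantization trade-off mentioned above can be made concrete with a small sketch. This is not llama.cpp's actual quantization code or file format. It is a generic symmetric integer quantizer, written in Python for brevity, and the function names `quantize_block`, `dequantize_block` and `approx_size_gb` are hypothetical. It assumes one shared scale per block of weights and ignores the storage taken by scales and metadata, so the size figures are rough lower bounds.

```python
def quantize_block(values, bits=8):
    """Map a block of floats to signed integers that share one scale factor.

    Illustrative only: a generic symmetric scheme, not llama.cpp's formats.
    """
    if bits < 2:
        raise ValueError("this sketch needs at least 2 bits per weight")
    qmax = 2 ** (bits - 1) - 1                 # 127 for 8 bits, 7 for 4 bits
    amax = max(abs(v) for v in values)
    scale = amax / qmax if amax > 0 else 1.0   # all-zero block: any scale works
    quants = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return scale, quants


def dequantize_block(scale, quants):
    """Recover approximate floats; the rounding error is the accuracy cost."""
    return [scale * q for q in quants]


def approx_size_gb(n_params, bits_per_weight):
    """Rough weight storage in GB, ignoring per-block scales and metadata."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.0]           # made-up example values
    scale, quants = quantize_block(weights, bits=8)
    restored = dequantize_block(scale, quants)
    print("max error:", max(abs(w - r) for w, r in zip(weights, restored)))
    for bits in (16, 8, 4):
        print(bits, "bits ->", approx_size_gb(7e9, bits), "GB")
```

Under these assumptions, a hypothetical 7-billion-parameter model needs roughly 14 GB for its weights at 16 bits each and about 3.5 GB at 4 bits. That difference is why quantized models fit on consumer hardware. The price is a per-weight rounding error of at most half the block's scale.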

Where it fits