llama-cpp-python

Python ★ 10k updated 23h ago

Python bindings for llama.cpp

A Python package that runs large language models entirely on your own machine using the llama.cpp engine, with an OpenAI-compatible API so existing code can switch to local, private models with minimal changes.

PythonC++CUDAMetalROCmVulkanOpenBLASsetup: moderatecomplexity 3/5

llama-cpp-python is a Python package that lets you run large language models locally on your own machine, without sending data to any external service. It works by wrapping llama.cpp, a popular C++ library for running AI language models efficiently. Once installed, you can load a model file and generate text completions entirely offline.

The package offers two levels of access. The low-level interface gives direct access to the underlying C library functions for developers who need fine-grained control. The high-level interface provides a Python API designed to look like OpenAI's API, so existing code written against OpenAI can often be pointed at a local model with minimal changes. It also integrates with LangChain and LlamaIndex for use in AI application frameworks.

A built-in web server mode starts a local HTTP server that speaks the OpenAI REST API format, which means any tool or application that can talk to OpenAI can be redirected to a locally-running model instead. The server supports function calling, multimodal (vision) inputs, and running multiple models behind one endpoint.

Installation is a single pip command, but it compiles llama.cpp from source during install, so a C compiler is required (gcc or clang on Linux/Mac, Visual Studio or MinGW on Windows). Pre-built wheels are available for CPU, NVIDIA CUDA, and Apple Silicon Metal to skip compilation. Other supported hardware acceleration backends include AMD ROCm, Vulkan, Intel SYCL, and OpenBLAS.

The package supports Python 3.8 and above and runs on Linux, macOS, and Windows. Documentation is hosted at readthedocs.io.

Where it fits

Run an AI language model locally on a laptop and query it with the same Python code you use for OpenAI.
Start a local HTTP server that any OpenAI-compatible tool or app can use instead of OpenAI, at zero API cost.
Integrate a local LLM into a LangChain or LlamaIndex app without sending data to an external service.
Run a multimodal vision model locally to analyze images without cloud costs or privacy concerns.

Open on GitHub → Full breakdown on explaingit →