OpenLLM

Python ★ 12k updated 5d ago

Run any open-source LLMs, such as DeepSeek and Llama, as OpenAI compatible API endpoint in the cloud.

OpenLLM lets you run open-source AI language models on your own hardware and serve them through an API that matches OpenAI's format, so any existing OpenAI-compatible app works with your self-hosted model without code changes.

PythonHugging FaceDockerKubernetesBentoMLBentoCloudsetup: hardcomplexity 4/5

OpenLLM is a Python tool that lets you run open-source language models on your own hardware and expose them through an API that matches the same format as OpenAI's API. The key idea is that software already built to work with OpenAI can point to your self-hosted model instead, with no code changes beyond swapping the server address.

To get started, you install the package with pip and run a single command such as "openllm serve llama3.2:1b". The tool fetches the model weights from Hugging Face, starts a local server at http://localhost:3000, and provides OpenAI-compatible endpoints right away. A built-in chat interface is available at the /chat URL, so you can test the model in a browser without writing any code.

OpenLLM does not store model weights itself. It downloads them from Hugging Face the first time you run a model. Some models require you to request access on Hugging Face and set an authentication token before they will download.

The supported model list runs from small models that fit on a consumer GPU with around 12GB of memory, such as Gemma at 2 billion parameters, up to very large ones requiring multiple high-end data center GPUs, such as DeepSeek R1 at 671 billion parameters. A companion GitHub repository tracks the full catalog, and you can add your own custom model repositories to extend what the tool can serve.

For production use, OpenLLM integrates with BentoCloud and supports packaging models as Docker containers or Kubernetes deployments. The project is built by BentoML and released under the Apache 2.0 license.

Where it fits

Run a local AI chat assistant using Llama or Gemma without paying OpenAI API fees.
Replace OpenAI API calls in an existing app by pointing it to your own self-hosted model server.
Package a model as a Docker container and deploy it to a cloud environment for production use.
Serve large open-source models on a GPU server and expose them to multiple team members via API.

Open on GitHub → Full breakdown on explaingit →