inference

Python ★ 9.4k updated 5h ago

Swap GPT for any LLM by changing a single line of code. Xinference lets you run open-source, speech, and multimodal models on cloud, on-prem, or your laptop — all through one unified, production-ready inference API.

Run open-source AI models on your own hardware with a single API that's compatible with OpenAI, so you can switch from paid cloud AI to a self-hosted model by changing one line of code.

PythonvLLMllama.cpppipsetup: moderatecomplexity 3/5

Xinference (short for Xorbits Inference) is a Python library that makes it straightforward to run open-source AI models on your own hardware, whether that is a laptop, a company server, or a cloud machine. The goal is to give you a single API that works the same way regardless of which model you pick or where you run it. If you are already using OpenAI's API in your application, switching to a locally hosted model can be done by changing one line of code, because Xinference exposes an OpenAI-compatible interface.

The library supports text generation models (the large language models you chat with), speech recognition, image generation, text embedding, and multimodal models that can process both text and images. It can run models using several different back-end engines, including vLLM and llama.cpp, and it can spread a single large model across multiple GPUs or machines when the model is too big for one device.

Installation is through pip, the standard Python package manager. Once installed, you can launch a server with a single command and then load models through a web interface, a command-line tool, or the API. The web UI shows which models are running, lets you start or stop them, and provides a built-in chat window for testing. Automatic batching groups multiple incoming requests together so the hardware is used more efficiently under load.

Xinference integrates with several popular AI application frameworks, including LangChain, LlamaIndex, Dify, and RAGFlow. These are tools that developers use to build chatbots, document question-answering systems, and other AI-powered products. Because Xinference handles the model serving layer, those frameworks can point to it instead of a paid cloud API.

An enterprise edition with additional support is available from the company behind the project. The open-source version is free and covers the core serving functionality described above.

Where it fits

Replace OpenAI API calls in your app with a locally hosted model by changing the base URL in your code.
Run image generation, speech recognition, or text embedding models on your own server without paying per call.
Spread a large AI model across multiple GPUs when it is too big to fit on a single device.
Point LangChain or LlamaIndex at your local Xinference server instead of a paid cloud API to build chatbots or document Q&A apps.

Open on GitHub → Full breakdown on explaingit →