Unified Model Serving Framework
🍱 Build model inference APIs and multi-model serving systems with any open-source or custom AI models. 👉 Join our forum!




What is BentoML?
BentoML is a Python library for building online serving systems optimized for AI apps and model inference.
- 🍱 Easily build APIs for Any AI/ML Model. Turn any model inference script into a REST API server with just a few lines of code and standard Python type hints.
- 🐳 Docker Containers made simple. No more dependency hell! Manage your environments, dependencies and model versions with a simple config file. BentoML automatically generates Docker images, ensures reproducibility, and simplifies how you deploy to different environments.
- 🧭 Maximize CPU/GPU utilization. Build high performance inference APIs leveraging built-in serving optimization features like dynamic batching, model parallelism, multi-stage pipeline and multi-model inference-graph orchestration.
- 👩‍💻 Fully customizable. Easily implement your own APIs or task queues, with custom business logic, model inference and multi-model composition. Supports any ML framework, modality, and inference runtime.
- 🚀 Ready for Production. Develop, run and debug locally. Seamlessly deploy to production with Docker containers or BentoCloud.
Getting started
Install BentoML:
# Requires Python≥3.9
pip install -U bentoml
Define APIs in a service.py file.
python
import bentoml

@bentoml.service(
    image=bentoml.images.Image(python_version="3.11").python_packages("torch", "transformers"),
)
class Summarization:
    def __init__(self) -> None:
        import torch
        from transformers import pipeline

        device = "cuda" if torch.cuda.is_available() else "cpu"
        self.pipeline = pipeline('summarization', device=device)

    @bentoml.api(batchable=True)
    def summarize(self, texts: list[str]) -> list[str]:
        results = self.pipeline(texts)
        return [item['summary_text'] for item in results]
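The `batchable=True` flag on `summarize` lets the server group inputs from several requests into one call. The following is a minimal pure-Python sketch of that idea under stated assumptions, not BentoML's implementation: `run_batched` and `fake_summarize` are hypothetical names used only for illustration.

```python
from typing import Callable


def run_batched(
    requests: list[list[str]],
    model_fn: Callable[[list[str]], list[str]],
) -> list[list[str]]:
    """Merge several per-request input lists into one model call,
    then split the outputs back per request (hypothetical helper)."""
    # Flatten all inputs into one batch, remembering each request's size.
    sizes = [len(r) for r in requests]
    merged = [text for r in requests for text in r]

    # One model invocation for the whole batch.
    outputs = model_fn(merged)
    if len(outputs) != len(merged):
        raise ValueError("model_fn must return one output per input")

    # Slice the outputs back into per-request chunks.
    results, start = [], 0
    for size in sizes:
        results.append(outputs[start:start + size])
        start += size
    return results


def fake_summarize(texts: list[str]) -> list[str]:
    # Stand-in for a real model: keep the first three words of each text.
    return [" ".join(t.split()[:3]) for t in texts]


if __name__ == "__main__":
    print(run_batched([["a b c d"], ["e f g h", "i j"]], fake_summarize))
```

This is why a batchable API takes a list and returns a list of the same length: the one-to-one mapping between inputs and outputs is what makes it possible to hand each caller its own slice of the results.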
💻 Run locally
Install the PyTorch and Transformers packages in your Python virtual environment.
bash
pip install torch transformers # additional dependencies for local run
Run the service code locally (serving at http://localhost:3000 by default):
bash
bentoml serve
You should expect to see the following output.
[INFO] [cli] Starting production HTTP BentoServer from "service:Summarization" listening on http://localhost:3000 (Press CTRL+C to quit)
[INFO] [entry_service:Summarization:1] Service Summarization initialized
Now you can run inference from your browser at http://localhost:3000 or with a Python script:
python
import bentoml

with bentoml.SyncHTTPClient('http://localhost:3000') as client:
    summarized_text: str = client.summarize([bentoml.__doc__])[0]
    print(f"Result: {summarized_text}")
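The same endpoint can also be called without the BentoML client. The sketch below uses only the standard library and assumes that the service exposes `summarize` as `POST /summarize` with a JSON body keyed by the parameter name `texts`. Check the route and schema in the browser UI at http://localhost:3000 before relying on it. `build_request` is a hypothetical helper.

```python
import json
import urllib.error
import urllib.request


def build_request(base_url: str, texts: list[str]) -> urllib.request.Request:
    """Build a JSON POST for the assumed /summarize route."""
    body = json.dumps({"texts": texts}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/summarize",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Needs the service started by `bentoml serve` to be running locally.
    req = build_request(
        "http://localhost:3000",
        ["BentoML is a Python library for building online serving systems."],
    )
    try:
        with urllib.request.urlopen(req, timeout=60) as resp:
            print(json.loads(resp.read()))
    except urllib.error.URLError as exc:
        print(f"Service not reachable: {exc.reason}")
```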
🐳 Deploy using Docker
Run bentoml build to package the necessary code, models, and dependency configs into a Bento, the standardized deployable artifact in BentoML:
bash
bentoml build
Ensure Docker is running. Generate a Docker container image for deployment:
bash
bentoml containerize summarization:latest
Run the generated image:
bash
docker run --rm -p 3000:3000 summarization:latest
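A container may take a while to load the model before it can answer requests. As a rough sketch, a script can poll the server until it responds. The `/readyz` path below is an assumption about the health route exposed on port 3000, and `http_ok` and `wait_until_ready` are hypothetical helpers. Adjust the URL to whatever your service actually serves.

```python
import time
import urllib.request
from typing import Callable


def http_ok(url: str) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection errors and non-2xx responses (URLError/HTTPError).
        return False


def wait_until_ready(
    url: str,
    probe: Callable[[str], bool] = http_ok,
    attempts: int = 30,
    delay: float = 1.0,
) -> bool:
    """Poll `url` until `probe` succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


if __name__ == "__main__":
    # The /readyz path is an assumption; adjust it to your service.
    ready = wait_until_ready("http://localhost:3000/readyz", attempts=3)
    print("ready" if ready else "not ready")
```

Passing the probe as a parameter keeps the retry loop testable without a running container.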
☁️ Deploy on BentoCloud
BentoCloud provides compute infrastructure for rapid and reliable GenAI adoption. It speeds up your BentoML development process by leveraging cloud compute resources, and simplifies how you deploy, scale, and operate BentoML in production.
Sign up for BentoCloud for personal access; for enterprise use cases, contact our team.
bash
# After signup, run the following command to create an API token:
bentoml cloud login
# Deploy from current directory:
bentoml deploy

For detailed explanations, read the Hello World example.
Examples
- LLMs: Llama 3.2, Mistral, DeepSeek Distil, and more.
- Image Generation: Stable Diffusion 3 Medium, Stable Video Diffusion, Stable Diffusion XL Turbo, ControlNet, and LCM LoRAs.
- Embeddings: SentenceTransformers and ColPali
- Audio: ChatTTS, XTTS, WhisperX, Bark
- Computer Vision: YOLO and ResNet
- Advanced examples: Function calling, LangGraph, CrewAI
Advanced topics
- Model composition
- Workers and model parallelization
- Adaptive batching
- GPU inference
- Distributed serving systems
- Concurrency and autoscaling
- Model loading and Model Store
- Observability
- BentoCloud deployment
Community
Get involved and join our Community Forum 💬, where thousands of AI/ML engineers help each other, contribute to the project, and talk about building AI products.
To report a bug or request a feature, use GitHub Issues.
Contributing
There are many ways to contribute to the project:
- Report bugs and give a thumbs-up to issues that are relevant to you.
- Investigate issues and review other developers' pull requests.
- Contribute code or documentation to the project by submitting a GitHub pull request.
- Check out the Contributing Guide and Development Guide to learn more.
- Share your feedback and discuss roadmap plans in our forum.
Usage tracking and feedback
The BentoML framework collects anonymous usage data that helps our community improve the product. Only BentoML's internal API calls are reported. This excludes any sensitive information, such as user code, model data, model names, or stack traces. Here's the code used for usage tracking. You can opt out of usage tracking with the --do-not-track CLI option:
bash
bentoml [command] --do-not-track
Or by setting the environment variable:
bash
export BENTOML_DO_NOT_TRACK=True
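When BentoML is driven from a Python entry point rather than a shell, the same variable can be set in-process. This sketch sets it before `bentoml` is imported so that it is in place before any tracking code could read it. That ordering is a conservative assumption, not documented behavior.

```python
import os

# Same variable as the shell export; set it before importing bentoml.
os.environ["BENTOML_DO_NOT_TRACK"] = "True"

# import bentoml  # import only after the variable is set
print(os.environ["BENTOML_DO_NOT_TRACK"])
```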
License
Repositories
- OpenLLM (Python, ★ 12k): Run any open-source LLMs, such as DeepSeek and Llama, as OpenAI compatible API endpoint in the cloud.
- BentoML (Python, ★ 8.7k): The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and more!
- Yatai ▣ (TypeScript, ★ 843): Model Deployment at Scale on Kubernetes 🦄️
- BentoDiffusion (Python, ★ 387): BentoDiffusion: A collection of diffusion models served with BentoML
- llm-inference-handbook (TypeScript, ★ 294): Everything you need to know about LLM inference
- comfy-pack ▣ (Python, ★ 219): A comprehensive toolkit for reliably locking, packing and deploying environments for ComfyUI workflows.
- stable-diffusion-server ▣ (Python, ★ 201): Deploy Your Own Stable Diffusion Service
- llm-optimizer (Python, ★ 192): Benchmark and optimize LLM inference across frameworks with ease
- bentoctl ▣ (Python, ★ 175): Fast model deployment on any cloud 🚀
- BentoVLLM (Python, ★ 169): Self-host LLMs with vLLM and BentoML
- gallery (★ 143): BentoML Example Projects 🎨
- BentoVoiceAgent (Python, ★ 124): Build Phone Calling Voice Agent fully powered by open source models.
- openllm-models (Python, ★ 66)
- CLIP-API-service (Jupyter Notebook, ★ 66): CLIP as a service - Embed image and sentences, object recognition, visual reasoning, image classification and reverse image search
- BentoOCR (Python, ★ 60): Turn any OCR models into online inference API endpoint 🚀 🌖
- llm-bench ▣ (Python, ★ 56)
- rag-tutorials (Python, ★ 48): a series of tutorials implementing rag service with BentoML and LlamaIndex
- transformers-nlp-service ▣ (Python, ★ 45): Online Inference API for NLP Transformer models - summarization, text classification, sentiment analysis and more
- BentoChatTTS (Python, ★ 29)
- BentoColPali (Python, ★ 25)
- BentoLMDeploy (Python, ★ 22): Self-host LLMs with LMDeploy and BentoML
- BentoCrewAI (Python, ★ 22): Serving CrewAI Agent as REST API with BentoML, optionally with self-host open-source LLMs
- simple_di (Python, ★ 21): Simple dependency injection framework for Python
- yatai-image-builder (Go, ★ 19): 🐳 Build OCI images for Bentos in k8s
- BentoWhisperX (Python, ★ 18)
- Fraud-Detection-Model-Serving ▣ (Jupyter Notebook, ★ 18): Online model serving with Fraud Detection model trained with XGBoost on IEEE-CIS dataset
- google-cloud-run-deploy ▣ (Python, ★ 17): Fast model deployment on Google Cloud Run
- yatai-deployment (Go, ★ 17): 🚀 Launching Bento in a Kubernetes cluster
- aws-sagemaker-deploy ▣ (Python, ★ 16): Fast model deployment on AWS Sagemaker
- Distributed-Visual-ChatGPT ⑂ (Python, ★ 16): Scalable Visual-ChatGPT deployment on Kubernetes - Distributed multi-model inference graph powered by BentoML
- aws-lambda-deploy ▣ (Python, ★ 15): Fast model deployment on AWS Lambda
- BentoSentenceTransformers (Python, ★ 15): how to build a sentence embedding application using BentoML
- sentence-embedding-bento ▣ (Jupyter Notebook, ★ 15): Sentence Embedding as a Service
- BentoTwilioConversationRelay (Python, ★ 14)
- aws-ec2-deploy ▣ (Python, ★ 14): Fast model deployment on AWS EC2
- BentoLangGraph (Python, ★ 13): Serving LangGraph Agent as REST API with BentoML, optionally with self-host open-source LLMs
- IF-multi-GPUs-demo ▣ (Python, ★ 13)
- BentoXTTS (Python, ★ 12): how to build an text-to-speech application using BentoML
- quickstart (Python, ★ 11): BentoML Quickstart Example
- diffusers-examples ▣ (Python, ★ 11): API serving for your diffusers models
- BentoTRTLLM (Python, ★ 10)
- BentoBark (Python, ★ 9)
- BentoFunctionCalling (Python, ★ 9)
- BentoCLIP (Python, ★ 9): building a CLIP application using BentoML
- llm-router (Python, ★ 8): Multi-LLM Routing API Endpoint with BentoML
- BentoXTTSStreaming (Python, ★ 8): xtts with streaming endpoint
- Pneumonia-Detection-Demo (Python, ★ 8): Pneumonia Detection - Healthcare Imaging Application built with BentoML and fine-tuned Vision Transformer (ViT) model
- yatai-chart ▣ (Mustache, ★ 7): Helm Chart for installing Yatai on Kubernetes ⎈
- BentoInfinity ▣ (Python, ★ 6)
- BentoMLflow (Python, ★ 6)
- plugins ▣ (Starlark, ★ 6): the swish knife to all things bentoml.
- heroku-deploy ▣ (Python, ★ 6): Deploy BentoML bundled models to Heroku
- BentoBLIP (Python, ★ 5): how to build an image captioning application on top of a BLIP model with BentoML
- BentoResnet (Python, ★ 5)
- google-compute-engine-deploy ▣ (HCL, ★ 5)
- containerize-push-action ▣ (TypeScript, ★ 4): docker's build-and-push-action equivalent for bentoml
- BentoMLCLLM (Python, ★ 3)
- BentoShield (Python, ★ 3)
- BentoBurr ▣ (Python, ★ 3): This repository shows how to deploy a Burr Application with BentoML.
- BentoTriton (Python, ★ 3)
- bentoml-arize-fraud-detection-workshop (Jupyter Notebook, ★ 3)
- azure-container-instances-deploy ▣ (Python, ★ 3): Fast model deployment on Azure container instances
- azure-functions-deploy ▣ (Python, ★ 3): Fast model deployment on Azure Functions
- deploy-bento-action ▣ (★ 2): A GitHub Action to deploy bento to cloud
- helm-charts (★ 2)
- bentocloud-homepage-news (★ 2)
- BentoTGI (Python, ★ 2)
- XGBoostDemo (Python, ★ 2)
- BentoMoshi (Python, ★ 2)
- LLMGateway (Python, ★ 2)
- BentoMoirai (Python, ★ 2)
- BentoSGLang (Python, ★ 1)
- yatai-schemas (Go, ★ 1)
- BentoProphet (Python, ★ 1): BentoML with Facebook Prophet
- yatai-common (Go, ★ 1)
- .github (★ 1): ✨🍱🦄️
- bentoml-unsloth ▣ (Python, ★ 1): BentoML Unsloth integration
- openllm-benchmark ▣ (Python, ★ 1)
- BentoXGBoost (Python, ★ 1)
- kantoku ⑂ (Python, ★ 1): A Process & Socket Manager built with zmq
- bentoml-comfyui ▣ (Python, ★ 1): BentoML extension for ComfyUI
- BentoQueue (Python, ★ 1)
- grafana-operator ⑂ (Go, ★ 1): An operator for Grafana that installs and manages Grafana instances, Dashboards and Datasources through Kubernetes/OpenShift CRs
- openai_emulator (Python, ★ 0)
- mise-nebius (Shell, ★ 0)
- byoc (Shell, ★ 0)
- pipelines-workflows ⑂ (★ 0): Gruntwork Pipelines for BentoML
- lago-go-client ⑂ (Go, ★ 0): Lago Go Client
- sglang ⑂ (★ 0): SGLang is a fast serving framework for large language models and vision language models.
- BentoIris ▣ (Python, ★ 0): how to build an Iris classification application using BentoML
- bentocloud-configuration ▣ (★ 0)
- workshops ▣ (Jupyter Notebook, ★ 0)
- build-bento-action ▣ (Shell, ★ 0): Build your Bento on GitHub Action
- yatai-homepage-news ▣ (★ 0)
- bentoml-feedstock ⑂ ▣ (★ 0): A conda-smithy repository for bentoml.
- BentoSpacy (Python, ★ 0)
- BentoGradio (Python, ★ 0)
- BentoTinyCudaNN (Python, ★ 0)
- BentoResnetTensorFlow (Python, ★ 0)
- bentocloud-cicd-example (Python, ★ 0)
- BentoLlamaCpp (Python, ★ 0): BentoML + llama.cpp
- BentoDeepSpeedMII ▣ (Python, ★ 0)
- s5cmd ⑂ (Go, ★ 0): Parallel S3 and local filesystem execution tool.
- asynq ⑂ (Go, ★ 0): Simple, reliable, and efficient distributed task queue in Go
- unsloth ⑂ (Python, ★ 0): Finetune Llama 3.1, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
- chatgpt-lite ⑂ (TypeScript, ★ 0): Fast ChatGPT UI with support for both OpenAI and Azure OpenAI.
- terraform-azure-modules ⑂ (★ 0): Azure verified modules for Terraform
- gorm ⑂ (Go, ★ 0): The fantastic ORM library for Golang, aims to be developer friendly
- terraform-google-kubernetes-engine ⑂ (HCL, ★ 0): Configures opinionated GKE clusters
- papercups ⑂ (Elixir, ★ 0): Open-source live customer chat
- yatai-deployment-chart ▣ (Smarty, ★ 0)
- helm-charts-devel (★ 0)
- csi-driver-image-populator ⑂ (Go, ★ 0): CSI driver that uses a container image as a volume