<!--
Copyright 2018-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
#
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
* Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
* Neither the name of NVIDIA CORPORATION nor the names of its
contributors may be used to endorse or promote products derived
from this software without specific prior written permission.
#
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS `AS IS'' AND ANY
EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
-->

>[!WARNING]
>You are currently on the `main` branch which tracks under-development progress
>towards the next release. The current release is version 2.69.0
>and corresponds to the 26.05 container release on NVIDIA GPU Cloud (NGC).

Triton Inference Server
Triton Inference Server is open source inference serving software that
streamlines AI inferencing. Triton enables teams to deploy any AI model from
multiple deep learning and machine learning frameworks, including TensorRT,
PyTorch, ONNX, OpenVINO, Python, RAPIDS FIL, and more. Triton
Inference Server supports inference across cloud, data center, edge, and embedded
devices on NVIDIA GPUs, x86 and ARM CPUs, or AWS Inferentia. Triton Inference
Server delivers optimized performance for many query types, including real-time,
batched, ensembles, and audio/video streaming. Triton Inference Server is part of
NVIDIA AI Enterprise,
a software platform that accelerates the data science pipeline and streamlines
the development and deployment of production AI.
Major features include:
- Supports multiple deep learning frameworks
- Supports multiple machine learning frameworks
- [Concurrent model
- [Dynamic batching](docs/user_guide/batcher.md#dynamic-batcher)
- [Sequence batching](docs/user_guide/batcher.md#sequence-batcher) and
- Provides Backend API that
- Supports writing custom backends in Python, a.k.a.
- Model pipelines using
- [HTTP/REST and GRPC inference
- A [C API](docs/customization_guide/inprocess_c_api.md) and
- [Metrics](docs/user_guide/metrics.md) indicating GPU utilization, server
New to Triton Inference Server? Make use of
these tutorials
to begin your Triton journey!
Join the Triton and TensorRT community and
stay current on the latest product updates, bug fixes, content, best practices,
and more. Need enterprise support? NVIDIA global support is available for Triton
Inference Server with the
NVIDIA AI Enterprise software suite.
Serve a Model in 3 Easy Steps
```bash
# Step 1: Create the example model repository
git clone -b r26.05 https://github.com/triton-inference-server/server.git
cd server/docs/examples
./fetch_models.sh

# Step 2: Launch triton from the NGC Triton container
docker run --gpus=1 --rm --net=host -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:26.05-py3 tritonserver --model-repository=/models --model-control-mode explicit --load-model densenet_onnx

# Step 3: Sending an Inference Request
# In a separate console, launch the image_client example from the NGC Triton SDK container
docker run -it --rm --net=host nvcr.io/nvidia/tritonserver:26.05-py3-sdk /workspace/install/bin/image_client -m densenet_onnx -c 3 -s INCEPTION /workspace/images/mug.jpg

# Inference should return the following
Image '/workspace/images/mug.jpg':
    15.346230 (504) = COFFEE MUG
    13.224326 (968) = CUP
    10.422965 (505) = COFFEEPOT
```
Please read the [QuickStart](docs/getting_started/quickstart.md) guide for additional information
regarding this example. The quickstart guide also contains an example of how to launch Triton on [CPU-only systems](docs/getting_started/quickstart.md#run-on-cpu-only-system).

New to Triton and wondering where to get started? Watch the Getting Started video.
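
Between steps 2 and 3 above, it can help to confirm that the server has finished loading before a client sends a request. The sketch below polls the readiness endpoint of Triton's HTTP/REST API, `v2/health/ready`. It assumes the default HTTP port 8000 on `localhost`, which matches the `--net=host` launch above. The helper name `wait_for_triton` and the two environment variables are hypothetical, introduced only for illustration.

```bash
# Assumed defaults: Triton's HTTP endpoint on localhost:8000.
TRITON_HOST="${TRITON_HOST:-localhost}"
TRITON_HTTP_PORT="${TRITON_HTTP_PORT:-8000}"

# wait_for_triton is a hypothetical helper, not part of Triton.
# It polls the readiness endpoint once per second until curl reports
# success (HTTP 200) or the given number of attempts is used up.
wait_for_triton() {
  attempts="${1:-30}"
  url="http://${TRITON_HOST}:${TRITON_HTTP_PORT}/v2/health/ready"
  i=1
  while [ "${i}" -le "${attempts}" ]; do
    # --fail makes curl exit non-zero on HTTP errors such as 4xx/5xx.
    if curl --silent --fail --output /dev/null "${url}"; then
      echo "Triton is ready at ${url}"
      return 0
    fi
    sleep 1
    i=$((i + 1))
  done
  echo "Triton did not become ready at ${url}" >&2
  return 1
}
```

A typical call would be `wait_for_triton 60`, run in the second console before launching `image_client`. The function only defines the check; nothing contacts the server until it is called.
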
Examples and Tutorials
Check out NVIDIA LaunchPad
for free access to a set of hands-on labs with Triton Inference Server hosted on
NVIDIA infrastructure.
Specific end-to-end examples for popular models, such as ResNet, BERT, and DLRM,
are located in the
NVIDIA Deep Learning Examples
page on GitHub. The
NVIDIA Developer Zone
contains additional documentation, presentations, and examples.
Documentation
Build and Deploy
The recommended way to build and use Triton Inference Server is with Docker
images.
- [Install Triton Inference Server with Docker containers](docs/customization_guide/build.md#building-with-docker) (*Recommended*)
- [Install Triton Inference Server without Docker containers](docs/customization_guide/build.md#building-without-docker)
- [Build a custom Triton Inference Server Docker container](docs/customization_guide/compose.md)
- [Build Triton Inference Server from source](docs/customization_guide/build.md#building-on-unsupported-platforms)
- Examples for deploying Triton Inference Server with Kubernetes and Helm on [GCP](deploy/gcp/README.md),
- [Secure Deployment Considerations](docs/customization_guide/deploy.md)
Using Triton
Preparing Models for Triton Inference Server
The first step in using Triton to serve your models is to place one or
more models into a [model repository](docs/user_guide/model_repository.md). Depending on
the type of the model and on what Triton capabilities you want to enable for
the model, you may need to create a [model
configuration](docs/user_guide/model_configuration.md) for the model.
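
As a rough illustration of the paragraph above, the following sketch creates a minimal model repository on disk: one directory per model, a numbered version subdirectory holding the model file, and a `config.pbtxt` next to it. The model name `my_model`, the tensor names `INPUT0` and `OUTPUT0`, the dimensions, and the `REPO_DIR` variable are all hypothetical. The empty `model.onnx` is only a placeholder, so Triton would not load this repository until a real exported model replaces it.

```bash
set -eu

# Hypothetical location and model name; substitute your own.
REPO_DIR="${REPO_DIR:-$(mktemp -d)/model_repository}"
MODEL_NAME="my_model"

# Layout: <repository>/<model-name>/<version>/<model-file>
mkdir -p "${REPO_DIR}/${MODEL_NAME}/1"

# Placeholder file; a real deployment copies the exported model here.
: > "${REPO_DIR}/${MODEL_NAME}/1/model.onnx"

# Minimal model configuration. Tensor names, types, and dims are
# assumptions and must match the actual model being served.
cat > "${REPO_DIR}/${MODEL_NAME}/config.pbtxt" <<'EOF'
name: "my_model"
backend: "onnxruntime"
max_batch_size: 8
input [
  {
    name: "INPUT0"
    data_type: TYPE_FP32
    dims: [ 16 ]
  }
]
output [
  {
    name: "OUTPUT0"
    data_type: TYPE_FP32
    dims: [ 16 ]
  }
]
EOF

# Show the resulting files.
find "${REPO_DIR}" -type f | sort
```

The resulting directory is what `--model-repository` would point at, as in the `docker run` command from the quick-start steps.
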
- [Add custom operations to Triton if needed by your model](docs/user_guide/custom_operations.md)
- Enable model pipelining with [Model Ensemble](docs/user_guide/architecture.md#ensemble-models)
- Optimize your models by setting [scheduling and batching](docs/user_guide/architecture.md#models-and-schedulers)
- Use the Model Analyzer tool
- Learn how to [explicitly manage what models are available by loading and
Configure and Use Triton Inference Server
- Read the [Quick Start Guide](docs/getting_started/quickstart.md) to run Triton Inference
- Triton supports multiple execution engines, called
- Not all the above backends are supported on every platform supported by Triton.
- Learn how to [optimize performance](docs/user_guide/optimization.md) using the
- Learn how to [manage loading and unloading models](docs/user_guide/model_management.md) in
- Send requests directly to Triton with the [HTTP/REST JSON-based
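
The last item in the list above, sending requests directly over HTTP/REST with JSON, can be sketched as follows. The request is a POST to `v2/models/<model-name>/infer` whose body lists the input tensors with their `name`, `shape`, `datatype`, and `data`. The model name `my_model`, the tensor name `INPUT0`, the shape, the `TRITON_URL` variable, and both helper functions are hypothetical; the tensor details must match the configuration of whatever model is actually loaded.

```bash
# Assumed defaults: Triton's HTTP endpoint and a hypothetical model name.
TRITON_URL="${TRITON_URL:-http://localhost:8000}"
MODEL_NAME="${MODEL_NAME:-my_model}"

# build_infer_request (hypothetical helper) prints a JSON request body
# with a single FP32 input tensor of shape [1, 4].
build_infer_request() {
  cat <<'EOF'
{
  "inputs": [
    {
      "name": "INPUT0",
      "shape": [1, 4],
      "datatype": "FP32",
      "data": [1.0, 2.0, 3.0, 4.0]
    }
  ]
}
EOF
}

# send_infer_request (hypothetical helper) posts that body to the
# model's infer endpoint; curl reads the body from stdin via "@-".
send_infer_request() {
  build_infer_request | curl --silent --fail \
    --request POST \
    --header "Content-Type: application/json" \
    --data-binary @- \
    "${TRITON_URL}/v2/models/${MODEL_NAME}/infer"
}
```

Calling `send_infer_request` against a running server would print the JSON response, which carries the output tensors in an `outputs` array. For anything beyond quick experiments, the Python and C++ client libraries are likely the more convenient route.
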
Client Support and Examples
A Triton *client* application sends inference and other requests to Triton. The
Python and C++ client libraries
provide APIs to simplify this communication.
- Review client examples for C++,
- Configure HTTP
- Send input data (e.g. a jpeg image) directly to Triton in the body of an HTTP
Extend Triton
[Triton Inference Server's architecture](docs/user_guide/architecture.md) is specifically
designed for modularity and flexibility.
- [Customize Triton Inference Server container](docs/customization_guide/compose.md) for your use case
- Create custom backends
- Create [decoupled backends and models](docs/user_guide/decoupled_models.md) that can send
- Use a [Triton repository agent](docs/customization_guide/repository_agents.md) to add functionality
- Deploy Triton on [Jetson and JetPack](docs/user_guide/jetson.md)
- Use Triton on AWS
Additional Documentation
- [FAQ](docs/user_guide/faq.md)
- [User Guide](docs/README.md#user-guide)
- [Customization Guide](docs/README.md#customization-guide)
- Release Notes
- GPU, Driver, and CUDA Support
Contributing
Contributions to Triton Inference Server are more than welcome. To
contribute, please review the [contribution
guidelines](CONTRIBUTING.md). If you have a backend, client,
example, or similar contribution that does not modify the core of
Triton, then you should file a PR in the contrib
repo.
Reporting problems, asking questions
We appreciate any feedback, questions or bug reporting regarding this project.
When posting issues in GitHub,
follow the process outlined in the Stack Overflow document.
Ensure posted examples are:
- minimal – use as little code as possible that still produces the
  same problem
- complete – provide all parts needed to reproduce the problem. Check
  if you can strip external dependencies and still show the problem. The
  less time we spend on reproducing problems the more time we have to
  fix it
- verifiable – test the code you're about to provide to make sure it
  reproduces the problem. Remove all other problems that are not
  related to your request/question.
For issues, please use the provided bug report and feature request templates.
For questions, we recommend posting in our community
GitHub Discussions.
For more information
Please refer to the NVIDIA Developer Triton page
for more information.