NVIDIA Triton Inference Server


Triton Inference Server Release 25.04

The Triton Inference Server container image, release 25.04, is available on NGC and is open source on GitHub. Release notes can be found on the GitHub Release Page.
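
Once a server from this release is running, its availability can be confirmed from a client. The following is a minimal sketch using the Python tritonclient package, assuming the server was started from the usual NGC image tag (nvcr.io/nvidia/tritonserver:25.04-py3) with the default HTTP port 8000; adjust the URL for your deployment.

    # Minimal sketch: confirm a running Triton 25.04 server is reachable.
    # Assumes the default HTTP endpoint (localhost:8000).
    import tritonclient.http as httpclient

    client = httpclient.InferenceServerClient(url="localhost:8000")

    # Both calls return booleans; the server is usable once both are True.
    print("live: ", client.is_server_live())
    print("ready:", client.is_server_ready())

    # Server metadata includes the Triton version shipped in the container.
    print(client.get_server_metadata())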


Copyright © 2018-2025, NVIDIA Corporation.