
Performance Optimization

Optimize NVIDIA Triton Inference Server for maximum throughput and minimum latency. This guide covers various optimization techniques and best practices.

Performance Analysis

Using perf_analyzer

The perf_analyzer tool, shipped in the Triton client SDK container, helps identify throughput and latency bottlenecks.

Basic Usage

# Run from SDK container
docker run -it --rm --net=host \
  nvcr.io/nvidia/tritonserver:24.10-py3-sdk \
  perf_analyzer \
    -m resnet50 \
    -u localhost:8000 \
    --concurrency-range 1:8:2 \
    --shape input:1,3,224,224

Advanced Analysis

perf_analyzer \
  -m resnet50 \
  -u localhost:8000 \
  --measurement-interval 10000 \
  --concurrency-range 1:16 \
  --percentile=95 \
  --input-data random \
  --shape input:3,224,224 \
  -i grpc \
  --streaming \
  --collect-metrics

Output Analysis

Concurrency: 1, throughput: 1234.5 infer/sec, latency 809 usec
Concurrency: 2, throughput: 2345.6 infer/sec, latency 852 usec
Concurrency: 4, throughput: 3456.7 infer/sec, latency 1157 usec
Concurrency: 8, throughput: 4123.4 infer/sec, latency 1940 usec

Key metrics:

  • Throughput: Inferences completed per second
  • Latency: End-to-end request time, reported at p50, p95, and p99
  • Queue Time: Time a request waits in the server queue before execution
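To compare runs programmatically, the summary lines shown above can be parsed directly. A minimal Python sketch, assuming the output format matches the sample above:

import re
import subprocess

# Run perf_analyzer and pick the concurrency level with the highest throughput,
# parsing the "Concurrency: N, throughput: X infer/sec, latency Y usec" lines
proc = subprocess.run(
    ["perf_analyzer", "-m", "resnet50", "-u", "localhost:8000",
     "--concurrency-range", "1:8:2", "--shape", "input:1,3,224,224"],
    capture_output=True, text=True,
)

pattern = re.compile(
    r"Concurrency: (\d+), throughput: ([\d.]+) infer/sec, latency (\d+) usec"
)
rows = [(int(c), float(t), int(l)) for c, t, l in pattern.findall(proc.stdout)]

for conc, thr, lat in rows:
    print(f"concurrency={conc:>3}  throughput={thr:8.1f} infer/sec  latency={lat} usec")

best = max(rows, key=lambda r: r[1])
print(f"Best throughput at concurrency {best[0]}")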

Model Analyzer

Model Analyzer sweeps candidate model configurations (instance counts, dynamic batching settings) and reports the best-performing ones.

Installation

pip install triton-model-analyzer

Profile Model

model-analyzer profile \
  --model-repository /models \
  --profile-models resnet50 \
  --triton-launch-mode docker \
  --output-model-repository-path /output/models \
  --export-path profile_results

Generate Report

model-analyzer report \
  --report-model-configs resnet50_config_0,resnet50_config_1 \
  --export-path profile_results

Dynamic Batching Optimization

Configuration

dynamic_batching {
  preferred_batch_size: [ 4, 8, 16 ]
  max_queue_delay_microseconds: 100
  preserve_ordering: false
  priority_levels: 2
  default_priority_level: 0

  default_queue_policy {
    timeout_action: REJECT
    default_timeout_microseconds: 10000
    max_queue_size: 100
  }

  priority_queue_policy {
    key: 1
    value {
      timeout_action: DELAY
      default_timeout_microseconds: 5000
      max_queue_size: 50
    }
  }
}
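With priority_levels enabled, individual requests can be tagged from the client. A sketch using the Python HTTP client's priority argument (the model name, the "input" tensor name, and the server address are placeholders reused from the other examples):

import numpy as np
import tritonclient.http as httpclient

# Send one request at priority level 1, which the config above routes through
# the DELAY queue policy; requests without an explicit priority fall back to
# default_priority_level
client = httpclient.InferenceServerClient("localhost:8000")

data = np.random.randn(1, 3, 224, 224).astype(np.float32)
inp = httpclient.InferInput("input", data.shape, "FP32")
inp.set_data_from_numpy(data)

result = client.infer("model", inputs=[inp], priority=1)
print(result.as_numpy("output").shape)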

Tuning Parameters

max_queue_delay_microseconds: Balance between latency and throughput (a sweep sketch follows the batch-size loop below)

  • Lower (10-50µs): Lower latency, lower throughput
  • Higher (100-500µs): Higher throughput at the cost of latency

preferred_batch_size: Model-specific optimal sizes

# Find optimal batch size
for bs in 1 2 4 8 16 32; do
  perf_analyzer -m model --batch-size $bs
done
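The same sweep idea applies to max_queue_delay_microseconds. A rough sketch that rewrites the config and measures each setting (the config path and reload mechanism are assumptions; adjust them to your repository layout and model-control mode):

import re
import subprocess
from pathlib import Path

# Sweep the queue delay, assuming Triton reloads the model when config.pbtxt
# changes (e.g. --model-control-mode=poll)
config_path = Path("/models/model/config.pbtxt")

for delay_us in (10, 50, 100, 250, 500):
    text = config_path.read_text()
    text = re.sub(r"max_queue_delay_microseconds: \d+",
                  f"max_queue_delay_microseconds: {delay_us}", text)
    config_path.write_text(text)
    # ... wait for Triton to pick up the new config ...

    out = subprocess.run(
        ["perf_analyzer", "-m", "model", "--concurrency-range", "8"],
        capture_output=True, text=True,
    ).stdout
    summary = [line for line in out.splitlines() if "throughput" in line]
    print(f"delay={delay_us}us -> {summary}")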

Instance Optimization

GPU Instance Configuration

instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

Finding Optimal Instance Count

import re
import subprocess

results = []
for count in range(1, 9):
    # Build an updated instance_group stanza for this instance count
    config = f"""
instance_group [
  {{ count: {count}, kind: KIND_GPU, gpus: [ 0 ] }}
]
"""
    # (Write `config` into the model's config.pbtxt and reload the model
    #  before measuring, e.g. via the model control API.)

    # Run perf_analyzer against the reloaded model
    result = subprocess.run([
        "perf_analyzer", "-m", "model",
        "--concurrency-range", "1:32"
    ], capture_output=True, text=True)

    # Parse the reported throughput values out of the summary lines
    throughputs = [float(t) for t in re.findall(r"throughput: ([\d.]+)", result.stdout)]
    results.append({"count": count, "max_throughput": max(throughputs, default=0.0)})

# Analyze results: highest sustained throughput wins
best = max(results, key=lambda r: r["max_throughput"])
print(f"Best instance count: {best['count']}")

Multi-GPU Configuration

# Strategy 1: Replicas on each GPU
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0, 1 ]  # 2 instances on each GPU = 4 total
  }
]

# Strategy 2: Dedicate this model to a single GPU
# (give other models their own instance_group on the remaining GPUs)
instance_group [
  {
    count: 4
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
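On hosts with many GPUs, the gpus list can be generated instead of hard-coded. A small sketch (assumes nvidia-smi is on the PATH):

import subprocess

# Build an instance_group stanza covering every GPU visible to nvidia-smi,
# with `count` instances per listed GPU
def instance_group_for_all_gpus(count=2):
    gpus = subprocess.run(
        ["nvidia-smi", "--list-gpus"], capture_output=True, text=True
    ).stdout.strip().splitlines()
    ids = ", ".join(str(i) for i in range(len(gpus)))
    return (
        "instance_group [\n"
        f"  {{ count: {count}, kind: KIND_GPU, gpus: [ {ids} ] }}\n"
        "]\n"
    )

print(instance_group_for_all_gpus(count=2))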

TensorRT Optimization

Convert ONNX to TensorRT

import tensorrt as trt

# TensorRT 8.x-style builder API (TensorRT 10 replaces build_engine and
# max_workspace_size with build_serialized_network and memory-pool limits)
def build_engine(onnx_path, engine_path, precision='fp16'):
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, logger)

    with open(onnx_path, 'rb') as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("Failed to parse ONNX model")

    config = builder.create_builder_config()
    config.max_workspace_size = 1 << 30  # 1GB

    if precision == 'fp16':
        config.set_flag(trt.BuilderFlag.FP16)
    elif precision == 'int8':
        config.set_flag(trt.BuilderFlag.INT8)

    engine = builder.build_engine(network, config)

    with open(engine_path, 'wb') as f:
        f.write(engine.serialize())

build_engine('model.onnx', 'model.plan', 'fp16')
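Before copying model.plan into the model repository, it is worth checking that the serialized engine deserializes cleanly. A quick sketch:

import tensorrt as trt

# Deserialize the freshly built engine as a sanity check before deploying it
# (e.g. to models/trt_model/1/model.plan)
logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open("model.plan", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

assert engine is not None, "engine failed to deserialize"
print("Engine deserialized successfully")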

TensorRT Configuration

name: "trt_model"
platform: "tensorrt_plan"
max_batch_size: 16

input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]

optimization {
  cuda {
    graphs: true
    graph_spec {
      batch_size: 1
      input {
        key: "input"
        value: {
          dim: [ 3, 224, 224 ]
        }
      }
    }
  }

  execution_accelerators {
    gpu_execution_accelerator : [
      {
        name : "tensorrt"
        parameters {
          key: "precision_mode"
          value: "FP16"
        }
        parameters {
          key: "max_workspace_size_bytes"
          value: "1073741824"
        }
      }
    ]
  }
}

Memory Optimization

GPU Memory Management

# Limit GPU memory per instance
parameters {
  key: "gpu_memory_fraction"
  value: { string_value: "0.5" }
}

# Allow memory growth (TensorFlow)
parameters {
  key: "allow_gpu_memory_growth"
  value: { string_value: "true" }
}

Model Instance Pinning

instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
    passive: false

    # Associate the instance with a host policy defined on the server
    # command line (e.g. --host-policy=gpus_0,numa-node=0)
    host_policy: "gpus_0"
  }
]

Shared Memory

Use shared memory for zero-copy data transfer:

Client side:

import numpy as np
import tritonclient.http as httpclient
import tritonclient.utils.shared_memory as shm

triton_client = httpclient.InferenceServerClient("localhost:8000")

# Input tensor to place in shared memory
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)
input_byte_size = input_data.size * input_data.itemsize

# Clean up any regions left over from a previous run
triton_client.unregister_system_shared_memory()
triton_client.unregister_cuda_shared_memory()

# Create the system shared memory region, copy the data into it,
# and register it with the server
shm_handle = shm.create_shared_memory_region(
    "input_data", "/input_shm", input_byte_size
)
shm.set_shared_memory_region(shm_handle, [input_data])

triton_client.register_system_shared_memory(
    "input_data", "/input_shm", input_byte_size
)

# Reference the shared memory region instead of sending the tensor bytes
inputs = httpclient.InferInput("input", input_data.shape, "FP32")
inputs.set_shared_memory("input_data", input_byte_size)

results = triton_client.infer("model", inputs=[inputs])
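When the region is no longer needed, unregister it on the server and release it on the client. A short follow-up to the example above:

# Release the shared memory region created above once inference is done
triton_client.unregister_system_shared_memory("input_data")
shm.destroy_shared_memory_region(shm_handle)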

Backend Optimization

ONNX Runtime

optimization {
  execution_accelerators {
    cpu_execution_accelerator : [
      {
        name : "openvino"
      }
    ]
    gpu_execution_accelerator : [
      {
        name : "tensorrt"
        parameters { key: "precision_mode" value: "FP16" }
      }
    ]
  }
}

# ONNX Runtime session options
parameters {
  key: "intra_op_thread_count"
  value: { string_value: "8" }
}
parameters {
  key: "inter_op_thread_count"
  value: { string_value: "2" }
}
parameters {
  key: "execution_mode"
  value: { string_value: "1" }  # 1 = parallel, 0 = sequential
}

PyTorch

# Enable the nvFuser JIT fuser (PyTorch backend)
parameters {
  key: "ENABLE_NVFUSER"
  value: { string_value: "true" }
}

# Thread configuration
parameters {
  key: "INTRA_OP_THREAD_COUNT"
  value: { string_value: "8" }
}

TensorFlow

# GPU memory configuration
parameters {
  key: "gpu_memory_fraction"
  value: { string_value: "0.8" }
}

# XLA compilation
parameters {
  key: "TF_XLA_FLAGS"
  value: { string_value: "--tf_xla_auto_jit=2" }
}

Network Optimization

Protocol Selection

GRPC vs HTTP:

  • GRPC: Lower latency, streaming support
  • HTTP: Better compatibility, easier debugging

# GRPC client (faster)
import tritonclient.grpc as grpcclient
client = grpcclient.InferenceServerClient("localhost:8001")

# HTTP client (compatible)
import tritonclient.http as httpclient
client = httpclient.InferenceServerClient("localhost:8000")
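A quick way to see the difference on your own network is to time the same request over both protocols. A rough sketch (the model name and "input" tensor name are the placeholders used throughout this guide):

import time
import numpy as np
import tritonclient.http as httpclient
import tritonclient.grpc as grpcclient

data = np.random.randn(1, 3, 224, 224).astype(np.float32)

# Average client-observed latency over n sequential requests
def mean_latency_ms(module, url, n=100):
    client = module.InferenceServerClient(url)
    inp = module.InferInput("input", data.shape, "FP32")
    inp.set_data_from_numpy(data)
    start = time.perf_counter()
    for _ in range(n):
        client.infer("model", inputs=[inp])
    return (time.perf_counter() - start) / n * 1e3

print(f"HTTP : {mean_latency_ms(httpclient, 'localhost:8000'):.2f} ms/request")
print(f"gRPC : {mean_latency_ms(grpcclient, 'localhost:8001'):.2f} ms/request")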

Compression

Enable compression for large payloads:

# GRPC client-side compression (supported per request by tritonclient)
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient("localhost:8001")

# `inputs` prepared as in the earlier client examples
results = client.infer(
    "model",
    inputs=inputs,
    compression_algorithm="gzip",  # "deflate" is also supported
)

Request Batching on Client

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient("localhost:8000")

# Batch multiple samples
batch_size = 8
input_data = np.random.randn(batch_size, 3, 224, 224).astype(np.float32)

inputs = httpclient.InferInput("input", input_data.shape, "FP32")
inputs.set_data_from_numpy(input_data)

results = client.infer("model", inputs=[inputs])
output = results.as_numpy("output") # Shape: (8, num_classes)

Model Ensemble Optimization

Pipeline Parallelism

name: "ensemble_model"
platform: "ensemble"

ensemble_scheduling {
step [
{
model_name: "preprocessing"
model_version: -1
},
{
model_name: "inference"
model_version: -1
},
{
model_name: "postprocessing"
model_version: -1
}
]
}
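From the client's point of view the ensemble is a single model. A sketch of calling it (the tensor names "raw_input" and "output" are hypothetical and must match the ensemble's declared inputs and outputs):

import numpy as np
import tritonclient.http as httpclient

# One request drives the whole preprocessing -> inference -> postprocessing
# pipeline; intermediate tensors never leave the server
client = httpclient.InferenceServerClient("localhost:8000")

raw = np.fromfile("image.jpg", dtype=np.uint8).reshape(1, -1)
inp = httpclient.InferInput("raw_input", raw.shape, "UINT8")
inp.set_data_from_numpy(raw)

result = client.infer("ensemble_model", inputs=[inp])
print(result.as_numpy("output"))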

Parallel Execution

ensemble_scheduling {
  step [
    {
      model_name: "detector_1"
      model_version: -1
    },
    {
      model_name: "detector_2"
      model_version: -1
    }
  ]
  # Steps with no data dependency on each other execute in parallel
}

Quantization

INT8 Quantization

import torch
from torch.quantization import quantize_dynamic

# Dynamic quantization (post-training)
model_fp32 = MyModel()
model_int8 = quantize_dynamic(
    model_fp32,
    {torch.nn.Linear},
    dtype=torch.qint8
)

# Save quantized model
torch.jit.save(torch.jit.script(model_int8), "model_int8.pt")
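A quick sanity check before deploying is to compare the quantized model with the FP32 original on the same input. A sketch building on the snippet above (the 224x224 input shape is an assumption matching the other examples):

import torch

# Compare FP32 and dynamically quantized outputs on one random input
example = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    ref = model_fp32(example)
    out = model_int8(example)

print("max abs diff:", (ref - out).abs().max().item())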

Mixed Precision

optimization {
  execution_accelerators {
    gpu_execution_accelerator : [
      {
        name : "tensorrt"
        parameters {
          key: "precision_mode"
          value: "FP16"
        }
      }
    ]
  }
}

Caching

Response Cache

response_cache {
  enable: true
}

model_transaction_policy {
  decoupled: false
}

Response caching also requires a cache implementation to be enabled when the server starts (for example --cache-config local,size=104857600 in recent releases), and it is not supported for decoupled models.

Model Warmup

model_warmup [
  {
    name: "warmup_full_batch"
    batch_size: 8
    inputs {
      key: "input"
      value: {
        data_type: TYPE_FP32
        dims: [ 3, 224, 224 ]
        random_data: true
      }
    }
  },
  {
    name: "warmup_single"
    batch_size: 1
    inputs {
      key: "input"
      value: {
        data_type: TYPE_FP32
        dims: [ 3, 224, 224 ]
        zero_data: true
      }
    }
  }
]

Benchmarking Best Practices

1. Baseline Measurement

# Measure without optimization
perf_analyzer -m model --concurrency-range 1 > baseline.txt

2. Systematic Testing

#!/bin/bash
for instances in 1 2 4 8; do
  for batch_size in 1 2 4 8 16; do
    echo "Testing instances=$instances, batch_size=$batch_size"
    # Update config
    # Restart Triton
    perf_analyzer -m model --batch-size $batch_size >> results.txt
  done
done

3. Load Testing

import concurrent.futures
import time
import numpy as np
import tritonclient.http as httpclient

def send_request(client, model_name, input_data):
    inputs = httpclient.InferInput("input", input_data.shape, "FP32")
    inputs.set_data_from_numpy(input_data)
    return client.infer(model_name, inputs=[inputs])

# Simulate load
client = httpclient.InferenceServerClient("localhost:8000")
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)

num_requests = 1000
num_workers = 50

start = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as executor:
    futures = [
        executor.submit(send_request, client, "model", input_data)
        for _ in range(num_requests)
    ]
    results = [f.result() for f in concurrent.futures.as_completed(futures)]

duration = time.time() - start
throughput = num_requests / duration
print(f"Throughput: {throughput:.2f} req/s")

Monitoring Performance

Key Metrics to Track

# Request metrics
nv_inference_request_success
nv_inference_request_failure
nv_inference_request_duration_us

# Queue metrics
nv_inference_queue_duration_us
nv_inference_pending_request_count

# Execution metrics
nv_inference_compute_input_duration_us
nv_inference_compute_output_duration_us
nv_inference_compute_infer_duration_us

# Resource metrics
nv_gpu_utilization
nv_gpu_memory_used_bytes
nv_gpu_power_usage
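These metrics are exposed in Prometheus text format on the server's metrics port (8002 by default). A small sketch that pulls them without a Prometheus server:

import urllib.request

# Fetch the Prometheus exposition from Triton's metrics endpoint and print
# only the inference request and queue metrics
metrics = urllib.request.urlopen("http://localhost:8002/metrics").read().decode()

for line in metrics.splitlines():
    if line.startswith(("nv_inference_request_", "nv_inference_queue_duration_us")):
        print(line)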

Grafana Query Examples

# Average request latency (microseconds)
rate(nv_inference_request_duration_us[5m])
/
rate(nv_inference_request_success[5m])

# Throughput
sum(rate(nv_inference_request_success[5m]))

# GPU utilization
avg(nv_gpu_utilization{gpu_uuid=~".*"})

Performance Checklist

  • Profile model with perf_analyzer
  • Enable dynamic batching with optimal batch sizes
  • Set appropriate instance count (1-4 per GPU typically)
  • Use TensorRT for NVIDIA GPUs
  • Enable FP16/INT8 precision where possible
  • Configure CUDA graphs for small models
  • Use GRPC instead of HTTP for lower latency
  • Enable model warmup
  • Monitor GPU utilization (target: >80%)
  • Optimize queue delay vs batch size trade-off
  • Use shared memory for large inputs
  • Profile end-to-end latency including preprocessing
  • Test under realistic load patterns
  • Set up alerts for performance degradation

Next Steps