Performance Optimization
Optimize NVIDIA Triton Inference Server for maximum throughput and minimum latency. This guide covers various optimization techniques and best practices.
Performance Analysis
Using perf_analyzer
perf_analyzer, shipped with the Triton client SDK, generates load against a running server and reports throughput and latency, which helps identify bottlenecks.
Basic Usage
# Run from SDK container
docker run -it --rm --net=host \
nvcr.io/nvidia/tritonserver:24.10-py3-sdk \
perf_analyzer \
-m resnet50 \
-u localhost:8000 \
--concurrency-range 1:8:2 \
--shape input:3,224,224   # shape excludes the batch dimension
Advanced Analysis
perf_analyzer \
-m resnet50 \
-u localhost:8000 \
--measurement-interval 10000 \
--concurrency-range 1:16 \
--percentile=95 \
--input-data random \
--shape input:3,224,224 \
-i grpc \
--streaming \
--collect-metrics
Output Analysis
Concurrency: 1, throughput: 1234.5 infer/sec, latency 809 usec
Concurrency: 2, throughput: 2345.6 infer/sec, latency 852 usec
Concurrency: 4, throughput: 3456.7 infer/sec, latency 1157 usec
Concurrency: 8, throughput: 4123.4 infer/sec, latency 1940 usec
Key metrics:
- Throughput: Requests per second
- Latency: Time from request to response (p50, p95, p99)
- Queue Time: Time spent waiting in queue
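For repeatable comparisons, perf_analyzer can also export per-concurrency results to a CSV file with the -f flag. Below is a minimal sketch for reading such a file; the column names are assumptions based on recent perf_analyzer versions and may differ slightly, so check the header of your export.
# Parse a CSV produced with: perf_analyzer -m resnet50 -f results.csv ...
import csv

with open("results.csv", newline="") as f:
    for row in csv.DictReader(f):
        # "Concurrency", "Inferences/Second", "p95 latency" are assumed column names
        print(f"concurrency={row['Concurrency']}, "
              f"throughput={row['Inferences/Second']} infer/sec, "
              f"p95={row.get('p95 latency', 'n/a')} usec")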
Model Analyzer
A profiling tool that automatically sweeps model configurations (instance counts, batch sizes, dynamic batching settings) and reports the best-performing ones.
Installation
pip install triton-model-analyzer
Profile Model
model-analyzer profile \
--model-repository /models \
--profile-models resnet50 \
--triton-launch-mode docker \
--output-model-repository-path /output/models \
--export-path profile_results
Generate Report
model-analyzer report \
--report-model-configs resnet50_config_0,resnet50_config_1 \
--export-path profile_results
Dynamic Batching Optimization
Configuration
dynamic_batching {
preferred_batch_size: [ 4, 8, 16 ]
max_queue_delay_microseconds: 100
preserve_ordering: false
priority_levels: 2
default_priority_level: 1  # valid levels are 1..priority_levels
default_queue_policy {
timeout_action: REJECT
default_timeout_microseconds: 10000
max_queue_size: 100
}
priority_queue_policy {
key: 1
value: {
timeout_action: DELAY
default_timeout_microseconds: 5000
max_queue_size: 50
}
}
}
Tuning Parameters
max_queue_delay_microseconds: Balance between latency and throughput
- Lower (10-50µs): Better latency, lower throughput
- Higher (100-500µs): Worse latency, higher throughput
preferred_batch_size: Model-specific optimal sizes
# Find optimal batch size
for bs in 1 2 4 8 16 32; do
perf_analyzer -m model -b $bs
done
Instance Optimization
GPU Instance Configuration
instance_group [
{
count: 2
kind: KIND_GPU
gpus: [ 0 ]
}
]
Finding Optimal Instance Count
import subprocess

results = []
for count in range(1, 9):
    # instance_group block to place in the model's config.pbtxt
    config = f"""
instance_group [
  {{ count: {count}, kind: KIND_GPU, gpus: [ 0 ] }}
]
"""
    # 1. Write `config` into config.pbtxt and reload the model
    #    (see the model control API sketch below).
    # 2. Measure with perf_analyzer.
    result = subprocess.run(
        ["perf_analyzer", "-m", "model", "--concurrency-range", "1:32"],
        capture_output=True, text=True,
    )
    results.append({"count": count, "output": result.stdout})

# Compare throughput and latency across instance counts in `results`
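One way to perform the reload step above, assuming the server was started with --model-control-mode=explicit, is the model control API:
# Sketch: reload the model so the new instance_group takes effect
# (requires: tritonserver --model-control-mode=explicit ...)
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient("localhost:8000")
client.unload_model("model")     # drop the old instances
client.load_model("model")       # re-read config.pbtxt and load new ones
assert client.is_model_ready("model")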
Multi-GPU Configuration
# Strategy 1: Replicas on each GPU
instance_group [
{
count: 2
kind: KIND_GPU
gpus: [ 0, 1 ] # 2 instances on each GPU = 4 total
}
]
# Strategy 2: Dedicate a GPU to this model (other models' configs point at other GPUs)
instance_group [
{
count: 4
kind: KIND_GPU
gpus: [ 0 ]
}
]
TensorRT Optimization
Convert ONNX to TensorRT
import tensorrt as trt

def build_engine(onnx_path, engine_path, precision="fp16"):
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, logger)
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError(f"Failed to parse {onnx_path}")
    config = builder.create_builder_config()
    # 1 GB workspace; set_memory_pool_limit replaces the deprecated
    # max_workspace_size attribute (TensorRT >= 8.4)
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)
    if precision == "fp16":
        config.set_flag(trt.BuilderFlag.FP16)
    elif precision == "int8":
        config.set_flag(trt.BuilderFlag.INT8)  # also requires a calibrator
    # build_serialized_network returns the serialized engine directly
    serialized_engine = builder.build_serialized_network(network, config)
    if serialized_engine is None:
        raise RuntimeError("Engine build failed")
    with open(engine_path, "wb") as f:
        f.write(serialized_engine)

build_engine("model.onnx", "model.plan", "fp16")
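If you prefer not to script the build, the trtexec tool that ships with TensorRT can usually produce the same engine; the flags below are a sketch and may vary by TensorRT version:
# Build an FP16 engine from ONNX with trtexec
trtexec --onnx=model.onnx --saveEngine=model.plan --fp16
# Place the result in the model repository, e.g. /models/trt_model/1/model.plan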
TensorRT Configuration
name: "trt_model"
platform: "tensorrt_plan"
max_batch_size: 16
input [
{
name: "input"
data_type: TYPE_FP32
dims: [ 3, 224, 224 ]
}
]
optimization {
cuda {
graphs: true
graph_spec {
batch_size: 1
input {
key: "input"
value: {
dim: [ 3, 224, 224 ]
}
}
}
}
# Note: execution_accelerators delegates to TensorRT from a framework backend
# (e.g. ONNX Runtime); a native tensorrt_plan model is already a TensorRT
# engine and does not need this block.
execution_accelerators {
gpu_execution_accelerator : [
{
name : "tensorrt"
parameters {
key: "precision_mode"
value: "FP16"
}
parameters {
key: "max_workspace_size_bytes"
value: "1073741824"
}
}
]
}
}
Memory Optimization
GPU Memory Management
# Limit GPU memory per instance
parameters {
key: "gpu_memory_fraction"
value: { string_value: "0.5" }
}
# Allow memory growth (TensorFlow)
parameters {
key: "allow_gpu_memory_growth"
value: { string_value: "true" }
}
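These parameter keys are backend-specific. For TensorFlow models the same limit can also be applied server-wide at launch via --backend-config; the setting name below is taken from the TensorFlow backend's documented options, so treat it as a sketch for your backend version:
# Apply a GPU memory fraction to all TensorFlow models at server start
tritonserver --model-repository=/models \
--backend-config=tensorflow,gpu-memory-fraction=0.5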
Model Instance Pinning
instance_group [
{
count: 1
kind: KIND_GPU
gpus: [ 0 ]
passive: false
# Associate this instance with a host policy defined at server start
# (e.g. to pin input handling to a NUMA node); see the sketch below
host_policy: "gpus_0"
}
]
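The host_policy name refers to a policy defined when the server is launched; here is a sketch of defining it, with setting names assumed from Triton's host-policy support (NUMA node and CPU cores):
# Define the "gpus_0" host policy referenced above
tritonserver --model-repository=/models \
--host-policy=gpus_0,numa-node=0 \
--host-policy=gpus_0,cpu-cores=0-15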
Shared Memory
Use shared memory for zero-copy data transfer:
Client side:
import numpy as np
import tritonclient.http as httpclient
import tritonclient.utils.shared_memory as shm

triton_client = httpclient.InferenceServerClient("localhost:8000")
# Prepare the input data that will live in system shared memory
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)
input_byte_size = input_data.size * input_data.itemsize
triton_client.unregister_system_shared_memory()
triton_client.unregister_cuda_shared_memory()
# Create and register
shm_handle = shm.create_shared_memory_region(
"input_data", "/input_shm", input_byte_size
)
shm.set_shared_memory_region(shm_handle, [input_data])
triton_client.register_system_shared_memory(
"input_data", "/input_shm", input_byte_size
)
# Use in inference
inputs = httpclient.InferInput("input", input_data.shape, "FP32")
inputs.set_shared_memory("input_data", input_byte_size)
results = triton_client.infer("model", inputs=[inputs])
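When the region is no longer needed, unregister it from the server and release it on the client:
# Clean up the shared memory region after use
triton_client.unregister_system_shared_memory("input_data")
shm.destroy_shared_memory_region(shm_handle)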
Backend Optimization
ONNX Runtime
optimization {
execution_accelerators {
cpu_execution_accelerator : [
{
name : "openvino"
}
]
gpu_execution_accelerator : [
{
name : "tensorrt"
parameters { key: "precision_mode" value: "FP16" }
}
]
}
}
# ONNX Runtime session options
parameters {
key: "intra_op_thread_count"
value: { string_value: "8" }
}
parameters {
key: "inter_op_thread_count"
value: { string_value: "2" }
}
parameters {
key: "execution_mode"
value: { string_value: "1" }  # 1 = parallel, 0 = sequential
}
PyTorch
# Enable the JIT fuser
parameters {
key: "ENABLE_NVFUSER"
value: { string_value: "true" }
}
# Thread configuration
parameters {
key: "INTRA_OP_THREAD_COUNT"
value: { string_value: "8" }
}
TensorFlow
# GPU memory configuration
parameters {
key: "gpu_memory_fraction"
value: { string_value: "0.8" }
}
# XLA compilation
parameters {
key: "TF_XLA_FLAGS"
value: { string_value: "--tf_xla_auto_jit=2" }
}
Network Optimization
Protocol Selection
gRPC vs HTTP:
- gRPC: Lower latency, streaming support
- HTTP: Better compatibility, easier debugging
# GRPC client (faster)
import tritonclient.grpc as grpcclient
client = grpcclient.InferenceServerClient("localhost:8001")
# HTTP client (compatible)
import tritonclient.http as httpclient
client = httpclient.InferenceServerClient("localhost:8000")
Compression
Enable compression for large payloads:
# gRPC with per-request compression via the tritonclient API
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient("localhost:8001")
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)
inputs = grpcclient.InferInput("input", input_data.shape, "FP32")
inputs.set_data_from_numpy(input_data)
# compression_algorithm accepts "gzip" or "deflate"
results = client.infer("model", inputs=[inputs], compression_algorithm="gzip")
Request Batching on Client
import numpy as np
import tritonclient.http as httpclient
client = httpclient.InferenceServerClient("localhost:8000")
# Batch multiple samples
batch_size = 8
input_data = np.random.randn(batch_size, 3, 224, 224).astype(np.float32)
inputs = httpclient.InferInput("input", input_data.shape, "FP32")
inputs.set_data_from_numpy(input_data)
results = client.infer("model", inputs=[inputs])
output = results.as_numpy("output") # Shape: (8, num_classes)
Model Ensemble Optimization
Pipeline Parallelism
name: "ensemble_model"
platform: "ensemble"
ensemble_scheduling {
step [
{
model_name: "preprocessing"
model_version: -1
},
{
model_name: "inference"
model_version: -1
},
{
model_name: "postprocessing"
model_version: -1
}
]
}
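A runnable ensemble also needs input_map/output_map entries that wire each step's tensors together; the sketch below uses assumed tensor names (IMAGE, preprocessed, logits, LABELS) that would have to match the real models' inputs and outputs:
ensemble_scheduling {
step [
{
model_name: "preprocessing"
model_version: -1
input_map { key: "raw_image" value: "IMAGE" }              # ensemble input
output_map { key: "preprocessed_image" value: "preprocessed" }
},
{
model_name: "inference"
model_version: -1
input_map { key: "input" value: "preprocessed" }
output_map { key: "output" value: "logits" }
},
{
model_name: "postprocessing"
model_version: -1
input_map { key: "scores" value: "logits" }
output_map { key: "labels" value: "LABELS" }               # ensemble output
}
]
}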
Parallel Execution
ensemble_scheduling {
step [
{
model_name: "detector_1"
model_version: -1
},
{
model_name: "detector_2"
model_version: -1
}
]
# Steps with no data dependency between them are scheduled in parallel
}
Quantization
INT8 Quantization
import torch
from torch.quantization import quantize_dynamic
# Dynamic quantization (post-training)
model_fp32 = MyModel()
model_int8 = quantize_dynamic(
model_fp32,
{torch.nn.Linear},
dtype=torch.qint8
)
# Save quantized model
torch.jit.save(torch.jit.script(model_int8), "model_int8.pt")
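The quantized TorchScript file is served by the PyTorch (libtorch) backend; below is a minimal config sketch, with tensor names following the backend's positional naming convention and dims assumed for an image classifier:
name: "model_int8"
platform: "pytorch_libtorch"
max_batch_size: 16
input [
{
name: "INPUT__0"
data_type: TYPE_FP32
dims: [ 3, 224, 224 ]
}
]
output [
{
name: "OUTPUT__0"
data_type: TYPE_FP32
dims: [ 1000 ]
}
]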
Mixed Precision
optimization {
execution_accelerators {
gpu_execution_accelerator : [
{
name : "tensorrt"
parameters {
key: "precision_mode"
value: "FP16"
}
}
]
}
}
Caching
Response Cache
response_cache {
enable: true
}
model_transaction_policy {
decoupled: false
}
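The model-level setting only takes effect when a cache implementation is enabled at server start; here is a sketch using the local cache (size is an example value in bytes):
# Enable a 256 MB local response cache at server start
tritonserver --model-repository=/models \
--cache-config local,size=268435456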
Model Warmup
model_warmup [
{
name: "warmup_full_batch"
batch_size: 8
inputs {
key: "input"
value: {
data_type: TYPE_FP32
dims: [ 3, 224, 224 ]
random_data: true
}
}
},
{
name: "warmup_single"
batch_size: 1
inputs {
key: "input"
value: {
data_type: TYPE_FP32
dims: [ 3, 224, 224 ]
zero_data: true
}
}
}
]
Benchmarking Best Practices
1. Baseline Measurement
# Measure without optimization
perf_analyzer -m model --concurrency-range 1 > baseline.txt
2. Systematic Testing
#!/bin/bash
for instances in 1 2 4 8; do
for batch_size in 1 2 4 8 16; do
echo "Testing instances=$instances, batch_size=$batch_size"
# Update config
# Restart Triton
perf_analyzer -m model -b $batch_size >> results.txt
done
done
3. Load Testing
import concurrent.futures
import time
import numpy as np
import tritonclient.http as httpclient
def send_request(client, model_name, input_data):
inputs = httpclient.InferInput("input", input_data.shape, "FP32")
inputs.set_data_from_numpy(input_data)
return client.infer(model_name, inputs=[inputs])
# Simulate load
client = httpclient.InferenceServerClient("localhost:8000")
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)
num_requests = 1000
num_workers = 50
start = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as executor:
futures = [
executor.submit(send_request, client, "model", input_data)
for _ in range(num_requests)
]
results = [f.result() for f in concurrent.futures.as_completed(futures)]
duration = time.time() - start
throughput = num_requests / duration
print(f"Throughput: {throughput:.2f} req/s")
Monitoring Performance
Key Metrics to Track
# Request metrics
nv_inference_request_success
nv_inference_request_failure
nv_inference_request_duration_us
# Queue metrics
nv_inference_queue_duration_us
nv_inference_pending_request_count
# Execution metrics
nv_inference_compute_input_duration_us
nv_inference_compute_output_duration_us
nv_inference_compute_infer_duration_us
# Resource metrics
nv_gpu_utilization
nv_gpu_memory_used_bytes
nv_gpu_power_usage
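All of these are exported in Prometheus text format on the metrics endpoint, which listens on port 8002 by default:
# Inspect the raw metrics endpoint
curl -s localhost:8002/metrics | grep nv_inference_request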
Grafana Query Examples
# Average request latency (usec): cumulative duration counter / request counter
rate(nv_inference_request_duration_us[5m])
/
rate(nv_inference_request_success[5m])
# Throughput
sum(rate(nv_inference_request_success[5m]))
# GPU utilization
avg(nv_gpu_utilization{gpu_uuid=~".*"})
Performance Checklist
- Profile model with perf_analyzer
- Enable dynamic batching with optimal batch sizes
- Set appropriate instance count (1-4 per GPU typically)
- Use TensorRT for NVIDIA GPUs
- Enable FP16/INT8 precision where possible
- Configure CUDA graphs for small models
- Use gRPC instead of HTTP for lower latency
- Enable model warmup
- Monitor GPU utilization (target: >80%)
- Optimize queue delay vs batch size trade-off
- Use shared memory for large inputs
- Profile end-to-end latency including preprocessing
- Test under realistic load patterns
- Set up alerts for performance degradation
Next Steps
- Best Practices - Production-ready patterns
- Troubleshooting - Performance debugging