
Best Practices

Production-ready best practices for deploying and managing NVIDIA Triton Inference Server at scale.

Model Management

Version Control

Semantic Versioning:

models/
└── resnet50/
    ├── 1/              # v1.0.0 - Initial release
    ├── 2/              # v1.1.0 - Performance improvements
    ├── 3/              # v2.0.0 - Architecture change
    └── config.pbtxt

Version Policy:

# Keep the last 3 versions
version_policy: {
  latest { num_versions: 3 }
}

# Or use labels for explicit control
# labels.txt:
#   stable 2
#   canary 3
#   latest 3

Model Testing Before Deployment

import tritonclient.http as httpclient
import numpy as np

def validate_model(model_name, version, test_data, expected_shape):
    """Validate a model before production deployment."""
    client = httpclient.InferenceServerClient("localhost:8000")

    # Check the model is loaded (the HTTP client returns metadata as a dict)
    model_metadata = client.get_model_metadata(model_name, version)
    assert model_metadata["name"] == model_name

    # Test inference
    inputs = httpclient.InferInput("input", list(test_data.shape), "FP32")
    inputs.set_data_from_numpy(test_data)

    results = client.infer(model_name, inputs=[inputs], model_version=version)
    output = results.as_numpy("output")

    # Validate output shape
    assert output.shape == expected_shape, f"Expected {expected_shape}, got {output.shape}"

    # Validate output values
    assert not np.isnan(output).any(), "Output contains NaN"
    assert not np.isinf(output).any(), "Output contains Inf"

    return True

# Run validation
test_data = np.random.randn(1, 3, 224, 224).astype(np.float32)
validate_model("resnet50", "3", test_data, (1, 1000))

Gradual Rollout Strategy

# Step 1: Deploy the new version alongside the old one
version_policy: {
  specific { versions: [2, 3] }  # Old=2, New=3
}

# Step 2: Route a percentage of traffic to the new version
# (via a load balancer or A/B testing framework; see the sketch below)

# Step 3: Monitor metrics for both versions

# Step 4: Fully switch to the new version
version_policy: {
  latest { num_versions: 1 }
}
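
For Step 2, the split can live in the client or API gateway rather than in Triton itself. A minimal client-side sketch, assuming the hypothetical 90/10 weights below and that versions 2 and 3 of the model are both loaded:

import random
import tritonclient.http as httpclient

# Hypothetical traffic split: 90% to the stable version, 10% to the canary.
VERSION_WEIGHTS = {"2": 0.9, "3": 0.1}

def pick_version(weights=VERSION_WEIGHTS):
    """Choose a model version according to the configured traffic split."""
    versions, probs = zip(*weights.items())
    return random.choices(versions, weights=probs, k=1)[0]

def routed_infer(client, model_name, inputs):
    """Send the request to a weighted-random model version."""
    version = pick_version()
    result = client.infer(model_name, inputs=inputs, model_version=version)
    return version, result

# Usage (inputs is a list of prepared httpclient.InferInput objects)
# client = httpclient.InferenceServerClient("localhost:8000")
# version, result = routed_infer(client, "resnet50", inputs)

Recording which version served each request makes the per-version comparison in Step 3 straightforward.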

Configuration Management

Centralized Configuration

# config/triton-config.yaml
server:
  model_repository: s3://my-bucket/models
  log_level: INFO
  metrics_port: 8002
  strict_model_config: false

models:
  resnet50:
    max_batch_size: 8
    instances: 2
    dynamic_batching:
      preferred_batch_size: [4, 8]
      max_queue_delay_us: 100

  bert_base:
    max_batch_size: 16
    instances: 1
    dynamic_batching:
      preferred_batch_size: [8, 16]
      max_queue_delay_us: 200
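
Triton does not read this YAML directly; the config.pbtxt under each model directory is still what the server loads. A minimal rendering sketch, assuming PyYAML is installed and the hypothetical paths below, that expands the central file into per-model configs:

import yaml
from pathlib import Path

# Hypothetical paths; adjust to your repository layout.
CONFIG_FILE = "config/triton-config.yaml"
MODEL_REPO = Path("models")

# Template mapping the YAML keys onto config.pbtxt fields.
PBTXT_TEMPLATE = """max_batch_size: {max_batch_size}
instance_group [ {{ count: {instances}, kind: KIND_GPU }} ]
dynamic_batching {{
  preferred_batch_size: {preferred}
  max_queue_delay_microseconds: {delay}
}}
"""

def render_configs():
    """Generate a config.pbtxt for every model in the central YAML."""
    with open(CONFIG_FILE) as f:
        cfg = yaml.safe_load(f)
    for name, m in cfg["models"].items():
        body = PBTXT_TEMPLATE.format(
            max_batch_size=m["max_batch_size"],
            instances=m["instances"],
            preferred=m["dynamic_batching"]["preferred_batch_size"],
            delay=m["dynamic_batching"]["max_queue_delay_us"],
        )
        out = MODEL_REPO / name / "config.pbtxt"
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text(body)

if __name__ == "__main__":
    render_configs()

Running the script in CI keeps every environment's model repository consistent with the single reviewed configuration file.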

Environment-Specific Configs

# Development
tritonserver \
  --model-repository=/local/models \
  --log-verbose=1 \
  --model-control-mode=explicit

# Staging
tritonserver \
  --model-repository=s3://staging-bucket/models \
  --log-info=1 \
  --strict-model-config=true

# Production
tritonserver \
  --model-repository=s3://prod-bucket/models \
  --log-info=1 \
  --strict-model-config=true \
  --exit-timeout-secs=30

Resource Management

GPU Memory Allocation

# Limit memory per model instance
parameters {
  key: "gpu_memory_fraction"
  value: { string_value: "0.3" }
}

# Multiple models sharing one GPU:
#   Model 1: 30% GPU memory
#   Model 2: 30% GPU memory
#   Model 3: 30% GPU memory
#   Reserve: 10% for overhead

CPU Thread Configuration

# ONNX Runtime
parameters {
  key: "intra_op_thread_count"
  value: { string_value: "4" }
}
parameters {
  key: "inter_op_thread_count"
  value: { string_value: "2" }
}

# PyTorch (LibTorch backend)
parameters {
  key: "INTRA_OP_THREAD_COUNT"
  value: { string_value: "4" }
}

Request Rate Limiting

instance_group [
  {
    count: 2
    kind: KIND_GPU
    rate_limiter {
      resources [
        {
          name: "GPU_MEMORY"
          count: 1000  # MB
        }
      ]
    }
  }
]

Security Best Practices

Authentication

Using NGINX Reverse Proxy:

server {
    listen 443 ssl;
    server_name triton.example.com;

    ssl_certificate /etc/ssl/certs/triton.crt;
    ssl_certificate_key /etc/ssl/private/triton.key;

    # Basic auth
    auth_basic "Triton Inference Server";
    auth_basic_user_file /etc/nginx/.htpasswd;

    location / {
        proxy_pass http://triton-backend:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}
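
Clients then authenticate against the proxy rather than against Triton directly. The HTTP client accepts per-request headers, so the basic-auth credentials can be attached to every call; a minimal sketch, assuming a hypothetical svc-inference account exists in the .htpasswd file above:

import base64
import tritonclient.http as httpclient

# Hypothetical service account for the NGINX basic-auth layer.
USER, PASSWORD = "svc-inference", "change-me"
token = base64.b64encode(f"{USER}:{PASSWORD}".encode()).decode()
AUTH_HEADERS = {"Authorization": f"Basic {token}"}

# Talk to the proxy, not to Triton directly.
# (ssl_options may be needed depending on your certificate setup)
client = httpclient.InferenceServerClient("triton.example.com:443", ssl=True)

# Health, metadata, and inference calls all accept the same headers argument.
assert client.is_server_ready(headers=AUTH_HEADERS)
# result = client.infer("resnet50", inputs=[...], headers=AUTH_HEADERS)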

Input Validation

# Python backend model.py
import triton_python_backend_utils as pb_utils
import numpy as np

class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            input_tensor = pb_utils.get_input_tensor_by_name(request, "input")
            input_data = input_tensor.as_numpy()

            # Validate shape
            if input_data.shape != (3, 224, 224):
                error = pb_utils.TritonError(
                    "Invalid input shape. Expected (3, 224, 224)"
                )
                responses.append(pb_utils.InferenceResponse(error=error))
                continue

            # Validate range
            if input_data.min() < -3 or input_data.max() > 3:
                error = pb_utils.TritonError(
                    "Input values out of expected range [-3, 3]"
                )
                responses.append(pb_utils.InferenceResponse(error=error))
                continue

            # Process valid input and append its InferenceResponse
            # ...

        return responses

Network Isolation

# Kubernetes NetworkPolicy
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: triton-network-policy
spec:
  podSelector:
    matchLabels:
      app: triton-server
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api-gateway
      ports:
        - protocol: TCP
          port: 8000
        - protocol: TCP
          port: 8001
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: model-storage
      ports:
        - protocol: TCP
          port: 443

Monitoring and Alerting

Essential Metrics

# Prometheus alerts
groups:
  - name: triton_alerts
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: |
          rate(nv_inference_request_failure[5m]) > 0.05
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High inference error rate"

      # High latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            rate(nv_inference_request_duration_us_bucket[5m])
          ) > 100000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "95th percentile latency > 100ms"

      # Low GPU utilization (nv_gpu_utilization is reported as a fraction, 0.0-1.0)
      - alert: LowGPUUtilization
        expr: |
          avg(nv_gpu_utilization) < 0.3
        for: 10m
        labels:
          severity: info
        annotations:
          summary: "GPU utilization below 30%"

      # GPU memory exhaustion
      - alert: GPUMemoryHigh
        expr: |
          (nv_gpu_memory_used_bytes / nv_gpu_memory_total_bytes) > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "GPU memory usage > 90%"
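
For a quick spot check outside Prometheus, the same counters can be read straight from Triton's metrics endpoint (port 8002 by default, plain Prometheus text format). A minimal sketch, assuming the server runs locally and the requests package is available:

import requests

METRICS_URL = "http://localhost:8002/metrics"  # Triton's Prometheus endpoint

def read_counters(names):
    """Sum the values of the given Prometheus counters across all labels."""
    totals = {n: 0.0 for n in names}
    for line in requests.get(METRICS_URL, timeout=5).text.splitlines():
        if line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        for name in names:
            if line.startswith(name):
                totals[name] += float(line.rsplit(" ", 1)[-1])
    return totals

counts = read_counters(["nv_inference_request_success", "nv_inference_request_failure"])
total = counts["nv_inference_request_success"] + counts["nv_inference_request_failure"]
if total:
    print(f"failure ratio: {counts['nv_inference_request_failure'] / total:.4f}")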

Logging Best Practices

# Persist full logs to disk while surfacing error lines on the console
# (Triton's default log format prefixes each line with its severity letter)
tritonserver \
  --model-repository=/models \
  --log-verbose=0 \
  --log-info=1 2>&1 | \
  tee /var/log/triton/server.log | \
  grep --line-buffered '^E'

Health Checks

# Kubernetes liveness and readiness probes
livenessProbe:
  httpGet:
    path: /v2/health/live
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /v2/health/ready
    port: 8000
  initialDelaySeconds: 30
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3
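
The same endpoints can also back a pre-deployment gate: before shifting traffic, verify that the server and every required model report ready. A minimal sketch using the client's readiness calls, with a hypothetical model list:

import sys
import tritonclient.http as httpclient

REQUIRED_MODELS = ["resnet50", "bert_base"]  # hypothetical list for this deployment

def readiness_gate(url="localhost:8000"):
    """Return True only if the server and all required models are ready."""
    client = httpclient.InferenceServerClient(url)
    if not (client.is_server_live() and client.is_server_ready()):
        return False
    return all(client.is_model_ready(name) for name in REQUIRED_MODELS)

if __name__ == "__main__":
    sys.exit(0 if readiness_gate() else 1)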

Error Handling

Client-Side Retry Logic

import tritonclient.http as httpclient
import time
from tritonclient.utils import InferenceServerException

def infer_with_retry(client, model_name, inputs, max_retries=3, backoff=1):
    """Robust inference with exponential backoff."""
    for attempt in range(max_retries):
        try:
            result = client.infer(model_name, inputs=inputs)
            return result
        except InferenceServerException as e:
            if attempt == max_retries - 1:
                raise

            # Check if the error is retryable
            if "unavailable" in str(e).lower() or "timeout" in str(e).lower():
                wait_time = backoff * (2 ** attempt)
                print(f"Retry attempt {attempt + 1}/{max_retries} after {wait_time}s")
                time.sleep(wait_time)
            else:
                raise

    raise Exception("Max retries exceeded")

# Usage
client = httpclient.InferenceServerClient("localhost:8000")
result = infer_with_retry(client, "model", inputs)

Circuit Breaker Pattern

from datetime import datetime, timedelta

class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout=60):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.failures = 0
        self.last_failure_time = None
        self.state = "CLOSED"  # CLOSED, OPEN, HALF_OPEN

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if datetime.now() - self.last_failure_time > timedelta(seconds=self.timeout):
                self.state = "HALF_OPEN"
            else:
                raise Exception("Circuit breaker is OPEN")

        try:
            result = func(*args, **kwargs)
            self.on_success()
            return result
        except Exception:
            self.on_failure()
            raise

    def on_success(self):
        self.failures = 0
        self.state = "CLOSED"

    def on_failure(self):
        self.failures += 1
        self.last_failure_time = datetime.now()
        if self.failures >= self.failure_threshold:
            self.state = "OPEN"

# Usage
cb = CircuitBreaker()
result = cb.call(client.infer, "model", inputs=[inputs])

Performance Testing

Load Testing Script

import concurrent.futures
import time
import numpy as np
import tritonclient.http as httpclient

def load_test(
    url="localhost:8000",
    model_name="model",
    num_requests=1000,
    concurrency=50,
    input_shape=(1, 3, 224, 224),
):
    """Comprehensive load test."""
    # Size the client's connection pool to match the request concurrency
    client = httpclient.InferenceServerClient(url, concurrency=concurrency)
    input_data = np.random.randn(*input_shape).astype(np.float32)

    results = {
        'success': 0,
        'failure': 0,
        'latencies': []
    }

    def send_request():
        start = time.time()
        try:
            inputs = httpclient.InferInput("input", list(input_data.shape), "FP32")
            inputs.set_data_from_numpy(input_data)
            client.infer(model_name, inputs=[inputs])
            latency = (time.time() - start) * 1000  # ms
            results['latencies'].append(latency)
            results['success'] += 1
        except Exception as e:
            results['failure'] += 1
            print(f"Error: {e}")

    start_time = time.time()
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as executor:
        futures = [executor.submit(send_request) for _ in range(num_requests)]
        concurrent.futures.wait(futures)

    duration = time.time() - start_time

    # Calculate statistics
    latencies = sorted(results['latencies'])
    return {
        'total_requests': num_requests,
        'success': results['success'],
        'failure': results['failure'],
        'duration_s': duration,
        'throughput': results['success'] / duration,
        'latency_p50': np.percentile(latencies, 50),
        'latency_p95': np.percentile(latencies, 95),
        'latency_p99': np.percentile(latencies, 99),
        'latency_avg': np.mean(latencies),
    }

# Run test
stats = load_test(concurrency=50, num_requests=1000)
print(f"""
Load Test Results:
------------------
Total Requests: {stats['total_requests']}
Success: {stats['success']}
Failure: {stats['failure']}
Duration: {stats['duration_s']:.2f}s
Throughput: {stats['throughput']:.2f} req/s
Latency (avg): {stats['latency_avg']:.2f}ms
Latency (p50): {stats['latency_p50']:.2f}ms
Latency (p95): {stats['latency_p95']:.2f}ms
Latency (p99): {stats['latency_p99']:.2f}ms
""")

Disaster Recovery

Backup Strategies

# Backup model repository
aws s3 sync s3://prod-models/ s3://backup-models/ \
  --exclude "*/2/*" \
  --exclude "*/1/*"   # Keep only latest versions

# Backup configuration
kubectl get configmap triton-config -o yaml > backup/triton-config.yaml
kubectl get deployment triton-server -o yaml > backup/triton-deployment.yaml

Rollback Procedure

# 1. Identify issue
kubectl logs -l app=triton-server --tail=100

# 2. Rollback deployment
kubectl rollout undo deployment/triton-server

# 3. Verify
kubectl rollout status deployment/triton-server

# 4. Check health
curl http://triton-service:8000/v2/health/ready

Documentation

Model Cards

Create models/resnet50/README.md:

# ResNet50 Image Classification

## Model Information
- **Version**: 3.0.0
- **Framework**: PyTorch 2.0
- **Input**: RGB image (3, 224, 224)
- **Output**: 1000 class probabilities
- **Precision**: FP16

## Performance
- Latency (p95): 5ms
- Throughput: 2000 infer/s
- GPU: NVIDIA T4

## Deployment
- Instance Count: 2
- Batch Size: 8
- Dynamic Batching: Enabled

## Change Log
### v3.0.0 (2024-10-01)
- Improved accuracy by 2%
- Reduced latency by 15%
- Updated to PyTorch 2.0

### v2.0.0 (2024-08-01)
- Architecture changes
- TensorRT optimization

Checklist for Production

  • Models

    • Version control implemented
    • Model cards documented
    • Validation tests pass
    • Rollback strategy defined
  • Configuration

    • Environment-specific configs
    • Resource limits set
    • Dynamic batching tuned
    • Warmup configured
  • Security

    • TLS/SSL enabled
    • Authentication configured
    • Input validation implemented
    • Network policies defined
  • Monitoring

    • Metrics collection enabled
    • Alerts configured
    • Dashboards created
    • Log aggregation setup
  • Reliability

    • Health checks configured
    • Auto-scaling enabled
    • Circuit breakers implemented
    • Retry logic added
  • Performance

    • Load testing completed
    • Latency targets met
    • GPU utilization optimized
    • Bottlenecks identified
  • Operations

    • Deployment automation
    • Backup strategy
    • Rollback procedure
    • Runbooks created

Next Steps