Best Practices
Production-ready best practices for deploying and managing NVIDIA Triton Inference Server at scale.
Model Management
Version Control
Semantic Versioning:
Triton version directories are plain integers; map them to semantic versions in comments or model cards, for example:
models/
└── resnet50/
    ├── 1/              # v1.0.0 - Initial release
    ├── 2/              # v1.1.0 - Performance improvements
    ├── 3/              # v2.0.0 - Architecture change
    └── config.pbtxt
Version Policy:
# Keep last 3 versions
version_policy: {
latest { num_versions: 3 }
}
# Or serve only explicitly pinned versions
# version_policy: {
#   specific { versions: [2, 3] }
# }
# (Triton has no built-in version labels such as "stable" or "canary";
#  map those names to version numbers in your client or routing layer.)
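To confirm which versions a policy actually exposes, query the server; a minimal sketch using the HTTP client (URL and model name are examples):
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient("localhost:8000")

# List every model and version known to the repository, with its load state
for entry in client.get_model_repository_index():
    print(entry["name"], entry.get("version", "-"), entry.get("state", "-"))

# Or check one pinned version directly
print(client.is_model_ready("resnet50", model_version="2"))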
Model Testing Before Deployment
import tritonclient.http as httpclient
import numpy as np
def validate_model(model_name, version, test_data, expected_shape):
"""Validate model before production deployment."""
client = httpclient.InferenceServerClient("localhost:8000")
    # Check the model is loaded (the HTTP client returns metadata as a dict)
    model_metadata = client.get_model_metadata(model_name, version)
    assert model_metadata["name"] == model_name
# Test inference
inputs = httpclient.InferInput("input", test_data.shape, "FP32")
inputs.set_data_from_numpy(test_data)
results = client.infer(model_name, inputs=[inputs], model_version=version)
output = results.as_numpy("output")
# Validate output shape
assert output.shape == expected_shape, f"Expected {expected_shape}, got {output.shape}"
# Validate output range
assert not np.isnan(output).any(), "Output contains NaN"
assert not np.isinf(output).any(), "Output contains Inf"
return True
# Run validation
test_data = np.random.randn(1, 3, 224, 224).astype(np.float32)
validate_model("resnet50", "3", test_data, (1, 1000))
Gradual Rollout Strategy
# Step 1: Deploy new version alongside old
version_policy: {
specific { versions: [2, 3] } # Old=2, New=3
}
# Step 2: Route percentage of traffic to new version
# Use load balancer or A/B testing framework
# Step 3: Monitor metrics for both versions
# Step 4: Fully switch to new version
version_policy: {
latest { num_versions: 1 }
}
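For step 2, traffic can also be split client-side by pinning model_version per request; a minimal sketch (the 10% canary fraction and version numbers are illustrative):
import random
import tritonclient.http as httpclient

CANARY_FRACTION = 0.10  # assumed: ~10% of traffic goes to the new version

client = httpclient.InferenceServerClient("localhost:8000")

def canary_infer(inputs):
    # Version "2" is the baseline, "3" is the canary (matches the policy above)
    version = "3" if random.random() < CANARY_FRACTION else "2"
    result = client.infer("resnet50", inputs=inputs, model_version=version)
    return version, result  # return the version so outcomes can be tracked per version
Recording the chosen version with each request makes the per-version comparison in step 3 straightforward.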
Configuration Management
Centralized Configuration
# config/triton-config.yaml
server:
model_repository: s3://my-bucket/models
log_level: INFO
metrics_port: 8002
strict_model_config: false
models:
resnet50:
max_batch_size: 8
instances: 2
dynamic_batching:
preferred_batch_size: [4, 8]
max_queue_delay_us: 100
bert_base:
max_batch_size: 16
instances: 1
dynamic_batching:
preferred_batch_size: [8, 16]
max_queue_delay_us: 200
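Triton itself reads per-model config.pbtxt files rather than this YAML, so deployment tooling has to render it; a minimal sketch of such a script (the YAML schema above and the file layout are assumptions):
# render_configs.py - hypothetical helper that turns the YAML above into config.pbtxt files
import yaml  # PyYAML
from pathlib import Path

TEMPLATE = """name: "{name}"
max_batch_size: {max_batch_size}
instance_group [ {{ count: {instances} }} ]
dynamic_batching {{
  preferred_batch_size: {preferred}
  max_queue_delay_microseconds: {delay}
}}
"""

def render_configs(config_path="config/triton-config.yaml", repo="models"):
    cfg = yaml.safe_load(Path(config_path).read_text())
    for name, model in cfg["models"].items():
        pbtxt = TEMPLATE.format(
            name=name,
            max_batch_size=model["max_batch_size"],
            instances=model["instances"],
            preferred=model["dynamic_batching"]["preferred_batch_size"],
            delay=model["dynamic_batching"]["max_queue_delay_us"],
        )
        Path(repo, name).mkdir(parents=True, exist_ok=True)
        Path(repo, name, "config.pbtxt").write_text(pbtxt)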
Environment-Specific Configs
# Development
tritonserver \
--model-repository=/local/models \
--log-verbose=1 \
--model-control-mode=explicit
# Staging
tritonserver \
--model-repository=s3://staging-bucket/models \
--log-info=1 \
--strict-model-config=true
# Production
tritonserver \
--model-repository=s3://prod-bucket/models \
--log-info=1 \
--strict-model-config=true \
--exit-timeout-secs=30
Resource Management
GPU Memory Allocation
# Limit memory per model instance
parameters {
key: "gpu_memory_fraction"
value: { string_value: "0.3" }
}
# Multiple models sharing GPU
# Model 1: 30% GPU memory
# Model 2: 30% GPU memory
# Model 3: 30% GPU memory
# Reserve: 10% for overhead
CPU Thread Configuration
# ONNX Runtime
parameters {
key: "intra_op_thread_count"
value: { string_value: "4" }
}
parameters {
key: "inter_op_thread_count"
value: { string_value: "2" }
}
# PyTorch
parameters {
key: "PYTORCH_THREADS"
value: { string_value: "4" }
}
Request Rate Limiting
Rate limiting is opt-in: the resource limits below only take effect when the server is started with --rate-limit=execution_count (the default mode, off, ignores them).
instance_group [
{
count: 2
kind: KIND_GPU
rate_limiter {
resources [
{
name: "GPU_MEMORY"
count: 1000 # MB
}
]
}
}
]
Security Best Practices
Authentication
Using NGINX Reverse Proxy:
server {
listen 443 ssl;
server_name triton.example.com;
ssl_certificate /etc/ssl/certs/triton.crt;
ssl_certificate_key /etc/ssl/private/triton.key;
# Basic auth
auth_basic "Triton Inference Server";
auth_basic_user_file /etc/nginx/.htpasswd;
location / {
proxy_pass http://triton-backend:8000;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
}
}
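Clients then connect through the proxy over HTTPS and pass credentials as request headers; a minimal sketch (hostname and credentials are placeholders):
import base64
import tritonclient.http as httpclient

# Placeholder credentials for the proxy's basic auth
token = base64.b64encode(b"username:password").decode()
auth_headers = {"Authorization": f"Basic {token}"}

# ssl=True makes the client speak HTTPS to the NGINX proxy on port 443
client = httpclient.InferenceServerClient(url="triton.example.com:443", ssl=True)

# Per-request headers are checked by NGINX before the request reaches Triton
server_ready = client.is_server_ready(headers=auth_headers)
# results = client.infer("resnet50", inputs=[...], headers=auth_headers)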
Input Validation
# Python backend model.py
import triton_python_backend_utils as pb_utils
import numpy as np
class TritonPythonModel:
def execute(self, requests):
responses = []
for request in requests:
input_tensor = pb_utils.get_input_tensor_by_name(request, "input")
input_data = input_tensor.as_numpy()
# Validate shape
if input_data.shape != (3, 224, 224):
error = pb_utils.TritonError(
"Invalid input shape. Expected (3, 224, 224)"
)
responses.append(pb_utils.InferenceResponse(error=error))
continue
# Validate range
if input_data.min() < -3 or input_data.max() > 3:
error = pb_utils.TritonError(
"Input values out of expected range [-3, 3]"
)
responses.append(pb_utils.InferenceResponse(error=error))
continue
# Process valid input
# ...
return responses
Network Isolation
# Kubernetes NetworkPolicy
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: triton-network-policy
spec:
podSelector:
matchLabels:
app: triton-server
policyTypes:
- Ingress
- Egress
ingress:
- from:
- podSelector:
matchLabels:
app: api-gateway
ports:
- protocol: TCP
port: 8000
- protocol: TCP
port: 8001
egress:
- to:
- podSelector:
matchLabels:
app: model-storage
ports:
- protocol: TCP
port: 443
Monitoring and Alerting
Essential Metrics
# Prometheus alerts
groups:
- name: triton_alerts
rules:
# High error rate
- alert: HighErrorRate
expr: |
          rate(nv_inference_request_failure[5m])
            / (rate(nv_inference_request_success[5m])
               + rate(nv_inference_request_failure[5m])) > 0.05
for: 2m
labels:
severity: warning
annotations:
summary: "High inference error rate"
# High latency
- alert: HighLatency
expr: |
          rate(nv_inference_request_duration_us[5m])
            / rate(nv_inference_request_success[5m]) > 100000
for: 5m
labels:
severity: warning
annotations:
summary: "95th percentile latency > 100ms"
# Low GPU utilization
- alert: LowGPUUtilization
expr: |
          avg(nv_gpu_utilization) < 0.3
for: 10m
labels:
severity: info
annotations:
summary: "GPU utilization below 30%"
# GPU memory exhaustion
- alert: GPUMemoryHigh
expr: |
(nv_gpu_memory_used_bytes / nv_gpu_memory_total_bytes) > 0.9
for: 5m
labels:
severity: critical
annotations:
summary: "GPU memory usage > 90%"
Logging Best Practices
# Timestamped logs written to a file, with errors surfaced on the console
tritonserver \
  --model-repository=/models \
  --log-verbose=0 \
  --log-info=1 \
  --log-format=ISO8601 2>&1 | \
  tee /var/log/triton/server.log | \
  grep -i "error"
Health Checks
# Kubernetes liveness and readiness
livenessProbe:
httpGet:
path: /v2/health/live
port: 8000
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /v2/health/ready
port: 8000
initialDelaySeconds: 30
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 3
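The same endpoints can be checked from Python, for example as a post-deployment gate; a minimal sketch (the model name is an example):
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient("localhost:8000")

# Mirrors GET /v2/health/live and /v2/health/ready
assert client.is_server_live(), "server is not live"
assert client.is_server_ready(), "server is not ready"

# Per-model readiness, equivalent to GET /v2/models/resnet50/ready
assert client.is_model_ready("resnet50"), "model is not ready"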
Error Handling
Client-Side Retry Logic
import tritonclient.http as httpclient
import time
from tritonclient.utils import InferenceServerException
def infer_with_retry(client, model_name, inputs, max_retries=3, backoff=1):
"""Robust inference with exponential backoff."""
for attempt in range(max_retries):
try:
result = client.infer(model_name, inputs=inputs)
return result
except InferenceServerException as e:
if attempt == max_retries - 1:
raise
# Check if retryable error
if "unavailable" in str(e).lower() or "timeout" in str(e).lower():
wait_time = backoff * (2 ** attempt)
print(f"Retry attempt {attempt + 1}/{max_retries} after {wait_time}s")
time.sleep(wait_time)
else:
raise
raise Exception("Max retries exceeded")
# Usage
client = httpclient.InferenceServerClient("localhost:8000")
result = infer_with_retry(client, "model", inputs)
Circuit Breaker Pattern
from datetime import datetime, timedelta
class CircuitBreaker:
def __init__(self, failure_threshold=5, timeout=60):
self.failure_threshold = failure_threshold
self.timeout = timeout
self.failures = 0
self.last_failure_time = None
self.state = "CLOSED" # CLOSED, OPEN, HALF_OPEN
def call(self, func, *args, **kwargs):
if self.state == "OPEN":
if datetime.now() - self.last_failure_time > timedelta(seconds=self.timeout):
self.state = "HALF_OPEN"
else:
raise Exception("Circuit breaker is OPEN")
try:
result = func(*args, **kwargs)
self.on_success()
return result
except Exception as e:
self.on_failure()
raise
def on_success(self):
self.failures = 0
self.state = "CLOSED"
def on_failure(self):
self.failures += 1
self.last_failure_time = datetime.now()
if self.failures >= self.failure_threshold:
self.state = "OPEN"
# Usage
cb = CircuitBreaker()
result = cb.call(client.infer, "model", inputs=[inputs])
Performance Testing
Load Testing Script
import concurrent.futures
import time
import numpy as np
import tritonclient.http as httpclient
from collections import defaultdict
def load_test(
url="localhost:8000",
model_name="model",
num_requests=1000,
concurrency=50,
input_shape=(1, 3, 224, 224)
):
"""Comprehensive load test."""
client = httpclient.InferenceServerClient(url)
input_data = np.random.randn(*input_shape).astype(np.float32)
results = {
'success': 0,
'failure': 0,
'latencies': []
}
def send_request():
start = time.time()
try:
inputs = httpclient.InferInput("input", input_data.shape, "FP32")
inputs.set_data_from_numpy(input_data)
client.infer(model_name, inputs=[inputs])
latency = (time.time() - start) * 1000 # ms
results['latencies'].append(latency)
results['success'] += 1
except Exception as e:
results['failure'] += 1
print(f"Error: {e}")
start_time = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as executor:
futures = [executor.submit(send_request) for _ in range(num_requests)]
concurrent.futures.wait(futures)
duration = time.time() - start_time
# Calculate statistics
latencies = sorted(results['latencies'])
return {
'total_requests': num_requests,
'success': results['success'],
'failure': results['failure'],
'duration_s': duration,
'throughput': results['success'] / duration,
'latency_p50': np.percentile(latencies, 50),
'latency_p95': np.percentile(latencies, 95),
'latency_p99': np.percentile(latencies, 99),
'latency_avg': np.mean(latencies),
}
# Run test
stats = load_test(concurrency=50, num_requests=1000)
print(f"""
Load Test Results:
------------------
Total Requests: {stats['total_requests']}
Success: {stats['success']}
Failure: {stats['failure']}
Duration: {stats['duration_s']:.2f}s
Throughput: {stats['throughput']:.2f} req/s
Latency (avg): {stats['latency_avg']:.2f}ms
Latency (p50): {stats['latency_p50']:.2f}ms
Latency (p95): {stats['latency_p95']:.2f}ms
Latency (p99): {stats['latency_p99']:.2f}ms
""")
Disaster Recovery
Backup Strategies
# Backup model repository
aws s3 sync s3://prod-models/ s3://backup-models/ \
--exclude "*/2/*" \
--exclude "*/1/*" # Keep only latest versions
# Backup configuration
kubectl get configmap triton-config -o yaml > backup/triton-config.yaml
kubectl get deployment triton-server -o yaml > backup/triton-deployment.yaml
Rollback Procedure
# 1. Identify issue
kubectl logs -l app=triton-server --tail=100
# 2. Rollback deployment
kubectl rollout undo deployment/triton-server
# 3. Verify
kubectl rollout status deployment/triton-server
# 4. Check health
curl http://triton-service:8000/v2/health/ready
Documentation
Model Cards
Create models/resnet50/README.md:
# ResNet50 Image Classification
## Model Information
- **Version**: 3.0.0
- **Framework**: PyTorch 2.0
- **Input**: RGB image (3, 224, 224)
- **Output**: 1000 class probabilities
- **Precision**: FP16
## Performance
- Latency (p95): 5ms
- Throughput: 2000 infer/s
- GPU: NVIDIA T4
## Deployment
- Instance Count: 2
- Batch Size: 8
- Dynamic Batching: Enabled
## Change Log
### v3.0.0 (2024-10-01)
- Improved accuracy by 2%
- Reduced latency by 15%
- Updated to PyTorch 2.0
### v2.0.0 (2024-08-01)
- Architecture changes
- TensorRT optimization
Checklist for Production
Models
- Version control implemented
- Model cards documented
- Validation tests pass
- Rollback strategy defined

Configuration
- Environment-specific configs
- Resource limits set
- Dynamic batching tuned
- Warmup configured

Security
- TLS/SSL enabled
- Authentication configured
- Input validation implemented
- Network policies defined

Monitoring
- Metrics collection enabled
- Alerts configured
- Dashboards created
- Log aggregation setup

Reliability
- Health checks configured
- Auto-scaling enabled
- Circuit breakers implemented
- Retry logic added

Performance
- Load testing completed
- Latency targets met
- GPU utilization optimized
- Bottlenecks identified

Operations
- Deployment automation
- Backup strategy
- Rollback procedure
- Runbooks created
Next Steps
- Troubleshooting - Common issues and solutions
- Performance Optimization - Tuning for latency and throughput