Deployment Strategies
Learn how to deploy NVIDIA Triton Inference Server in production environments, from single-node setups to large-scale Kubernetes clusters.
Deployment Options
1. Docker Standalone
The simplest deployment option, suited to development and small-scale production.
Basic Deployment
docker run -d --name triton-server \
--gpus all \
-p 8000:8000 -p 8001:8001 -p 8002:8002 \
-v /path/to/models:/models \
nvcr.io/nvidia/tritonserver:24.10-py3 \
tritonserver --model-repository=/models
With Resource Limits
docker run -d --name triton-server \
--gpus '"device=0,1"' \
--memory=16g \
--cpus=8 \
--restart=unless-stopped \
-p 8000:8000 -p 8001:8001 -p 8002:8002 \
-v /models:/models \
nvcr.io/nvidia/tritonserver:24.10-py3 \
tritonserver \
--model-repository=/models \
--log-verbose=1 \
--strict-model-config=false
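Once the container is up, a quick client-side check confirms the endpoints respond. Below is a minimal sketch using the tritonclient package (pip install "tritonclient[http]"); the model name resnet50 is a placeholder for whatever lives in your repository.
# Quick smoke test against a standalone Triton container.
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Liveness and readiness map to /v2/health/live and /v2/health/ready
print("server live: ", client.is_server_live())
print("server ready:", client.is_server_ready())

# "resnet50" is a placeholder; use a model from your own repository
print("model ready: ", client.is_model_ready("resnet50"))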
2. Docker Compose
Use Docker Compose to run Triton alongside its monitoring stack (Prometheus and Grafana) from a single file.
docker-compose.yml
version: '3.8'

services:
  triton:
    image: nvcr.io/nvidia/tritonserver:24.10-py3
    container_name: triton-inference-server
    command: tritonserver --model-repository=/models --log-verbose=1
    ports:
      - "8000:8000"   # HTTP
      - "8001:8001"   # gRPC
      - "8002:8002"   # Metrics
    volumes:
      - ./models:/models:ro
      - ./logs:/logs
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0,1
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/v2/health/ready"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    restart: unless-stopped

volumes:
  prometheus-data:
  grafana-data:
prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'triton'
    static_configs:
      - targets: ['triton:8002']
Start services:
docker-compose up -d
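To confirm the stack came up, poll each service's HTTP endpoint from the host. A small sketch using requests; the ports match the mappings in the compose file above.
# Verify the compose stack: Triton readiness, metrics, Prometheus, and Grafana.
import requests

checks = {
    "triton":     "http://localhost:8000/v2/health/ready",
    "metrics":    "http://localhost:8002/metrics",
    "prometheus": "http://localhost:9090/-/ready",
    "grafana":    "http://localhost:3000/api/health",
}

for name, url in checks.items():
    try:
        status = requests.get(url, timeout=5).status_code
    except requests.RequestException as exc:
        status = f"unreachable ({exc.__class__.__name__})"
    print(f"{name:10s} {url} -> {status}")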
3. Kubernetes Deployment
Production-grade orchestration with Kubernetes.
Basic Deployment
# triton-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-inference-server
  labels:
    app: triton-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: triton-server
  template:
    metadata:
      labels:
        app: triton-server
    spec:
      containers:
      - name: triton
        image: nvcr.io/nvidia/tritonserver:24.10-py3
        args:
        - tritonserver
        - --model-repository=s3://my-bucket/models
        - --strict-model-config=false
        - --log-verbose=1
        ports:
        - containerPort: 8000
          name: http
        - containerPort: 8001
          name: grpc
        - containerPort: 8002
          name: metrics
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: "16Gi"
            cpu: "8"
          requests:
            nvidia.com/gpu: 1
            memory: "8Gi"
            cpu: "4"
        livenessProbe:
          httpGet:
            path: /v2/health/live
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /v2/health/ready
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
        env:
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              name: aws-credentials
              key: access-key-id
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: aws-credentials
              key: secret-access-key
---
apiVersion: v1
kind: Service
metadata:
  name: triton-service
spec:
  type: LoadBalancer
  selector:
    app: triton-server
  ports:
  - name: http
    port: 8000
    targetPort: 8000
  - name: grpc
    port: 8001
    targetPort: 8001
  - name: metrics
    port: 8002
    targetPort: 8002
Deploy:
kubectl apply -f triton-deployment.yaml
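Once the Service has an external address (kubectl get svc triton-service), clients can send requests through the load balancer. A hedged sketch with tritonclient; the external IP, model name, input name, and shape are placeholders for your own deployment.
# Send a test inference through the LoadBalancer Service.
import numpy as np
import tritonclient.http as httpclient

# Replace EXTERNAL_IP with the address from `kubectl get svc triton-service`
client = httpclient.InferenceServerClient(url="EXTERNAL_IP:8000")

data = np.random.randn(1, 3, 224, 224).astype(np.float32)
infer_input = httpclient.InferInput("input", list(data.shape), "FP32")
infer_input.set_data_from_numpy(data)

result = client.infer(model_name="resnet50", inputs=[infer_input])
print(result.get_response())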
Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triton-inference-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
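To watch how the autoscaler reacts under load, you can read its status programmatically. A sketch using the official kubernetes Python client (pip install kubernetes), assuming the HPA lives in the default namespace.
# Inspect the Triton HPA's current state.
from kubernetes import client, config

config.load_kube_config()   # uses ~/.kube/config
autoscaling = client.AutoscalingV2Api()

hpa = autoscaling.read_namespaced_horizontal_pod_autoscaler(
    name="triton-hpa", namespace="default"
)
print("current replicas:", hpa.status.current_replicas)
print("desired replicas:", hpa.status.desired_replicas)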
4. Helm Chart Deployment
Use Helm for easier Kubernetes deployments.
Add Repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
Install with Custom Values
Create values.yaml:
image:
  repository: nvcr.io/nvidia/tritonserver
  tag: 24.10-py3
  pullPolicy: IfNotPresent

replicaCount: 3

service:
  type: LoadBalancer
  httpPort: 8000
  grpcPort: 8001
  metricsPort: 8002

modelRepositoryPath: s3://my-bucket/models

resources:
  limits:
    nvidia.com/gpu: 1
    memory: 16Gi
    cpu: 8
  requests:
    nvidia.com/gpu: 1
    memory: 8Gi
    cpu: 4

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70

ingress:
  enabled: true
  className: nginx
  hosts:
    - host: triton.example.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: triton-tls
      hosts:
        - triton.example.com

metrics:
  enabled: true
  serviceMonitor:
    enabled: true
Install:
helm install triton nvidia/triton-inference-server -f values.yaml
5. Cloud-Specific Deployments
AWS EKS
# Create EKS cluster with GPU nodes
eksctl create cluster \
--name triton-cluster \
--region us-west-2 \
--nodegroup-name gpu-nodes \
--node-type p3.2xlarge \
--nodes 3 \
--nodes-min 2 \
--nodes-max 5
# Install NVIDIA device plugin
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/main/nvidia-device-plugin.yml
# Deploy Triton
kubectl apply -f triton-deployment.yaml
AWS SageMaker
# SageMaker cannot pull nvcr.io images directly. The SageMaker Python SDK
# deploys Triton through the generic Model class, using the regional
# sagemaker-tritonserver container from AWS Deep Learning Containers
# (placeholder URI below) and a model repository packaged as a .tar.gz in S3.
import numpy as np
import sagemaker
from sagemaker.model import Model
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

session = sagemaker.Session()

# Define model
triton_model = Model(
    model_data="s3://my-bucket/model-repository/model.tar.gz",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    # Look up the sagemaker-tritonserver image URI for your region
    image_uri="<account>.dkr.ecr.<region>.amazonaws.com/sagemaker-tritonserver:<tag>",
    # Default model to serve from the repository (placeholder name)
    env={"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": "my_model"},
    sagemaker_session=session,
)

# Deploy
predictor = triton_model.deploy(
    instance_type="ml.g4dn.xlarge",
    initial_instance_count=2,
    endpoint_name="triton-endpoint",
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
    container_startup_health_check_timeout=600,
)

# Inference (KServe v2 JSON payload)
payload = {
    "inputs": [{
        "name": "input",
        "shape": [1, 3, 224, 224],
        "datatype": "FP32",
        "data": np.random.randn(1, 3, 224, 224).tolist(),
    }]
}
response = predictor.predict(payload)
Google Cloud GKE
# Create GKE cluster
gcloud container clusters create triton-cluster \
--accelerator type=nvidia-tesla-t4,count=1 \
--machine-type n1-standard-4 \
--num-nodes 3 \
--zone us-central1-a
# Install NVIDIA drivers
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
# Deploy Triton
kubectl apply -f triton-deployment.yaml
Azure AKS
# Create AKS cluster with GPU
az aks create \
--resource-group myResourceGroup \
--name tritonCluster \
--node-count 3 \
--node-vm-size Standard_NC6 \
--generate-ssh-keys
# Install NVIDIA device plugin
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/main/nvidia-device-plugin.yml
# Deploy Triton
kubectl apply -f triton-deployment.yaml
6. Edge Deployment
For edge devices with limited resources.
NVIDIA Jetson
# Pull the Jetson/iGPU image (check NGC for the tag matching your JetPack release)
docker pull nvcr.io/nvidia/tritonserver:24.10-py3-igpu
# Run on Jetson
docker run -d --runtime nvidia \
--network host \
-v /models:/models \
nvcr.io/nvidia/tritonserver:24.10-py3-igpu \
tritonserver \
--model-repository=/models \
--backend-config=tensorflow,version=2
Resource-Constrained Setup
# Cap CPU and memory, and load models explicitly to keep the footprint small.
# (The *-py3-min tag is only a base image for building custom containers with
# compose.py; it does not include the server, so use the standard or a custom image.)
docker run -d \
--cpus=2 \
--memory=4g \
-p 8000:8000 \
-v /models:/models \
nvcr.io/nvidia/tritonserver:24.10-py3 \
tritonserver \
--model-repository=/models \
--model-control-mode=explicit \
--load-model=efficient_model
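With --model-control-mode=explicit, models are loaded and unloaded on demand through the model control API rather than all at startup, which keeps memory use predictable on constrained devices. A sketch with tritonclient; the model name is a placeholder.
# Load and unload models on demand under explicit model control.
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

client.load_model("efficient_model")            # pull into memory when needed
print(client.is_model_ready("efficient_model"))

client.unload_model("efficient_model")          # free memory afterwards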
Load Balancing
NGINX Configuration
upstream triton_backend {
    least_conn;
    server triton-1:8000 max_fails=3 fail_timeout=30s;
    server triton-2:8000 max_fails=3 fail_timeout=30s;
    server triton-3:8000 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    server_name triton.example.com;

    location / {
        proxy_pass http://triton_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_connect_timeout 60s;
        proxy_send_timeout 300s;
        proxy_read_timeout 300s;
    }

    location /v2/health/ready {
        access_log off;
        proxy_pass http://triton_backend;
    }
}
HAProxy Configuration
frontend triton_frontend
    bind *:8000
    mode http
    default_backend triton_backend

backend triton_backend
    mode http
    balance roundrobin
    option httpchk GET /v2/health/ready
    server triton1 triton-1:8000 check
    server triton2 triton-2:8000 check
    server triton3 triton-3:8000 check
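Whichever proxy you use, it helps to probe each replica's readiness endpoint directly when debugging balancing issues. A small sketch, assuming the backend hostnames from the configs above are resolvable from wherever it runs.
# Probe each Triton replica behind the load balancer directly.
import requests

backends = ["triton-1", "triton-2", "triton-3"]

for host in backends:
    url = f"http://{host}:8000/v2/health/ready"
    try:
        ok = requests.get(url, timeout=3).status_code == 200
    except requests.RequestException:
        ok = False
    print(f"{host}: {'ready' if ok else 'NOT ready'}")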
High Availability
Active-Active Setup
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: triton-server
spec:
  serviceName: triton
  replicas: 3
  selector:
    matchLabels:
      app: triton
  template:
    metadata:
      labels:
        app: triton
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - triton
            topologyKey: kubernetes.io/hostname
      containers:
      - name: triton
        image: nvcr.io/nvidia/tritonserver:24.10-py3
        # ... rest of config
Multi-Region Deployment
# Region 1: us-west-2
kubectl config use-context us-west-2
kubectl apply -f triton-deployment.yaml
# Region 2: us-east-1
kubectl config use-context us-east-1
kubectl apply -f triton-deployment.yaml
# Global load balancer (AWS Route53, GCP Cloud DNS, etc.)
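Client-side failover between regional endpoints is a useful complement to DNS-based routing. A hedged sketch that tries each regional endpoint in order; the hostnames are illustrative placeholders.
# Use the first regional Triton endpoint that reports ready.
import tritonclient.http as httpclient

ENDPOINTS = [
    "triton.us-west-2.example.com:8000",
    "triton.us-east-1.example.com:8000",
]

def connect():
    for url in ENDPOINTS:
        try:
            client = httpclient.InferenceServerClient(url=url, connection_timeout=5.0)
            if client.is_server_ready():
                return client
        except Exception:
            continue   # try the next region
    raise RuntimeError("no Triton endpoint is reachable")

client = connect()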
Security
TLS/SSL Configuration
apiVersion: v1
kind: Secret
metadata:
  name: triton-tls
type: kubernetes.io/tls
data:
  tls.crt: <base64-encoded-cert>
  tls.key: <base64-encoded-key>
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: triton-ingress
spec:
  tls:
  - hosts:
    - triton.example.com
    secretName: triton-tls
  rules:
  - host: triton.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: triton-service
            port:
              number: 8000
Authentication with API Gateway
# Kong API Gateway example
apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
  name: triton-auth
plugin: key-auth
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: triton-ingress
  annotations:
    konghq.com/plugins: triton-auth
spec:
  # ... rest of ingress config
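Clients then have to present the key on every request. A sketch using requests; the apikey header is key-auth's default header name, and the hostname and key value are placeholders.
# Call Triton through the Kong-protected ingress over TLS.
import requests

response = requests.get(
    "https://triton.example.com/v2/health/ready",
    headers={"apikey": "my-consumer-key"},   # placeholder key
    timeout=10,
)
print(response.status_code)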
Monitoring and Logging
Prometheus Integration
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'triton'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_app]
            action: keep
            regex: triton-server
          - source_labels: [__meta_kubernetes_pod_container_port_number]
            action: keep
            regex: "8002"
Grafana Dashboard
Import Triton dashboard ID: 16181
# Access Grafana
kubectl port-forward svc/grafana 3000:3000
# Navigate to http://localhost:3000
# Import dashboard with ID 16181
CI/CD Integration
GitHub Actions
name: Deploy Triton

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-west-2

      - name: Update kubeconfig
        run: aws eks update-kubeconfig --name triton-cluster

      - name: Deploy to Kubernetes
        run: kubectl apply -f k8s/triton-deployment.yaml

      - name: Wait for rollout
        run: kubectl rollout status deployment/triton-inference-server
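A post-deploy smoke test is worth adding as a final workflow step so a bad rollout fails the pipeline. A sketch of such a script; the TRITON_URL environment variable and default model name are assumptions you would wire up in the workflow.
# smoke_test.py -- fail the CI job if Triton is not serving after rollout.
import os
import sys
import tritonclient.http as httpclient

url = os.environ.get("TRITON_URL", "localhost:8000")
model = os.environ.get("MODEL_NAME", "resnet50")   # placeholder default

client = httpclient.InferenceServerClient(url=url)

if not client.is_server_ready():
    sys.exit("Triton server is not ready")
if not client.is_model_ready(model):
    sys.exit(f"model '{model}' is not ready")

print("smoke test passed")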
Best Practices
- Use Health Checks: Always configure liveness and readiness probes
- Resource Limits: Set appropriate CPU, memory, and GPU limits
- Model Repository: Use cloud storage (S3, GCS) for centralized models
- Monitoring: Enable Prometheus metrics and Grafana dashboards
- Scaling: Use HPA for automatic scaling based on load
- High Availability: Deploy across multiple zones/regions
- Security: Enable TLS, authentication, and network policies
- Logging: Centralize logs using Elasticsearch or CloudWatch
- Versioning: Use semantic versioning for model deployments
- Testing: Test new models in staging before production
Next Steps
- Performance Optimization - Optimize throughput and latency
- Best Practices - Production-ready patterns
- Troubleshooting - Common issues and solutions