Deployment Strategies

Learn how to deploy NVIDIA Triton Inference Server in production environments, from single-node setups to large-scale Kubernetes clusters.

Deployment Options

1. Docker Standalone

Simplest deployment for development and small-scale production.

Basic Deployment

docker run -d --name triton-server \
--gpus all \
-p 8000:8000 -p 8001:8001 -p 8002:8002 \
-v /path/to/models:/models \
nvcr.io/nvidia/tritonserver:24.10-py3 \
tritonserver --model-repository=/models
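
Once the container is running, a quick way to confirm the server came up is to hit its HTTP health endpoints. The model name below is a placeholder for whatever is in your repository.

# Liveness and readiness of the server
curl -v http://localhost:8000/v2/health/live
curl -v http://localhost:8000/v2/health/ready

# Readiness of a single model (replace the placeholder name)
curl -v http://localhost:8000/v2/models/my_model/ready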

With Resource Limits

docker run -d --name triton-server \
--gpus '"device=0,1"' \
--memory=16g \
--cpus=8 \
--restart=unless-stopped \
-p 8000:8000 -p 8001:8001 -p 8002:8002 \
-v /models:/models \
nvcr.io/nvidia/tritonserver:24.10-py3 \
tritonserver \
--model-repository=/models \
--log-verbose=1 \
--strict-model-config=false

2. Docker Compose

For multi-container orchestration.

docker-compose.yml

version: '3.8'

services:
  triton:
    image: nvcr.io/nvidia/tritonserver:24.10-py3
    container_name: triton-inference-server
    command: tritonserver --model-repository=/models --log-verbose=1
    ports:
      - "8000:8000"  # HTTP
      - "8001:8001"  # GRPC
      - "8002:8002"  # Metrics
    volumes:
      - ./models:/models:ro
      - ./logs:/logs
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0,1
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/v2/health/ready"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    restart: unless-stopped

volumes:
  prometheus-data:
  grafana-data:

prometheus.yml

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'triton'
    static_configs:
      - targets: ['triton:8002']

Start services:

docker-compose up -d
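
After the stack starts, a couple of quick checks confirm that Triton is healthy and that the metrics endpoint Prometheus scrapes is reachable (service names as defined in the compose file above):

# Container status and health
docker-compose ps

# Triton readiness and raw Prometheus-format metrics
curl http://localhost:8000/v2/health/ready
curl http://localhost:8002/metrics | head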

3. Kubernetes Deployment

Production-grade orchestration with Kubernetes.

Basic Deployment

# triton-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-inference-server
  labels:
    app: triton-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: triton-server
  template:
    metadata:
      labels:
        app: triton-server
    spec:
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:24.10-py3
          args:
            - tritonserver
            - --model-repository=s3://my-bucket/models
            - --strict-model-config=false
            - --log-verbose=1
          ports:
            - containerPort: 8000
              name: http
            - containerPort: 8001
              name: grpc
            - containerPort: 8002
              name: metrics
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: "16Gi"
              cpu: "8"
            requests:
              nvidia.com/gpu: 1
              memory: "8Gi"
              cpu: "4"
          livenessProbe:
            httpGet:
              path: /v2/health/live
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /v2/health/ready
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          env:
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: aws-credentials
                  key: access-key-id
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: aws-credentials
                  key: secret-access-key
---
apiVersion: v1
kind: Service
metadata:
  name: triton-service
spec:
  type: LoadBalancer
  selector:
    app: triton-server
  ports:
    - name: http
      port: 8000
      targetPort: 8000
    - name: grpc
      port: 8001
      targetPort: 8001
    - name: metrics
      port: 8002
      targetPort: 8002

Deploy:

kubectl apply -f triton-deployment.yaml
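
A minimal way to verify the rollout and exercise a health check through the Service (names match the manifest above):

# Wait for the Deployment to become available
kubectl rollout status deployment/triton-inference-server

# Inspect pods and the LoadBalancer address
kubectl get pods -l app=triton-server
kubectl get svc triton-service

# Port-forward and check readiness without going through the load balancer
kubectl port-forward svc/triton-service 8000:8000 &
curl http://localhost:8000/v2/health/ready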

Horizontal Pod Autoscaler

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triton-inference-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
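
CPU- and memory-based scaling requires the metrics-server to be installed in the cluster. Assuming the manifest is saved as triton-hpa.yaml (the filename is arbitrary), apply it and watch the autoscaler react to load:

kubectl apply -f triton-hpa.yaml
kubectl get hpa triton-hpa --watch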

4. Helm Chart Deployment

Use Helm for easier Kubernetes deployments.

Add Repository

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

Install with Custom Values

Create values.yaml:

image:
  repository: nvcr.io/nvidia/tritonserver
  tag: 24.10-py3
  pullPolicy: IfNotPresent

replicaCount: 3

service:
  type: LoadBalancer
  httpPort: 8000
  grpcPort: 8001
  metricsPort: 8002

modelRepositoryPath: s3://my-bucket/models

resources:
  limits:
    nvidia.com/gpu: 1
    memory: 16Gi
    cpu: 8
  requests:
    nvidia.com/gpu: 1
    memory: 8Gi
    cpu: 4

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70

ingress:
  enabled: true
  className: nginx
  hosts:
    - host: triton.example.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: triton-tls
      hosts:
        - triton.example.com

metrics:
  enabled: true
  serviceMonitor:
    enabled: true

Install:

helm install triton nvidia/triton-inference-server -f values.yaml
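
To confirm the release and roll out configuration changes later (the release name triton matches the install command above):

# Release status and the values currently in effect
helm status triton
helm get values triton

# Apply changes made to values.yaml
helm upgrade triton nvidia/triton-inference-server -f values.yaml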

5. Cloud-Specific Deployments

AWS EKS

# Create EKS cluster with GPU nodes
eksctl create cluster \
--name triton-cluster \
--region us-west-2 \
--nodegroup-name gpu-nodes \
--node-type p3.2xlarge \
--nodes 3 \
--nodes-min 2 \
--nodes-max 5

# Install NVIDIA device plugin
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/main/nvidia-device-plugin.yml

# Deploy Triton
kubectl apply -f triton-deployment.yaml
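
Before deploying, it is worth confirming that the device plugin has registered the GPUs with the kubelet on each node:

# Device plugin pods should be running on every GPU node
kubectl get pods -n kube-system | grep nvidia-device-plugin

# Each GPU node should advertise nvidia.com/gpu capacity
kubectl describe nodes | grep -i "nvidia.com/gpu"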

AWS SageMaker

SageMaker hosts Triton through the generic sagemaker.model.Model class and the AWS-managed sagemaker-tritonserver container image, which is pulled from a region-specific ECR repository rather than from nvcr.io. The sketch below assumes the model repository has been packaged as model.tar.gz in S3.

import numpy as np
from sagemaker.model import Model
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# Region-specific SageMaker Triton image (account ID and available tags vary by region)
triton_image = "<account-id>.dkr.ecr.us-west-2.amazonaws.com/sagemaker-tritonserver:24.10-py3"

# Define model: the Triton model repository is packaged as a tar.gz in S3
triton_model = Model(
    image_uri=triton_image,
    model_data="s3://my-bucket/model-repository/model.tar.gz",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    # Default model to serve, as expected by the SageMaker Triton container
    env={"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": "model"},
)

# Deploy to a real-time endpoint
predictor = triton_model.deploy(
    instance_type="ml.g4dn.xlarge",
    initial_instance_count=2,
    endpoint_name="triton-endpoint",
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

# Inference using the KServe v2 JSON request format
payload = {
    "inputs": [{
        "name": "input",
        "shape": [1, 3, 224, 224],
        "datatype": "FP32",
        "data": np.random.randn(1, 3, 224, 224).tolist()
    }]
}
response = predictor.predict(payload)
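
The endpoint status can also be checked, and the endpoint torn down when no longer needed, with the AWS CLI:

# Wait until EndpointStatus is InService
aws sagemaker describe-endpoint --endpoint-name triton-endpoint

# Clean up to stop incurring charges
aws sagemaker delete-endpoint --endpoint-name triton-endpoint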

Google Cloud GKE

# Create GKE cluster
gcloud container clusters create triton-cluster \
--accelerator type=nvidia-tesla-t4,count=1 \
--machine-type n1-standard-4 \
--num-nodes 3 \
--zone us-central1-a

# Install NVIDIA drivers
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

# Deploy Triton
kubectl apply -f triton-deployment.yaml
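
The driver installer runs as a DaemonSet; check that it has completed on the GPU nodes before deploying:

# Driver installer pods in kube-system
kubectl get pods -n kube-system | grep nvidia-driver-installer

# GPU capacity should appear on the nodes once drivers are installed
kubectl describe nodes | grep -i "nvidia.com/gpu"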

Azure AKS

# Create AKS cluster with GPU
az aks create \
--resource-group myResourceGroup \
--name tritonCluster \
--node-count 3 \
--node-vm-size Standard_NC6 \
--generate-ssh-keys

# Fetch credentials so kubectl targets the new cluster
az aks get-credentials --resource-group myResourceGroup --name tritonCluster

# Install NVIDIA device plugin
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/main/nvidia-device-plugin.yml

# Deploy Triton
kubectl apply -f triton-deployment.yaml

6. Edge Deployment

For edge devices with limited resources.

NVIDIA Jetson

# Pull ARM-compatible image
docker pull nvcr.io/nvidia/tritonserver:24.10-py3-jetpack5.1

# Run on Jetson
docker run -d --runtime nvidia \
--network host \
-v /models:/models \
nvcr.io/nvidia/tritonserver:24.10-py3-jetpack5.1 \
tritonserver \
--model-repository=/models \
--backend-config=tensorflow,version=2

Resource-Constrained Setup

# Use minimal image and limit resources
docker run -d \
--cpus=2 \
--memory=4g \
-p 8000:8000 \
-v /models:/models \
nvcr.io/nvidia/tritonserver:24.10-py3-min \
tritonserver \
--model-repository=/models \
--model-control-mode=explicit \
--load-model=efficient_model
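
With --model-control-mode=explicit, models can be loaded and unloaded at runtime through Triton's model repository API, which keeps the memory footprint small on constrained hardware (the model name matches the example above):

# Load and unload models on demand
curl -X POST http://localhost:8000/v2/repository/models/efficient_model/load
curl -X POST http://localhost:8000/v2/repository/models/efficient_model/unload

# List models currently known to the repository
curl -X POST http://localhost:8000/v2/repository/index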

Load Balancing

NGINX Configuration

upstream triton_backend {
    least_conn;
    server triton-1:8000 max_fails=3 fail_timeout=30s;
    server triton-2:8000 max_fails=3 fail_timeout=30s;
    server triton-3:8000 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    server_name triton.example.com;

    location / {
        proxy_pass http://triton_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_connect_timeout 60s;
        proxy_send_timeout 300s;
        proxy_read_timeout 300s;
    }

    location /v2/health/ready {
        access_log off;
        proxy_pass http://triton_backend;
    }
}
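
After editing the configuration, validate the syntax and reload NGINX without dropping connections:

# Check syntax, then reload
nginx -t
nginx -s reload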

HAProxy Configuration

frontend triton_frontend
    bind *:8000
    mode http
    default_backend triton_backend

backend triton_backend
    mode http
    balance roundrobin
    option httpchk GET /v2/health/ready
    server triton1 triton-1:8000 check
    server triton2 triton-2:8000 check
    server triton3 triton-3:8000 check
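
HAProxy configurations can be validated before a reload in the same way (the path below is the default config location and may differ in your setup):

# Validate the configuration file
haproxy -c -f /etc/haproxy/haproxy.cfg

# Reload without interrupting existing connections (systemd-managed installs)
sudo systemctl reload haproxy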

High Availability

Active-Active Setup

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: triton-server
spec:
  serviceName: triton
  replicas: 3
  selector:
    matchLabels:
      app: triton
  template:
    metadata:
      labels:
        app: triton
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - triton
              topologyKey: kubernetes.io/hostname
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:24.10-py3
          # ... rest of config
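
The anti-affinity rule forces the scheduler to place each replica on a different node; this can be confirmed by listing the pods with their node assignments:

# Each replica should land on a distinct node
kubectl get pods -l app=triton -o wide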

Multi-Region Deployment

# Region 1: us-west-2
kubectl config use-context us-west-2
kubectl apply -f triton-deployment.yaml

# Region 2: us-east-1
kubectl config use-context us-east-1
kubectl apply -f triton-deployment.yaml

# Global load balancer (AWS Route53, GCP Cloud DNS, etc.)

Security

TLS/SSL Configuration

apiVersion: v1
kind: Secret
metadata:
  name: triton-tls
type: kubernetes.io/tls
data:
  tls.crt: <base64-encoded-cert>
  tls.key: <base64-encoded-key>
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: triton-ingress
spec:
  tls:
    - hosts:
        - triton.example.com
      secretName: triton-tls
  rules:
    - host: triton.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: triton-service
                port:
                  number: 8000
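
Rather than base64-encoding the certificate by hand, the TLS secret can be created directly from the certificate and key files (the paths below are placeholders):

kubectl create secret tls triton-tls \
--cert=path/to/tls.crt \
--key=path/to/tls.key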

Authentication with API Gateway

# Kong API Gateway example
apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
  name: triton-auth
plugin: key-auth
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: triton-ingress
  annotations:
    konghq.com/plugins: triton-auth
spec:
  # ... rest of ingress config
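
With key-auth enabled, requests must carry an API key; Kong's default is the apikey header (the key value below is a placeholder issued to a Kong consumer):

# A request without a key is rejected; with a valid key it is proxied to Triton
curl -H "apikey: <your-api-key>" https://triton.example.com/v2/health/ready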

Monitoring and Logging

Prometheus Integration

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'triton'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_app]
            action: keep
            regex: triton-server
          - source_labels: [__meta_kubernetes_pod_container_port_number]
            action: keep
            regex: "8002"
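
Once Prometheus is scraping the pods, Triton's counters (for example nv_inference_request_success) can be queried through the Prometheus HTTP API from inside the cluster, or inspected directly on a Triton pod:

# Ad-hoc query against the Prometheus HTTP API
curl 'http://prometheus:9090/api/v1/query?query=nv_inference_request_success'

# Raw metrics straight from a Triton pod
kubectl port-forward deploy/triton-inference-server 8002:8002 &
curl http://localhost:8002/metrics | grep nv_inference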

Grafana Dashboard

Import Triton dashboard ID: 16181

# Access Grafana
kubectl port-forward svc/grafana 3000:3000

# Navigate to http://localhost:3000
# Import dashboard with ID 16181

CI/CD Integration

GitHub Actions

name: Deploy Triton

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-west-2

      - name: Update kubeconfig
        run: aws eks update-kubeconfig --name triton-cluster

      - name: Deploy to Kubernetes
        run: kubectl apply -f k8s/triton-deployment.yaml

      - name: Wait for rollout
        run: kubectl rollout status deployment/triton-inference-server

Best Practices

  1. Use Health Checks: Always configure liveness and readiness probes
  2. Resource Limits: Set appropriate CPU, memory, and GPU limits
  3. Model Repository: Use cloud storage (S3, GCS) for centralized models
  4. Monitoring: Enable Prometheus metrics and Grafana dashboards
  5. Scaling: Use HPA for automatic scaling based on load
  6. High Availability: Deploy across multiple zones/regions
  7. Security: Enable TLS, authentication, and network policies
  8. Logging: Centralize logs using Elasticsearch or CloudWatch
  9. Versioning: Use semantic versioning for model deployments
  10. Testing: Test new models in staging before production

Next Steps