Deployment Strategies

Learn how to deploy NVIDIA Triton Inference Server in production environments, from single-node setups to large-scale Kubernetes clusters.

Deployment Options

1. Docker Standalone

Simplest deployment for development and small-scale production.

Basic Deployment

docker run -d --name triton-server \
--gpus all \
-p 8000:8000 -p 8001:8001 -p 8002:8002 \
-v /path/to/models:/models \
nvcr.io/nvidia/tritonserver:24.10-py3 \
tritonserver --model-repository=/models
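
Once the container is running, a quick way to confirm the server came up is to hit its HTTP health endpoints. The model name below is a placeholder for whatever is in your repository.

# Liveness and readiness of the server
curl -v http://localhost:8000/v2/health/live
curl -v http://localhost:8000/v2/health/ready

# Readiness of a single model (replace the placeholder name)
curl -v http://localhost:8000/v2/models/my_model/ready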

With Resource Limits

docker run -d --name triton-server \
--gpus '"device=0,1"' \
--memory=16g \
--cpus=8 \
--restart=unless-stopped \
-p 8000:8000 -p 8001:8001 -p 8002:8002 \
-v /models:/models \
nvcr.io/nvidia/tritonserver:24.10-py3 \
tritonserver \
--model-repository=/models \
--log-verbose=1 \
--strict-model-config=false

2. Docker Compose

For multi-container orchestration.

docker-compose.yml

version: '3.8'

services:
  triton:
    image: nvcr.io/nvidia/tritonserver:24.10-py3
    container_name: triton-inference-server
    command: tritonserver --model-repository=/models --log-verbose=1
    ports:
      - "8000:8000"  # HTTP
      - "8001:8001"  # GRPC
      - "8002:8002"  # Metrics
    volumes:
      - ./models:/models:ro
      - ./logs:/logs
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0,1
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/v2/health/ready"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 40s

  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    restart: unless-stopped

volumes:
  prometheus-data:
  grafana-data:

prometheus.yml

global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'triton'
    static_configs:
      - targets: ['triton:8002']

Start services:

docker-compose up -d
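
After the stack starts, a couple of quick checks confirm that Triton is healthy and that the metrics endpoint Prometheus scrapes is reachable (service names as defined in the compose file above):

# Container status and health
docker-compose ps

# Triton readiness and raw Prometheus-format metrics
curl http://localhost:8000/v2/health/ready
curl http://localhost:8002/metrics | head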

3. Kubernetes Deployment

Production-grade orchestration with Kubernetes.

Basic Deployment

# triton-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-inference-server
  labels:
    app: triton-server
spec:
  replicas: 3
  selector:
    matchLabels:
      app: triton-server
  template:
    metadata:
      labels:
        app: triton-server
    spec:
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:24.10-py3
          args:
            - tritonserver
            - --model-repository=s3://my-bucket/models
            - --strict-model-config=false
            - --log-verbose=1
          ports:
            - containerPort: 8000
              name: http
            - containerPort: 8001
              name: grpc
            - containerPort: 8002
              name: metrics
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: "16Gi"
              cpu: "8"
            requests:
              nvidia.com/gpu: 1
              memory: "8Gi"
              cpu: "4"
          livenessProbe:
            httpGet:
              path: /v2/health/live
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /v2/health/ready
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          env:
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: aws-credentials
                  key: access-key-id
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: aws-credentials
                  key: secret-access-key
---
apiVersion: v1
kind: Service
metadata:
  name: triton-service
spec:
  type: LoadBalancer
  selector:
    app: triton-server
  ports:
    - name: http
      port: 8000
      targetPort: 8000
    - name: grpc
      port: 8001
      targetPort: 8001
    - name: metrics
      port: 8002
      targetPort: 8002

Deploy:

kubectl apply -f triton-deployment.yaml
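
A minimal way to verify the rollout and exercise a health check through the Service (names match the manifest above):

# Wait for the Deployment to become available
kubectl rollout status deployment/triton-inference-server

# Inspect pods and the LoadBalancer address
kubectl get pods -l app=triton-server
kubectl get svc triton-service

# Port-forward and check readiness without going through the load balancer
kubectl port-forward svc/triton-service 8000:8000 &
curl http://localhost:8000/v2/health/ready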

Horizontal Pod Autoscaler

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: triton-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: triton-inference-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
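
CPU- and memory-based scaling requires the metrics-server to be installed in the cluster. Assuming the manifest is saved as triton-hpa.yaml (the filename is arbitrary), apply it and watch the autoscaler react to load:

kubectl apply -f triton-hpa.yaml
kubectl get hpa triton-hpa --watch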

4. Helm Chart Deployment

Use Helm for easier Kubernetes deployments.

Add Repository

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

Install with Custom Values

Create values.yaml:

image:
  repository: nvcr.io/nvidia/tritonserver
  tag: 24.10-py3
  pullPolicy: IfNotPresent

replicaCount: 3

service:
  type: LoadBalancer
  httpPort: 8000
  grpcPort: 8001
  metricsPort: 8002

modelRepositoryPath: s3://my-bucket/models

resources:
  limits:
    nvidia.com/gpu: 1
    memory: 16Gi
    cpu: 8
  requests:
    nvidia.com/gpu: 1
    memory: 8Gi
    cpu: 4

autoscaling:
  enabled: true
  minReplicas: 2
  maxReplicas: 10
  targetCPUUtilizationPercentage: 70

ingress:
  enabled: true
  className: nginx
  hosts:
    - host: triton.example.com
      paths:
        - path: /
          pathType: Prefix
  tls:
    - secretName: triton-tls
      hosts:
        - triton.example.com

metrics:
  enabled: true
  serviceMonitor:
    enabled: true

Install:

helm install triton nvidia/triton-inference-server -f values.yaml
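
To confirm the release and roll out configuration changes later (the release name triton matches the install command above):

# Release status and the values currently in effect
helm status triton
helm get values triton

# Apply changes made to values.yaml
helm upgrade triton nvidia/triton-inference-server -f values.yaml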

5. Cloud-Specific Deployments

AWS EKS

# Create EKS cluster with GPU nodes
eksctl create cluster \
--name triton-cluster \
--region us-west-2 \
--nodegroup-name gpu-nodes \
--node-type p3.2xlarge \
--nodes 3 \
--nodes-min 2 \
--nodes-max 5

# Install NVIDIA device plugin
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/main/nvidia-device-plugin.yml

# Deploy Triton
kubectl apply -f triton-deployment.yaml
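
Before deploying, it is worth confirming that the device plugin has registered the GPUs with the kubelet on each node:

# Device plugin pods should be running on every GPU node
kubectl get pods -n kube-system | grep nvidia-device-plugin

# Each GPU node should advertise nvidia.com/gpu capacity
kubectl describe nodes | grep -i "nvidia.com/gpu"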

AWS SageMaker

SageMaker hosts Triton through the generic sagemaker.model.Model class and the AWS-managed sagemaker-tritonserver container image, which is pulled from a region-specific ECR repository rather than from nvcr.io. The sketch below assumes the model repository has been packaged as model.tar.gz in S3.

import numpy as np
from sagemaker.model import Model
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# Region-specific SageMaker Triton image (account ID and available tags vary by region)
triton_image = "<account-id>.dkr.ecr.us-west-2.amazonaws.com/sagemaker-tritonserver:24.10-py3"

# Define model: the Triton model repository is packaged as a tar.gz in S3
triton_model = Model(
    image_uri=triton_image,
    model_data="s3://my-bucket/model-repository/model.tar.gz",
    role="arn:aws:iam::123456789012:role/SageMakerRole",
    # Default model to serve, as expected by the SageMaker Triton container
    env={"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": "model"},
)

# Deploy to a real-time endpoint
predictor = triton_model.deploy(
    instance_type="ml.g4dn.xlarge",
    initial_instance_count=2,
    endpoint_name="triton-endpoint",
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

# Inference using the KServe v2 JSON request format
payload = {
    "inputs": [{
        "name": "input",
        "shape": [1, 3, 224, 224],
        "datatype": "FP32",
        "data": np.random.randn(1, 3, 224, 224).tolist()
    }]
}
response = predictor.predict(payload)
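
The endpoint status can also be checked, and the endpoint torn down when no longer needed, with the AWS CLI:

# Wait until EndpointStatus is InService
aws sagemaker describe-endpoint --endpoint-name triton-endpoint

# Clean up to stop incurring charges
aws sagemaker delete-endpoint --endpoint-name triton-endpoint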

Google Cloud GKE

# Create GKE cluster
gcloud container clusters create triton-cluster \
--accelerator type=nvidia-tesla-t4,count=1 \
--machine-type n1-standard-4 \
--num-nodes 3 \
--zone us-central1-a

# Install NVIDIA drivers
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

# Deploy Triton
kubectl apply -f triton-deployment.yaml
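
The driver installer runs as a DaemonSet; check that it has completed on the GPU nodes before deploying:

# Driver installer pods in kube-system
kubectl get pods -n kube-system | grep nvidia-driver-installer

# GPU capacity should appear on the nodes once drivers are installed
kubectl describe nodes | grep -i "nvidia.com/gpu"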

Azure AKS

# Create AKS cluster with GPU
az aks create \
--resource-group myResourceGroup \
--name tritonCluster \
--node-count 3 \
--node-vm-size Standard_NC6 \
--generate-ssh-keys

# Fetch credentials so kubectl targets the new cluster
az aks get-credentials --resource-group myResourceGroup --name tritonCluster

# Install NVIDIA device plugin
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/main/nvidia-device-plugin.yml

# Deploy Triton
kubectl apply -f triton-deployment.yaml

6. Edge Deployment

For edge devices with limited resources.

NVIDIA Jetson

# Pull ARM-compatible image
docker pull nvcr.io/nvidia/tritonserver:24.10-py3-jetpack5.1

# Run on Jetson
docker run -d --runtime nvidia \
--network host \
-v /models:/models \
nvcr.io/nvidia/tritonserver:24.10-py3-jetpack5.1 \
tritonserver \
--model-repository=/models \
--backend-config=tensorflow,version=2

Resource-Constrained Setup

# Use minimal image and limit resources
docker run -d \
--cpus=2 \
--memory=4g \
-p 8000:8000 \
-v /models:/models \
nvcr.io/nvidia/tritonserver:24.10-py3-min \
tritonserver \
--model-repository=/models \
--model-control-mode=explicit \
--load-model=efficient_model
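
With --model-control-mode=explicit, models can be loaded and unloaded at runtime through Triton's model repository API, which keeps the memory footprint small on constrained hardware (the model name matches the example above):

# Load and unload models on demand
curl -X POST http://localhost:8000/v2/repository/models/efficient_model/load
curl -X POST http://localhost:8000/v2/repository/models/efficient_model/unload

# List models currently known to the repository
curl -X POST http://localhost:8000/v2/repository/index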

Load Balancing

NGINX Configuration

upstream triton_backend {
    least_conn;
    server triton-1:8000 max_fails=3 fail_timeout=30s;
    server triton-2:8000 max_fails=3 fail_timeout=30s;
    server triton-3:8000 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    server_name triton.example.com;

    location / {
        proxy_pass http://triton_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_connect_timeout 60s;
        proxy_send_timeout 300s;
        proxy_read_timeout 300s;
    }

    location /v2/health/ready {
        access_log off;
        proxy_pass http://triton_backend;
    }
}
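
After editing the configuration, validate the syntax and reload NGINX without dropping connections:

# Check syntax, then reload
nginx -t
nginx -s reload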

HAProxy Configuration

frontend triton_frontend
    bind *:8000
    mode http
    default_backend triton_backend

backend triton_backend
    mode http
    balance roundrobin
    option httpchk GET /v2/health/ready
    server triton1 triton-1:8000 check
    server triton2 triton-2:8000 check
    server triton3 triton-3:8000 check
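
HAProxy configurations can be validated before a reload in the same way (the path below is the default config location and may differ in your setup):

# Validate the configuration file
haproxy -c -f /etc/haproxy/haproxy.cfg

# Reload without interrupting existing connections (systemd-managed installs)
sudo systemctl reload haproxy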

High Availability

Active-Active Setup

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: triton-server
spec:
  serviceName: triton
  replicas: 3
  selector:
    matchLabels:
      app: triton
  template:
    metadata:
      labels:
        app: triton
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                  - key: app
                    operator: In
                    values:
                      - triton
              topologyKey: kubernetes.io/hostname
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:24.10-py3
          # ... rest of config
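
The anti-affinity rule forces the scheduler to place each replica on a different node; this can be confirmed by listing the pods with their node assignments:

# Each replica should land on a distinct node
kubectl get pods -l app=triton -o wide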

Multi-Region Deployment

# Region 1: us-west-2
kubectl config use-context us-west-2
kubectl apply -f triton-deployment.yaml

# Region 2: us-east-1
kubectl config use-context us-east-1
kubectl apply -f triton-deployment.yaml

# Global load balancer (AWS Route53, GCP Cloud DNS, etc.)

Security

TLS/SSL Configuration

apiVersion: v1
kind: Secret
metadata:
  name: triton-tls
type: kubernetes.io/tls
data:
  tls.crt: <base64-encoded-cert>
  tls.key: <base64-encoded-key>
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: triton-ingress
spec:
  tls:
    - hosts:
        - triton.example.com
      secretName: triton-tls
  rules:
    - host: triton.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: triton-service
                port:
                  number: 8000
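
Rather than base64-encoding the certificate by hand, the TLS secret can be created directly from the certificate and key files (the paths below are placeholders):

kubectl create secret tls triton-tls \
--cert=path/to/tls.crt \
--key=path/to/tls.key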

Authentication with API Gateway

# Kong API Gateway example
apiVersion: configuration.konghq.com/v1
kind: KongPlugin
metadata:
  name: triton-auth
plugin: key-auth
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: triton-ingress
  annotations:
    konghq.com/plugins: triton-auth
spec:
  # ... rest of ingress config
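
With key-auth enabled, requests must carry an API key; Kong's default is the apikey header (the key value below is a placeholder issued to a Kong consumer):

# A request without a key is rejected; with a valid key it is proxied to Triton
curl -H "apikey: <your-api-key>" https://triton.example.com/v2/health/ready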

Monitoring and Logging

Prometheus Integration

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'triton'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_app]
            action: keep
            regex: triton-server
          - source_labels: [__meta_kubernetes_pod_container_port_number]
            action: keep
            regex: "8002"
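
Once Prometheus is scraping the pods, Triton's counters (for example nv_inference_request_success) can be queried through the Prometheus HTTP API from inside the cluster, or inspected directly on a Triton pod:

# Ad-hoc query against the Prometheus HTTP API
curl 'http://prometheus:9090/api/v1/query?query=nv_inference_request_success'

# Raw metrics straight from a Triton pod
kubectl port-forward deploy/triton-inference-server 8002:8002 &
curl http://localhost:8002/metrics | grep nv_inference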

Grafana Dashboard

Import Triton dashboard ID: 16181

# Access Grafana
kubectl port-forward svc/grafana 3000:3000

# Navigate to http://localhost:3000
# Import dashboard with ID 16181

CI/CD Integration

GitHub Actions

name: Deploy Triton

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2

      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: us-west-2

      - name: Update kubeconfig
        run: aws eks update-kubeconfig --name triton-cluster

      - name: Deploy to Kubernetes
        run: kubectl apply -f k8s/triton-deployment.yaml

      - name: Wait for rollout
        run: kubectl rollout status deployment/triton-inference-server

Best Practices

  1. Use Health Checks: Always configure liveness and readiness probes
  2. Resource Limits: Set appropriate CPU, memory, and GPU limits
  3. Model Repository: Use cloud storage (S3, GCS) for centralized models
  4. Monitoring: Enable Prometheus metrics and Grafana dashboards
  5. Scaling: Use HPA for automatic scaling based on load
  6. High Availability: Deploy across multiple zones/regions
  7. Security: Enable TLS, authentication, and network policies
  8. Logging: Centralize logs using Elasticsearch or CloudWatch
  9. Versioning: Use semantic versioning for model deployments
  10. Testing: Test new models in staging before production

Next Steps