Troubleshooting
Common issues and solutions when working with NVIDIA Triton Inference Server.
Server Issues
Server Won't Start
Symptom: Container exits immediately or won't start
Check logs:
docker logs <container_id>
Common causes and solutions:
- Invalid model repository path
# Error
E1019 10:00:00.000 server.cc:123] failed to stat '/models': No such file or directory
# Solution
# Verify path exists and is mounted correctly
ls -la /path/to/models
docker run -v /correct/path:/models ...
- Permission denied
# Error
E1019 10:00:00.000 server.cc:456] failed to load model: permission denied
# Solution
chmod -R 755 /path/to/models
chown -R $USER:$USER /path/to/models
- Port already in use
# Error
bind: address already in use
# Solution
# Find process using port
lsof -i :8000
# Kill process or use different port
docker run -p 8080:8000 ...
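To confirm from a script which of the default Triton ports (8000 HTTP, 8001 gRPC, 8002 metrics) are already taken, here is a minimal sketch using only the Python standard library:
# check_ports.py - report which default Triton ports are already in use locally
import socket

for port in (8000, 8001, 8002):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        in_use = s.connect_ex(("127.0.0.1", port)) == 0
    print(f"port {port}: {'in use' if in_use else 'free'}")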
Server Crashes Randomly
Symptom: Server exits unexpectedly
Debugging steps:
- Check OOM (Out of Memory)
# Check system logs
dmesg | grep -i "out of memory"
# Monitor memory usage
docker stats
# Solution: Increase memory limit
docker run --memory=16g ...
- GPU out of memory
# Check GPU memory
nvidia-smi
# Solution: Reduce instances or batch size
instance_group [
{ count: 1, kind: KIND_GPU }
]
- Enable core dumps
ulimit -c unlimited
docker run --ulimit core=-1 ...
Model Loading Issues
Model Shows as UNAVAILABLE
Symptom: Model status is UNAVAILABLE
Check model status:
curl localhost:8000/v2/models/my_model
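The same check can be scripted with the Python HTTP client; a minimal sketch (the model name my_model is a placeholder):
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
print("server ready:", client.is_server_ready())
print("model ready:", client.is_model_ready("my_model"))
# The metadata lists the exact input/output names, shapes, and datatypes Triton expects
print(client.get_model_metadata("my_model"))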
Common causes:
- Missing model files
# Verify structure
ls -R models/my_model/
# Should show:
# models/my_model/
# ├── config.pbtxt
# └── 1/
#     └── model.onnx
- Invalid config.pbtxt
# Test config syntax
cat models/my_model/config.pbtxt
# Common errors:
# - Missing closing brackets
# - Wrong platform name
# - Invalid data types
- Incompatible model format
# Error
E1019 10:00:00.000 model.cc:789] unable to load model: version not supported
# Solution: Check platform compatibility
platform: "onnxruntime_onnx" # For ONNX
platform: "tensorflow_savedmodel" # For TensorFlow
platform: "pytorch_libtorch" # For PyTorch
Model Takes Too Long to Load
Symptom: Server hangs during model loading
Solutions:
- Increase timeout
tritonserver \
--model-repository=/models \
--backend-config=tensorflow,allow_growth=true \
--exit-timeout-secs=60
- Load models explicitly (a client-side variant is sketched after this list)
tritonserver \
--model-repository=/models \
--model-control-mode=explicit \
--load-model=model1 \
--load-model=model2
- Use model warmup
model_warmup [
  {
    name: "warmup"
    batch_size: 1
    inputs {
      key: "input"
      value: {
        data_type: TYPE_FP32
        dims: [ 3, 224, 224 ]
        zero_data: true
      }
    }
  }
]
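If the server was started with --model-control-mode=explicit (second bullet above), models can also be loaded and unloaded on demand from a client rather than all at startup; a minimal sketch with the Python HTTP client:
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
# Only valid when the server runs in explicit model control mode
client.load_model("model1")
print("loaded:", client.is_model_ready("model1"))
client.unload_model("model1")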
Inference Issues
Inference Requests Fail
Symptom: Getting errors on inference requests
Common errors:
- Input/Output name mismatch
# Error
Invalid argument: unexpected inference input 'INPUT', expecting 'input'
# Solution: Use correct names from model metadata
curl localhost:8000/v2/models/my_model/config
- Shape mismatch
# Error
Invalid argument: unexpected shape for input 'input', expecting [3,224,224], got [224,224,3]
# Solution: Verify input shape
import numpy as np
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32) # Correct
# Not: (1, 224, 224, 3)
- Data type mismatch
# Error
Invalid argument: unexpected datatype TYPE_FP64 for input 'input', expecting TYPE_FP32
# Solution: Use correct dtype
input_data = input_data.astype(np.float32) # Not float64
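All three mismatches above can be avoided by taking the names, shapes, and datatypes from the model configuration instead of guessing. A minimal request sketch with the Python HTTP client; the model name my_model and the input/output names, FP32 datatype, and [3, 224, 224] shape are assumptions carried over from the examples above:
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Build the input exactly as the model config declares it
data = np.random.randn(1, 3, 224, 224).astype(np.float32)
infer_input = httpclient.InferInput("input", list(data.shape), "FP32")
infer_input.set_data_from_numpy(data)

result = client.infer("my_model", inputs=[infer_input])
print(result.as_numpy("output").shape)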
Slow Inference
Symptom: High latency or low throughput
Diagnosis:
# Use perf_analyzer
perf_analyzer -m my_model -u localhost:8000 --concurrency-range 1:8
# Check metrics
curl localhost:8002/metrics | grep nv_inference
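If perf_analyzer is not available where the client runs, a rough client-side timing loop can still show where latency sits; a sketch (model and tensor names mirror the earlier examples and are assumptions):
import time
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
data = np.random.randn(1, 3, 224, 224).astype(np.float32)
infer_input = httpclient.InferInput("input", list(data.shape), "FP32")
infer_input.set_data_from_numpy(data)

# Time repeated requests and report latency percentiles in milliseconds
latencies = []
for _ in range(50):
    start = time.perf_counter()
    client.infer("my_model", inputs=[infer_input])
    latencies.append((time.perf_counter() - start) * 1000.0)

for p in (50, 95, 99):
    print(f"p{p}: {np.percentile(latencies, p):.1f} ms")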
Common causes:
- Not using dynamic batching
# Solution: Enable dynamic batching
dynamic_batching {
preferred_batch_size: [ 4, 8 ]
max_queue_delay_microseconds: 100
}
- Too many instances
# Bad: Too many instances competing for GPU
instance_group [
{ count: 8, kind: KIND_GPU }
]
# Better: a smaller count sized to the GPU
instance_group [
{ count: 2, kind: KIND_GPU }
]
- CPU backend on GPU model
# Change from CPU to GPU
instance_group [
{ count: 1, kind: KIND_GPU } # Not KIND_CPU
]
GPU Issues
CUDA Errors
Symptom: CUDA-related errors in logs
- CUDA out of memory
# Error
CUDA error: out of memory
# Solution 1: Reduce batch size
max_batch_size: 4 # Instead of 32
# Solution 2: Reduce instances
instance_group [
{ count: 1, kind: KIND_GPU } # Instead of 4
]
# Solution 3: Limit GPU memory
parameters {
key: "gpu_memory_fraction"
value: { string_value: "0.5" }
}
- CUDA driver version mismatch
# Error
CUDA driver version is insufficient
# Solution: Update NVIDIA drivers
sudo apt-get update
sudo apt-get install --reinstall nvidia-driver-535
nvidia-smi # Verify
GPU Not Detected
Symptom: Triton doesn't detect the GPU
Debugging:
# Check GPU visibility
nvidia-smi
# Check Docker GPU access
docker run --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
# If fails, install nvidia-container-toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
Network Issues
Cannot Connect to Server
Symptom: Connection refused or timeout
Debugging steps:
- Check server is running
docker ps | grep triton
- Check port binding
netstat -tulpn | grep 8000
- Test locally
# From host
curl localhost:8000/v2/health/ready
# From container
docker exec <container_id> curl localhost:8000/v2/health/ready
- Check firewall
# Ubuntu/Debian
sudo ufw status
sudo ufw allow 8000/tcp
# CentOS/RHEL
sudo firewall-cmd --add-port=8000/tcp --permanent
sudo firewall-cmd --reload
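When connectivity is intermittent (for example, the server is reachable but still loading models), polling readiness with retries is more informative than a single curl; a small sketch, assuming the default HTTP port:
import time
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
for attempt in range(30):
    try:
        if client.is_server_ready():
            print("server is ready")
            break
    except Exception as exc:  # connection refused, timeouts, etc.
        print(f"attempt {attempt}: not reachable yet ({exc})")
    time.sleep(2)
else:
    print("server never became ready")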
GRPC Connection Issues
Symptom: GRPC requests fail but HTTP works
Solutions:
- Check GRPC port
# Ensure port 8001 is exposed
docker run -p 8001:8001 ...
- Test GRPC connection
import tritonclient.grpc as grpcclient

try:
    client = grpcclient.InferenceServerClient("localhost:8001")
    print(client.is_server_live())
except Exception as e:
    print(f"Connection failed: {e}")
Performance Issues
Low GPU Utilization
Symptom: GPU usage < 50%
Diagnosis:
# Monitor GPU
nvidia-smi dmon -s u
# Check metrics
curl localhost:8002/metrics | grep nv_gpu_utilization
Solutions:
- Increase concurrency (a Python client sketch follows this list)
# Send more concurrent requests
perf_analyzer -m model --concurrency-range 8:16
- Enable dynamic batching
dynamic_batching {
preferred_batch_size: [ 8, 16 ]
}
- Increase instances
instance_group [
{ count: 2, kind: KIND_GPU }
]
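Besides perf_analyzer, concurrent load can be generated directly from Python with the HTTP client's async API; a rough sketch (model and tensor names are assumptions):
import numpy as np
import tritonclient.http as httpclient

# 'concurrency' controls how many HTTP connections the client keeps in flight
client = httpclient.InferenceServerClient(url="localhost:8000", concurrency=8)

data = np.random.randn(1, 3, 224, 224).astype(np.float32)
infer_input = httpclient.InferInput("input", list(data.shape), "FP32")
infer_input.set_data_from_numpy(data)

# Issue requests without waiting for each response, then collect the results
pending = [client.async_infer("my_model", inputs=[infer_input]) for _ in range(64)]
results = [req.get_result() for req in pending]
print("completed:", len(results))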
High Queue Time
Symptom: Requests spend too long in the queue
Check metrics:
curl localhost:8002/metrics | grep nv_inference_queue_duration_us
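These counters are cumulative, so a single scrape is hard to interpret; the useful number is the delta divided by the number of requests completed in the same window. A rough sketch that samples the metrics endpoint twice, assuming the default metrics port:
import time
import urllib.request

METRICS_URL = "http://localhost:8002/metrics"

def sample(prefix):
    # Sum every sample of the cumulative counter whose name starts with `prefix`
    text = urllib.request.urlopen(METRICS_URL).read().decode()
    return sum(float(line.rsplit(" ", 1)[1])
               for line in text.splitlines() if line.startswith(prefix))

queue0, count0 = sample("nv_inference_queue_duration_us"), sample("nv_inference_request_success")
time.sleep(10)  # observation window
queue1, count1 = sample("nv_inference_queue_duration_us"), sample("nv_inference_request_success")

completed = count1 - count0
if completed > 0:
    print(f"avg queue time: {(queue1 - queue0) / completed:.1f} us per request")
else:
    print("no requests completed during the window")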
Solutions:
- Increase instance count
instance_group [
{ count: 4, kind: KIND_GPU } # More instances
]
- Adjust batching parameters
dynamic_batching {
max_queue_delay_microseconds: 50 # Reduce delay
preferred_batch_size: [ 4, 8 ]
}
- Add more servers
# Scale horizontally
kubectl scale deployment triton-server --replicas=3
Client Issues
Python Client Import Error
Symptom: Cannot import tritonclient
Solution:
# Install client
pip install tritonclient[all]
# Or specific protocol
pip install tritonclient[http]
pip install tritonclient[grpc]
# Verify
python -c "import tritonclient.http as httpclient; print('Success')"
Timeout Errors
Symptom: Client requests timeout
Solutions:
- Increase client timeout
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(
    url="localhost:8000",
    connection_timeout=60.0,  # Increase timeout
    network_timeout=120.0
)
- Check server health
curl localhost:8000/v2/health/ready
- Reduce request size
# Break large batch into smaller requests
batch_size = 32 # Instead of 128
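A sketch of that chunking approach, splitting one oversized batch into several smaller requests (model name, tensor names, and shapes are assumptions):
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
big_batch = np.random.randn(128, 3, 224, 224).astype(np.float32)

outputs = []
for start in range(0, len(big_batch), 32):  # send 32 samples at a time instead of 128
    chunk = big_batch[start:start + 32]
    infer_input = httpclient.InferInput("input", list(chunk.shape), "FP32")
    infer_input.set_data_from_numpy(chunk)
    result = client.infer("my_model", inputs=[infer_input])
    outputs.append(result.as_numpy("output"))

combined = np.concatenate(outputs, axis=0)
print(combined.shape)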
Kubernetes Issues
Pod CrashLoopBackOff
Symptom: Triton pods keep restarting
Debug:
# Check pod status
kubectl get pods -l app=triton-server
# Check logs
kubectl logs <pod-name>
# Describe pod
kubectl describe pod <pod-name>
Common causes:
- Insufficient memory
resources:
  limits:
    memory: "16Gi"  # Increase
  requests:
    memory: "8Gi"
- Failed health check
readinessProbe:
  initialDelaySeconds: 60  # Increase if model loads slowly
  timeoutSeconds: 10
- GPU not available
# Check GPU availability
kubectl get nodes -o json | jq '.items[].status.allocatable'
# Install NVIDIA device plugin if missing
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/main/nvidia-device-plugin.yml
PersistentVolume Issues
Symptom: Cannot mount model repository
Solutions:
- Check PV/PVC status
kubectl get pv
kubectl get pvc
kubectl describe pvc triton-models
- Verify storage class
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: triton-models
spec:
  accessModes:
    - ReadWriteMany  # Important for multiple pods
  storageClassName: nfs-client
  resources:
    requests:
      storage: 100Gi
Debugging Tips
Enable Verbose Logging
tritonserver \
--model-repository=/models \
--log-verbose=1 \
--log-info=1
Inspect Model Configuration
# Get auto-generated config
curl localhost:8000/v2/models/my_model/config
# Compare with config.pbtxt
cat models/my_model/config.pbtxt
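The served configuration can also be fetched programmatically, which makes it easy to diff against the local config.pbtxt; a small sketch with the Python HTTP client:
import json
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
# The HTTP client returns the effective (auto-completed) configuration as a dict
served_config = client.get_model_config("my_model")
print(json.dumps(served_config, indent=2))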
Test with Simple Model
Create a minimal test model to isolate issues:
# create_test_model.py
import torch
import torch.nn as nn
class SimpleModel(nn.Module):
    def forward(self, x):
        return x * 2

model = SimpleModel()
model.eval()
dummy_input = torch.randn(1, 10)
# Name the tensors to match config.pbtxt and keep the batch dimension dynamic
torch.onnx.export(model, dummy_input, "test_model.onnx",
                  input_names=["input"], output_names=["output"],
                  dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}})
# Create repository
mkdir -p models/test_model/1
mv test_model.onnx models/test_model/1/model.onnx
# Minimal config
cat > models/test_model/config.pbtxt << EOF
name: "test_model"
platform: "onnxruntime_onnx"
max_batch_size: 1
input [{ name: "input", data_type: TYPE_FP32, dims: [10] }]
output [{ name: "output", data_type: TYPE_FP32, dims: [10] }]
EOF
# Test
tritonserver --model-repository=./models
Performance Profiling
# Profile with perf_analyzer
perf_analyzer \
-m model \
-u localhost:8000 \
--measurement-interval 5000 \
--concurrency-range 1:32:4 \
--latency-report-file latency.csv \
--profile-export-file profile.json
# Analyze results
cat latency.csv
Getting Help
Collect Diagnostics
#!/bin/bash
# diagnostics.sh
echo "=== System Info ==="
uname -a
nvidia-smi
echo "=== Docker Info ==="
docker version
docker ps | grep triton
echo "=== Triton Logs ==="
docker logs <container_id> --tail 100
echo "=== Model Repository ==="
ls -R /path/to/models
echo "=== Server Health ==="
curl localhost:8000/v2/health/ready
echo "=== Metrics Sample ==="
curl localhost:8002/metrics | head -50
Community Resources
- GitHub Issues: https://github.com/triton-inference-server/server/issues
- NVIDIA Forums: https://forums.developer.nvidia.com/c/ai/triton-inference-server
- Documentation: https://docs.nvidia.com/deeplearning/triton-inference-server/
Reporting Issues
When reporting issues, include:
- Triton version: tritonserver --version
- Model config: config.pbtxt
- Server logs: docker logs <container_id>
- Client code: Minimal reproducible example
- Error messages: Full stack trace
- System info: GPU, OS, Docker version
Next Steps
- Review Best Practices for production deployments
- Check Performance Optimization for tuning
- Refer to Deployment for advanced setups