Troubleshooting

Common issues and solutions when working with NVIDIA Triton Inference Server.

Server Issues

Server Won't Start

Symptom: Container exits immediately or won't start

Check logs:

docker logs <container_id>

Common causes and solutions:

  1. Invalid model repository path
# Error
E1019 10:00:00.000 server.cc:123] failed to stat '/models': No such file or directory

# Solution
# Verify the path exists and is mounted correctly
ls -la /path/to/models
docker run -v /correct/path:/models ...
  2. Permission denied
# Error
E1019 10:00:00.000 server.cc:456] failed to load model: permission denied

# Solution
chmod -R 755 /path/to/models
chown -R $USER:$USER /path/to/models
  3. Port already in use (see the port-check sketch after this list)
# Error
bind: address already in use

# Solution
# Find the process using the port
lsof -i :8000
# Kill the process or map a different host port
docker run -p 8080:8000 ...
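
To rule out port conflicts before starting the container, a minimal port-check sketch in Python (assuming the default Triton ports 8000, 8001, and 8002) is:

# check_ports.py -- report which of Triton's default ports are already bound
import socket

TRITON_PORTS = {8000: "HTTP", 8001: "GRPC", 8002: "metrics"}

for port, proto in TRITON_PORTS.items():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            s.bind(("0.0.0.0", port))  # succeeds only if nothing is listening here
            print(f"port {port} ({proto}): free")
        except OSError:
            print(f"port {port} ({proto}): in use -- map a different host port")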

Server Crashes Randomly

Symptom: Server exits unexpectedly

Debugging steps:

  1. Check for OOM (Out of Memory) kills
# Check system logs
dmesg | grep -i "out of memory"

# Monitor memory usage
docker stats

# Solution: Increase the memory limit
docker run --memory=16g ...
  2. GPU out of memory
# Check GPU memory
nvidia-smi

# Solution: Reduce instance count or batch size
instance_group [
  { count: 1, kind: KIND_GPU }
]
  3. Enable core dumps
ulimit -c unlimited
docker run --ulimit core=-1 ...

Model Loading Issues

Model Shows as UNAVAILABLE

Symptom: Model status is UNAVAILABLE

Check model status:

curl localhost:8000/v2/models/my_model
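
The model repository index is often more informative than the per-model endpoint because it includes the load-failure reason; a minimal Python sketch (assuming pip install tritonclient[http]) is:

# model_status.py -- list each model's state and why it failed to load
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# The repository index reports READY/UNAVAILABLE plus the load-failure reason
for model in client.get_model_repository_index():
    print(model.get("name"), model.get("version", "-"),
          model.get("state", "-"), model.get("reason", ""))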

Common causes:

  1. Missing model files
# Verify structure
ls -R models/my_model/
# Should show:
# models/my_model/
# ├── config.pbtxt
# └── 1/
#     └── model.onnx
  2. Invalid config.pbtxt
# Inspect the config for syntax problems
cat models/my_model/config.pbtxt

# Common errors:
# - Missing closing brackets
# - Wrong platform name
# - Invalid data types
  3. Incompatible model format
# Error
E1019 10:00:00.000 model.cc:789] unable to load model: version not supported

# Solution: Check platform compatibility
platform: "onnxruntime_onnx"        # For ONNX
platform: "tensorflow_savedmodel"   # For TensorFlow
platform: "pytorch_libtorch"        # For PyTorch

Model Takes Too Long to Load

Symptom: Server hangs during model loading

Solutions:

  1. Increase timeout
tritonserver \
  --model-repository=/models \
  --backend-config=tensorflow,allow_growth=true \
  --exit-timeout-secs=60
  2. Load models explicitly (see the model-control sketch after this list)
tritonserver \
  --model-repository=/models \
  --model-control-mode=explicit \
  --load-model=model1 \
  --load-model=model2
  3. Use model warmup
model_warmup [
  {
    name: "warmup"
    batch_size: 1
    inputs {
      key: "input"
      value: {
        data_type: TYPE_FP32
        dims: [ 3, 224, 224 ]
        zero_data: true
      }
    }
  }
]
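
With --model-control-mode=explicit, models can also be loaded and unloaded at runtime through the client API, so slow models do not block server startup; a minimal sketch (the model name model1 is a placeholder) is:

# model_control.py -- load/unload models under explicit model control
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

client.load_model("model1")             # returns once the server has processed the load
print(client.is_model_ready("model1"))

client.unload_model("model1")           # frees the model's CPU/GPU memory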

Inference Issues

Inference Requests Fail

Symptom: Getting errors on inference requests

Common errors:

  1. Input/Output name mismatch (see the metadata sketch after this list)
# Error
Invalid argument: unexpected inference input 'INPUT', expecting 'input'

# Solution: Use the exact names reported by the model configuration
curl localhost:8000/v2/models/my_model/config
  2. Shape mismatch
# Error
Invalid argument: unexpected shape for input 'input', expecting [3,224,224], got [224,224,3]

# Solution: Verify the input shape
import numpy as np
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)  # Correct
# Not: (1, 224, 224, 3)
  3. Data type mismatch
# Error
Invalid argument: unexpected datatype TYPE_FP64 for input 'input', expecting TYPE_FP32

# Solution: Use the correct dtype
input_data = input_data.astype(np.float32)  # Not float64
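
To avoid all three mismatches at once, read the declared names, datatypes, and shapes from the model metadata and build requests from them; a minimal sketch (assuming the HTTP client and a model named my_model) is:

# inspect_io.py -- print the exact input/output names, datatypes, and shapes
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
metadata = client.get_model_metadata("my_model")  # replace with your model name

for tensor in metadata["inputs"]:
    print("input :", tensor["name"], tensor["datatype"], tensor["shape"])
for tensor in metadata["outputs"]:
    print("output:", tensor["name"], tensor["datatype"], tensor["shape"])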

Slow Inference

Symptom: High latency or low throughput

Diagnosis:

# Use perf_analyzer
perf_analyzer -m my_model -u localhost:8000 --concurrency-range 1:8

# Check metrics
curl localhost:8002/metrics | grep nv_inference

Common causes:

  1. Dynamic batching not enabled
# Solution: Enable dynamic batching
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
  2. Too many instances
# Bad: Too many instances competing for the GPU
instance_group [
  { count: 8, kind: KIND_GPU }
]

# Good: A smaller, tuned instance count
instance_group [
  { count: 2, kind: KIND_GPU }
]
  3. Running on CPU instead of GPU
# Change from CPU to GPU
instance_group [
  { count: 1, kind: KIND_GPU }  # Not KIND_CPU
]

GPU Issues

CUDA Errors

Symptom: CUDA-related errors in logs

  1. CUDA out of memory (see the GPU memory sketch after this list)
# Error
CUDA error: out of memory

# Solution 1: Reduce batch size
max_batch_size: 4  # Instead of 32

# Solution 2: Reduce instances
instance_group [
  { count: 1, kind: KIND_GPU }  # Instead of 4
]

# Solution 3: Limit GPU memory
parameters {
  key: "gpu_memory_fraction"
  value: { string_value: "0.5" }
}
  2. CUDA driver version mismatch
# Error
CUDA driver version is insufficient

# Solution: Update NVIDIA drivers
sudo apt-get update
sudo apt-get install --reinstall nvidia-driver-535
nvidia-smi  # Verify
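
To see how much memory each GPU actually has free while models are loaded, a minimal sketch using the NVML Python bindings (pip install nvidia-ml-py; the pynvml module is assumed to be available) is:

# gpu_memory.py -- print used/free memory per GPU via NVML
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"GPU {i}: used {mem.used / 1e9:.1f} GB / "
              f"total {mem.total / 1e9:.1f} GB (free {mem.free / 1e9:.1f} GB)")
finally:
    pynvml.nvmlShutdown()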

GPU Not Detected

Symptom: Triton doesn't see GPU

Debugging:

# Check GPU visibility
nvidia-smi

# Check Docker GPU access
docker run --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi

# If fails, install nvidia-container-toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

Network Issues

Cannot Connect to Server

Symptom: Connection refused or timeout

Debugging steps:

  1. Check that the server is running
docker ps | grep triton
  2. Check port binding
netstat -tulpn | grep 8000
  3. Test locally (see the health-check sketch after this list)
# From the host
curl localhost:8000/v2/health/ready

# From inside the container
docker exec <container_id> curl localhost:8000/v2/health/ready
  4. Check firewall rules
# Ubuntu/Debian
sudo ufw status
sudo ufw allow 8000/tcp

# CentOS/RHEL
sudo firewall-cmd --add-port=8000/tcp --permanent
sudo firewall-cmd --reload
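
The same liveness/readiness checks can be scripted in Python, which helps when curl works on the host but your client machine cannot connect; a minimal sketch using the HTTP client is:

# health_check.py -- programmatic liveness/readiness check over HTTP
import tritonclient.http as httpclient

try:
    client = httpclient.InferenceServerClient(url="localhost:8000", connection_timeout=5.0)
    print("live: ", client.is_server_live())
    print("ready:", client.is_server_ready())
except Exception as e:
    print(f"Cannot reach server: {e}")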

GRPC Connection Issues

Symptom: GRPC requests fail but HTTP works

Solutions:

  1. Check the GRPC port
# Ensure port 8001 is exposed
docker run -p 8001:8001 ...
  2. Test the GRPC connection
import tritonclient.grpc as grpcclient

try:
    client = grpcclient.InferenceServerClient("localhost:8001")
    print(client.is_server_live())
except Exception as e:
    print(f"Connection failed: {e}")

Performance Issues

Low GPU Utilization

Symptom: GPU usage < 50%

Diagnosis:

# Monitor GPU
nvidia-smi dmon -s u

# Check metrics
curl localhost:8002/metrics | grep nv_gpu_utilization

Solutions:

  1. Increase request concurrency (see the async client sketch after this list)
# Send more concurrent requests
perf_analyzer -m model --concurrency-range 8:16
  2. Enable dynamic batching
dynamic_batching {
  preferred_batch_size: [ 8, 16 ]
}
  3. Increase the instance count
instance_group [
  { count: 2, kind: KIND_GPU }
]
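
From application code, the equivalent of higher perf_analyzer concurrency is keeping several requests in flight at once; a minimal sketch using the HTTP client's async_infer (the model name my_model, tensor names input/output, and the input shape are assumptions; take them from your model metadata) is:

# concurrent_requests.py -- keep several inference requests in flight at once
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000", concurrency=8)

data = np.random.randn(1, 3, 224, 224).astype(np.float32)  # shape assumed
inp = httpclient.InferInput("input", data.shape, "FP32")    # input name assumed
inp.set_data_from_numpy(data)

# Issue 8 requests without waiting for each one to finish
pending = [client.async_infer("my_model", inputs=[inp]) for _ in range(8)]
for req in pending:
    result = req.get_result()
    print(result.as_numpy("output").shape)                  # output name assumed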

High Queue Time

Symptom: Requests spending too long in queue

Check metrics:

curl localhost:8002/metrics | grep nv_inference_queue_duration_us
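
To turn those cumulative counters into an average queue time per request, a small sketch that scrapes the metrics endpoint (assuming the default metrics port 8002 and dividing nv_inference_queue_duration_us by nv_inference_request_success) is:

# queue_time.py -- estimate average queue time per successful request
import re
import urllib.request

text = urllib.request.urlopen("http://localhost:8002/metrics").read().decode()

def counter_sum(name):
    # Sum the counter across all model/version label combinations
    pattern = r"^" + name + r"\{[^}]*\} ([0-9.eE+]+)$"
    return sum(float(v) for v in re.findall(pattern, text, re.M))

queue_us = counter_sum("nv_inference_queue_duration_us")
ok_requests = counter_sum("nv_inference_request_success")
if ok_requests:
    print(f"average queue time: {queue_us / ok_requests:.1f} us per request")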

Solutions:

  1. Increase the instance count
instance_group [
  { count: 4, kind: KIND_GPU }  # More instances
]
  2. Adjust batching parameters
dynamic_batching {
  max_queue_delay_microseconds: 50  # Reduce delay
  preferred_batch_size: [ 4, 8 ]
}
  3. Add more servers
# Scale horizontally
kubectl scale deployment triton-server --replicas=3

Client Issues

Python Client Import Error

Symptom: Cannot import tritonclient

Solution:

# Install client
pip install tritonclient[all]

# Or specific protocol
pip install tritonclient[http]
pip install tritonclient[grpc]

# Verify
python -c "import tritonclient.http as httpclient; print('Success')"

Timeout Errors

Symptom: Client requests timeout

Solutions:

  1. Increase client timeouts
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(
    url="localhost:8000",
    connection_timeout=60.0,  # Increase timeouts
    network_timeout=120.0,
)
  2. Check server health
curl localhost:8000/v2/health/ready
  3. Reduce request size (see the chunking sketch after this list)
# Break a large batch into smaller requests
batch_size = 32  # Instead of 128
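
A minimal sketch of splitting one oversized batch into several smaller requests (model name, tensor names, and shapes are assumptions; match them to your model) is:

# chunked_infer.py -- split a large batch into smaller inference requests
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

full_batch = np.random.randn(128, 3, 224, 224).astype(np.float32)  # shape assumed
chunk_size = 32
results = []

for start in range(0, full_batch.shape[0], chunk_size):
    chunk = full_batch[start:start + chunk_size]
    inp = httpclient.InferInput("input", chunk.shape, "FP32")  # input name assumed
    inp.set_data_from_numpy(chunk)
    response = client.infer("my_model", inputs=[inp])          # model name assumed
    results.append(response.as_numpy("output"))                # output name assumed

print(np.concatenate(results, axis=0).shape)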

Kubernetes Issues

Pod CrashLoopBackOff

Symptom: Triton pods keep restarting

Debug:

# Check pod status
kubectl get pods -l app=triton-server

# Check logs
kubectl logs <pod-name>

# Describe pod
kubectl describe pod <pod-name>

Common causes:

  1. Insufficient memory
resources:
  limits:
    memory: "16Gi"  # Increase
  requests:
    memory: "8Gi"
  2. Failed health check
readinessProbe:
  initialDelaySeconds: 60  # Increase if models load slowly
  timeoutSeconds: 10
  3. GPU not available
# Check GPU availability
kubectl get nodes -o json | jq '.items[].status.allocatable'

# Install the NVIDIA device plugin if it is missing
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/main/nvidia-device-plugin.yml

PersistentVolume Issues

Symptom: Cannot mount model repository

Solutions:

  1. Check PV/PVC status
kubectl get pv
kubectl get pvc
kubectl describe pvc triton-models
  2. Verify the storage class
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: triton-models
spec:
  accessModes:
    - ReadWriteMany  # Important for multiple pods
  storageClassName: nfs-client
  resources:
    requests:
      storage: 100Gi

Debugging Tips

Enable Verbose Logging

tritonserver \
  --model-repository=/models \
  --log-verbose=1 \
  --log-info=1

Inspect Model Configuration

# Get auto-generated config
curl localhost:8000/v2/models/my_model/config

# Compare with config.pbtxt
cat models/my_model/config.pbtxt

Test with Simple Model

Create minimal test model to isolate issues:

# create_test_model.py
import torch
import torch.nn as nn

class SimpleModel(nn.Module):
    def forward(self, x):
        return x * 2

model = SimpleModel()
model.eval()

dummy_input = torch.randn(1, 10)
# Name the tensors so they match config.pbtxt below
torch.onnx.export(
    model, dummy_input, "test_model.onnx",
    input_names=["input"], output_names=["output"],
)

# Create repository
mkdir -p models/test_model/1
mv test_model.onnx models/test_model/1/model.onnx

# Minimal config
cat > models/test_model/config.pbtxt << EOF
name: "test_model"
platform: "onnxruntime_onnx"
max_batch_size: 1
input [{ name: "input", data_type: TYPE_FP32, dims: [10] }]
output [{ name: "output", data_type: TYPE_FP32, dims: [10] }]
EOF

# Test
tritonserver --model-repository=./models

Performance Profiling

# Profile with perf_analyzer
perf_analyzer \
  -m model \
  -u localhost:8000 \
  --measurement-interval 5000 \
  --concurrency-range 1:32:4 \
  --latency-report-file latency.csv \
  --profile-export-file profile.json

# Analyze results
cat latency.csv

Getting Help

Collect Diagnostics

#!/bin/bash
# diagnostics.sh

echo "=== System Info ==="
uname -a
nvidia-smi

echo "=== Docker Info ==="
docker version
docker ps | grep triton

echo "=== Triton Logs ==="
docker logs <container_id> --tail 100

echo "=== Model Repository ==="
ls -R /path/to/models

echo "=== Server Health ==="
curl localhost:8000/v2/health/ready

echo "=== Metrics Sample ==="
curl localhost:8002/metrics | head -50

Community Resources

Reporting Issues

When reporting issues, include:

  1. Triton version: tritonserver --version
  2. Model config: config.pbtxt
  3. Server logs: docker logs <container_id>
  4. Client code: Minimal reproducible example
  5. Error messages: Full stack trace
  6. System info: GPU, OS, Docker version
