Troubleshooting
Common issues and solutions when working with NVIDIA Triton Inference Server.
Server Issues
Server Won't Start
Symptom: Container exits immediately or won't start
Check logs:
docker logs <container_id>
Common causes and solutions:
- Invalid model repository path
# Error
E1019 10:00:00.000 server.cc:123] failed to stat '/models': No such file or directory
# Solution
# Verify path exists and is mounted correctly
ls -la /path/to/models
docker run -v /correct/path:/models ...
- Permission denied
# Error
E1019 10:00:00.000 server.cc:456] failed to load model: permission denied
# Solution
chmod -R 755 /path/to/models
chown -R $USER:$USER /path/to/models
- Port already in use
# Error
bind: address already in use
# Solution
# Find process using port
lsof -i :8000
# Kill process or use different port
docker run -p 8080:8000 ...
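To confirm from a script which of the default Triton ports (8000 HTTP, 8001 gRPC, 8002 metrics) are already taken, here is a minimal sketch using only the Python standard library:
# check_ports.py - report which default Triton ports are already in use locally
import socket

for port in (8000, 8001, 8002):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(1.0)
        in_use = s.connect_ex(("127.0.0.1", port)) == 0
    print(f"port {port}: {'in use' if in_use else 'free'}")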
Server Crashes Randomly
Symptom: Server exits unexpectedly
Debugging steps:
- Check OOM (Out of Memory)
# Check system logs
dmesg | grep -i "out of memory"
# Monitor memory usage
docker stats
# Solution: Increase memory limit
docker run --memory=16g ...
- GPU out of memory
# Check GPU memory
nvidia-smi
# Solution: Reduce instances or batch size
instance_group [
{ count: 1, kind: KIND_GPU }
]
- Enable core dumps
ulimit -c unlimited
docker run --ulimit core=-1 ...
Model Loading Issues
Model Shows as UNAVAILABLE
Symptom: Model status is UNAVAILABLE
Check model status:
curl localhost:8000/v2/models/my_model
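The same check can be scripted with the Python HTTP client; a minimal sketch (the model name my_model is a placeholder):
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
print("server ready:", client.is_server_ready())
print("model ready:", client.is_model_ready("my_model"))
# The metadata lists the exact input/output names, shapes, and datatypes Triton expects
print(client.get_model_metadata("my_model"))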
Common causes:
- Missing model files
# Verify structure
ls -R models/my_model/
# Should show:
# models/my_model/
# ├── config.pbtxt
# └── 1/
#     └── model.onnx
- Invalid config.pbtxt
# Test config syntax
cat models/my_model/config.pbtxt
# Common errors:
# - Missing closing brackets
# - Wrong platform name
# - Invalid data types
- Incompatible model format
# Error
E1019 10:00:00.000 model.cc:789] unable to load model: version not supported
# Solution: Check platform compatibility
platform: "onnxruntime_onnx" # For ONNX
platform: "tensorflow_savedmodel" # For TensorFlow
platform: "pytorch_libtorch" # For PyTorch
Model Takes Too Long to Load
Symptom: Server hangs during model loading
Solutions:
- Increase timeout
tritonserver \
--model-repository=/models \
--backend-config=tensorflow,allow_growth=true \
--exit-timeout-secs=60
- Load models explicitly (a client-side variant is sketched after this list)
tritonserver \
--model-repository=/models \
--model-control-mode=explicit \
--load-model=model1 \
--load-model=model2
- Use model warmup
model_warmup [
  {
    name: "warmup"
    batch_size: 1
    inputs {
      key: "input"
      value: {
        data_type: TYPE_FP32
        dims: [ 3, 224, 224 ]
        zero_data: true
      }
    }
  }
]
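If the server was started with --model-control-mode=explicit (second bullet above), models can also be loaded and unloaded on demand from a client rather than all at startup; a minimal sketch with the Python HTTP client:
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
# Only valid when the server runs in explicit model control mode
client.load_model("model1")
print("loaded:", client.is_model_ready("model1"))
client.unload_model("model1")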
Inference Issues
Inference Requests Fail
Symptom: Getting errors on inference requests
Common errors:
- Input/Output name mismatch
# Error
Invalid argument: unexpected inference input 'INPUT', expecting 'input'
# Solution: Use correct names from model metadata
curl localhost:8000/v2/models/my_model/config
- Shape mismatch
# Error
Invalid argument: unexpected shape for input 'input', expecting [3,224,224], got [224,224,3]
# Solution: Verify input shape
import numpy as np
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32) # Correct
# Not: (1, 224, 224, 3)
- Data type mismatch
# Error
Invalid argument: unexpected datatype TYPE_FP64 for input 'input', expecting TYPE_FP32
# Solution: Use correct dtype
input_data = input_data.astype(np.float32) # Not float64
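All three mismatches above can be avoided by taking the names, shapes, and datatypes from the model configuration instead of guessing. A minimal request sketch with the Python HTTP client; the model name my_model and the input/output names, FP32 datatype, and [3, 224, 224] shape are assumptions carried over from the examples above:
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Build the input exactly as the model config declares it
data = np.random.randn(1, 3, 224, 224).astype(np.float32)
infer_input = httpclient.InferInput("input", list(data.shape), "FP32")
infer_input.set_data_from_numpy(data)

result = client.infer("my_model", inputs=[infer_input])
print(result.as_numpy("output").shape)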
Slow Inference
Symptom: High latency or low throughput
Diagnosis:
# Use perf_analyzer
perf_analyzer -m my_model -u localhost:8000 --concurrency-range 1:8
# Check metrics
curl localhost:8002/metrics | grep nv_inference
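If perf_analyzer is not available where the client runs, a rough client-side timing loop can still show where latency sits; a sketch (model and tensor names mirror the earlier examples and are assumptions):
import time
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
data = np.random.randn(1, 3, 224, 224).astype(np.float32)
infer_input = httpclient.InferInput("input", list(data.shape), "FP32")
infer_input.set_data_from_numpy(data)

# Time repeated requests and report latency percentiles in milliseconds
latencies = []
for _ in range(50):
    start = time.perf_counter()
    client.infer("my_model", inputs=[infer_input])
    latencies.append((time.perf_counter() - start) * 1000.0)

for p in (50, 95, 99):
    print(f"p{p}: {np.percentile(latencies, p):.1f} ms")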
Common causes:
- Not using dynamic batching
# Solution: Enable dynamic batching
dynamic_batching {
preferred_batch_size: [ 4, 8 ]
max_queue_delay_microseconds: 100
}
- Too many instances
# Bad: Too many instances competing for GPU
instance_group [
{ count: 8, kind: KIND_GPU }
]
# Better: a smaller count sized to the GPU
instance_group [
{ count: 2, kind: KIND_GPU }
]
- CPU backend on GPU model
# Change from CPU to GPU
instance_group [
{ count: 1, kind: KIND_GPU } # Not KIND_CPU
]
GPU Issues
CUDA Errors
Symptom: CUDA-related errors in logs
- CUDA out of memory
# Error
CUDA error: out of memory
# Solution 1: Reduce batch size
max_batch_size: 4 # Instead of 32
# Solution 2: Reduce instances
instance_group [
{ count: 1, kind: KIND_GPU } # Instead of 4
]
# Solution 3: Limit GPU memory
parameters {
key: "gpu_memory_fraction"
value: { string_value: "0.5" }
}
- CUDA driver version mismatch
# Error
CUDA driver version is insufficient
# Solution: Update NVIDIA drivers
sudo apt-get update
sudo apt-get install --reinstall nvidia-driver-535
nvidia-smi # Verify
GPU Not Detected
Symptom: Triton doesn't detect the GPU
Debugging:
# Check GPU visibility
nvidia-smi
# Check Docker GPU access
docker run --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
# If fails, install nvidia-container-toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker
Network Issues
Cannot Connect to Server
Symptom: Connection refused or timeout
Debugging steps:
- Check server is running
docker ps | grep triton
- Check port binding
netstat -tulpn | grep 8000
- Test locally
# From host
curl localhost:8000/v2/health/ready
# From container
docker exec <container_id> curl localhost:8000/v2/health/ready
- Check firewall
# Ubuntu/Debian
sudo ufw status
sudo ufw allow 8000/tcp
# CentOS/RHEL
sudo firewall-cmd --add-port=8000/tcp --permanent
sudo firewall-cmd --reload
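When connectivity is intermittent (for example, the server is reachable but still loading models), polling readiness with retries is more informative than a single curl; a small sketch, assuming the default HTTP port:
import time
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
for attempt in range(30):
    try:
        if client.is_server_ready():
            print("server is ready")
            break
    except Exception as exc:  # connection refused, timeouts, etc.
        print(f"attempt {attempt}: not reachable yet ({exc})")
    time.sleep(2)
else:
    print("server never became ready")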
GRPC Connection Issues
Symptom: GRPC requests fail but HTTP works
Solutions:
- Check GRPC port
# Ensure port 8001 is exposed
docker run -p 8001:8001 ...
- Test GRPC connection
import tritonclient.grpc as grpcclient

try:
    client = grpcclient.InferenceServerClient("localhost:8001")
    print(client.is_server_live())
except Exception as e:
    print(f"Connection failed: {e}")
Performance Issues
Low GPU Utilization
Symptom: GPU usage < 50%
Diagnosis:
# Monitor GPU
nvidia-smi dmon -s u
# Check metrics
curl localhost:8002/metrics | grep nv_gpu_utilization
Solutions:
- Increase concurrency (a Python client sketch follows this list)
# Send more concurrent requests
perf_analyzer -m model --concurrency-range 8:16
- Enable dynamic batching
dynamic_batching {
preferred_batch_size: [ 8, 16 ]
}
- Increase instances
instance_group [
{ count: 2, kind: KIND_GPU }
]
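Besides perf_analyzer, concurrent load can be generated directly from Python with the HTTP client's async API; a rough sketch (model and tensor names are assumptions):
import numpy as np
import tritonclient.http as httpclient

# 'concurrency' controls how many HTTP connections the client keeps in flight
client = httpclient.InferenceServerClient(url="localhost:8000", concurrency=8)

data = np.random.randn(1, 3, 224, 224).astype(np.float32)
infer_input = httpclient.InferInput("input", list(data.shape), "FP32")
infer_input.set_data_from_numpy(data)

# Issue requests without waiting for each response, then collect the results
pending = [client.async_infer("my_model", inputs=[infer_input]) for _ in range(64)]
results = [req.get_result() for req in pending]
print("completed:", len(results))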
High Queue Time
Symptom: Requests spend too long in the queue
Check metrics:
curl localhost:8002/metrics | grep nv_inference_queue_duration_us
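These counters are cumulative, so a single scrape is hard to interpret; the useful number is the delta divided by the number of requests completed in the same window. A rough sketch that samples the metrics endpoint twice, assuming the default metrics port:
import time
import urllib.request

METRICS_URL = "http://localhost:8002/metrics"

def sample(prefix):
    # Sum every sample of the cumulative counter whose name starts with `prefix`
    text = urllib.request.urlopen(METRICS_URL).read().decode()
    return sum(float(line.rsplit(" ", 1)[1])
               for line in text.splitlines() if line.startswith(prefix))

queue0, count0 = sample("nv_inference_queue_duration_us"), sample("nv_inference_request_success")
time.sleep(10)  # observation window
queue1, count1 = sample("nv_inference_queue_duration_us"), sample("nv_inference_request_success")

completed = count1 - count0
if completed > 0:
    print(f"avg queue time: {(queue1 - queue0) / completed:.1f} us per request")
else:
    print("no requests completed during the window")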
Solutions:
- Increase instance count
instance_group [
{ count: 4, kind: KIND_GPU } # More instances
]
- Adjust batching parameters
dynamic_batching {
max_queue_delay_microseconds: 50 # Reduce delay
preferred_batch_size: [ 4, 8 ]
}
- Add more servers
# Scale horizontally
kubectl scale deployment triton-server --replicas=3
Client Issues
Python Client Import Error
Symptom: Cannot import tritonclient
Solution:
# Install client
pip install tritonclient[all]
# Or specific protocol
pip install tritonclient[http]
pip install tritonclient[grpc]
# Verify
python -c "import tritonclient.http as httpclient; print('Success')"
Timeout Errors
Symptom: Client requests timeout
Solutions:
- Increase client timeout
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(
    url="localhost:8000",
    connection_timeout=60.0,  # Increase timeout
    network_timeout=120.0
)
- Check server health
curl localhost:8000/v2/health/ready
- Reduce request size
# Break large batch into smaller requests
batch_size = 32 # Instead of 128
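A sketch of that chunking approach, splitting one oversized batch into several smaller requests (model name, tensor names, and shapes are assumptions):
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
big_batch = np.random.randn(128, 3, 224, 224).astype(np.float32)

outputs = []
for start in range(0, len(big_batch), 32):  # send 32 samples at a time instead of 128
    chunk = big_batch[start:start + 32]
    infer_input = httpclient.InferInput("input", list(chunk.shape), "FP32")
    infer_input.set_data_from_numpy(chunk)
    result = client.infer("my_model", inputs=[infer_input])
    outputs.append(result.as_numpy("output"))

combined = np.concatenate(outputs, axis=0)
print(combined.shape)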
Kubernetes Issues
Pod CrashLoopBackOff
Symptom: Triton pods keep restarting
Debug:
# Check pod status
kubectl get pods -l app=triton-server
# Check logs
kubectl logs <pod-name>
# Describe pod
kubectl describe pod <pod-name>
Common causes:
- Insufficient memory
resources:
  limits:
    memory: "16Gi"  # Increase
  requests:
    memory: "8Gi"
- Failed health check
readinessProbe:
  initialDelaySeconds: 60  # Increase if model loads slowly
  timeoutSeconds: 10
- GPU not available
# Check GPU availability
kubectl get nodes -o json | jq '.items[].status.allocatable'
# Install NVIDIA device plugin if missing
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/main/nvidia-device-plugin.yml
PersistentVolume Issues
Symptom: Cannot mount model repository
Solutions:
- Check PV/PVC status
kubectl get pv
kubectl get pvc
kubectl describe pvc triton-models
- Verify storage class
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: triton-models
spec:
  accessModes:
    - ReadWriteMany  # Important for multiple pods
  storageClassName: nfs-client
  resources:
    requests:
      storage: 100Gi
Debugging Tips
Enable Verbose Logging
tritonserver \
--model-repository=/models \
--log-verbose=1 \
--log-info=1
Inspect Model Configuration
# Get auto-generated config
curl localhost:8000/v2/models/my_model/config
# Compare with config.pbtxt
cat models/my_model/config.pbtxt
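The served configuration can also be fetched programmatically, which makes it easy to diff against the local config.pbtxt; a small sketch with the Python HTTP client:
import json
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
# The HTTP client returns the effective (auto-completed) configuration as a dict
served_config = client.get_model_config("my_model")
print(json.dumps(served_config, indent=2))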
Test with Simple Model
Create a minimal test model to isolate issues:
# create_test_model.py
import torch
import torch.nn as nn
class SimpleModel(nn.Module):
    def forward(self, x):
        return x * 2

model = SimpleModel()
model.eval()
dummy_input = torch.randn(1, 10)
# Name the tensors to match config.pbtxt and keep the batch dimension dynamic
torch.onnx.export(model, dummy_input, "test_model.onnx",
                  input_names=["input"], output_names=["output"],
                  dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}})
# Create repository
mkdir -p models/test_model/1
mv test_model.onnx models/test_model/1/model.onnx
# Minimal config
cat > models/test_model/config.pbtxt << EOF
name: "test_model"
platform: "onnxruntime_onnx"
max_batch_size: 1
input [{ name: "input", data_type: TYPE_FP32, dims: [10] }]
output [{ name: "output", data_type: TYPE_FP32, dims: [10] }]
EOF
# Test
tritonserver --model-repository=./models
Performance Profiling
# Profile with perf_analyzer
perf_analyzer \
-m model \
-u localhost:8000 \
--measurement-interval 5000 \
--concurrency-range 1:32:4 \
--latency-report-file latency.csv \
--profile-export-file profile.json
# Analyze results
cat latency.csv
Getting Help
Collect Diagnostics
#!/bin/bash
# diagnostics.sh
echo "=== System Info ==="
uname -a
nvidia-smi
echo "=== Docker Info ==="
docker version
docker ps | grep triton
echo "=== Triton Logs ==="
docker logs <container_id> --tail 100
echo "=== Model Repository ==="
ls -R /path/to/models
echo "=== Server Health ==="
curl localhost:8000/v2/health/ready
echo "=== Metrics Sample ==="
curl localhost:8002/metrics | head -50
Community Resources
- GitHub Issues: https://github.com/triton-inference-server/server/issues
- NVIDIA Forums: https://forums.developer.nvidia.com/c/ai/triton-inference-server
- Documentation: https://docs.nvidia.com/deeplearning/triton-inference-server/
Reporting Issues
When reporting issues, include:
- Triton version: tritonserver --version
- Model config: config.pbtxt
- Server logs: docker logs <container_id>
- Client code: Minimal reproducible example
- Error messages: Full stack trace
- System info: GPU, OS, Docker version
Next Steps
- Review Best Practices for production deployments
- Check Performance Optimization for tuning
- Refer to Deployment for advanced setups