Installation

This guide covers different ways to install and run NVIDIA Triton Inference Server in your environment.

Prerequisites

Hardware Requirements

Minimum:

  • CPU: 4 cores
  • RAM: 8 GB
  • Disk: 10 GB free space

Recommended for GPU:

  • NVIDIA GPU with Compute Capability 6.0+ (Pascal or newer)
  • NVIDIA Driver: 450.80.02+ for CUDA 11.0+
  • 16 GB RAM
  • 50 GB free disk space

Software Requirements

  • Docker 19.03+ (for containerized deployment)
  • NVIDIA Container Toolkit (for GPU support)
  • Kubernetes 1.19+ (for K8s deployment)
  • Python 3.8+ (for client libraries)
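
Before installing, you can quickly confirm that the tooling listed above is present. A minimal check, assuming a typical setup (skip the commands that do not apply to your deployment method):

# Verify prerequisite tooling; kubectl is only needed for Kubernetes,
# and nvidia-smi only for GPU deployments
docker --version
nvidia-smi
python3 --version
kubectl version --client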

Installation Methods

Method 1: Docker (Recommended)

The easiest way to get started with Triton is to use the pre-built Docker images from NVIDIA NGC.

Pull the Triton Image

For GPU environments:

docker pull nvcr.io/nvidia/tritonserver:24.10-py3

For CPU-only environments:

docker pull nvcr.io/nvidia/tritonserver:24.10-py3-min

Verify Installation

Check the server version:

docker run --rm nvcr.io/nvidia/tritonserver:24.10-py3 tritonserver --version

Expected output:

tritonserver 2.50.0

Method 2: Build from Source

For custom requirements or development, build from source.

Clone the Repository

git clone https://github.com/triton-inference-server/server.git
cd server

Build Using Docker

# For a GPU build with all features, backends, and endpoints
python3 build.py --enable-all --backend=all

# For a CPU-only build, do not use --enable-all (it enables GPU support);
# enable only the endpoints and backends you need, for example:
python3 build.py --endpoint=http --endpoint=grpc --backend=onnxruntime

This process can take 1-2 hours depending on your system.
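
When the build finishes, build.py produces a local Docker image, by default tagged tritonserver (an assumption to adjust if your build options name it differently). It can then be run like the pre-built NGC image:

docker images | grep tritonserver
docker run --rm tritonserver tritonserver --version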

Method 3: Kubernetes Deployment

Deploy Triton on Kubernetes using Helm.

Add the Helm Repository

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

Install Triton

helm install triton-inference-server nvidia/triton-inference-server \
--set image.tag=24.10-py3 \
--set modelRepositoryPath=/models

Verify the Deployment

kubectl get pods -l app=triton-inference-server
kubectl logs -f <triton-pod-name>
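
Once the pod is running, you can port-forward the service and hit the readiness endpoint. The service name below is an assumption based on the release name used above; confirm it with kubectl get svc:

kubectl port-forward svc/triton-inference-server 8000:8000 &
curl -v localhost:8000/v2/health/ready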

Method 4: Cloud Platforms

AWS SageMaker

Deploy to a SageMaker endpoint using the SageMaker Python SDK:

from sagemaker.model import Model

# SageMaker pulls images from Amazon ECR, so use a SageMaker-compatible
# Triton image URI for your region rather than the nvcr.io image.
triton_model = Model(
    image_uri="<sagemaker-tritonserver-image-uri>",
    model_data="s3://your-bucket/model-repository/",
    role="your-sagemaker-role",
)

predictor = triton_model.deploy(
    instance_type="ml.g4dn.xlarge",
    initial_instance_count=1,
)

Google Cloud Platform (GCP)

Deploy on GKE:

gcloud container clusters create triton-cluster \
--machine-type n1-standard-4 \
--num-nodes 2 \
--accelerator type=nvidia-tesla-t4,count=1

kubectl apply -f triton-deployment.yaml

Azure

Deploy on AKS with GPU:

az aks create \
--resource-group myResourceGroup \
--name tritonCluster \
--node-count 2 \
--node-vm-size Standard_NC6 \
--generate-ssh-keys

kubectl apply -f triton-deployment.yaml
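
Both the GKE and AKS examples apply a triton-deployment.yaml manifest that is not shown above. A minimal sketch follows; the names, image tag, and GPU count are assumptions to adapt, and you still need to mount or bake in a model repository at /models:

cat > triton-deployment.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-inference-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: triton-inference-server
  template:
    metadata:
      labels:
        app: triton-inference-server
    spec:
      containers:
      - name: triton
        image: nvcr.io/nvidia/tritonserver:24.10-py3
        command: ["tritonserver", "--model-repository=/models"]
        ports:
        - containerPort: 8000   # HTTP
        - containerPort: 8001   # gRPC
        - containerPort: 8002   # metrics
        resources:
          limits:
            nvidia.com/gpu: 1
EOF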

Install NVIDIA Container Toolkit (for GPU)

Required for running Triton with GPU support on Docker.

Ubuntu/Debian

# Add NVIDIA package repository
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
sudo tee /etc/apt/sources.list.d/nvidia-docker.list

# Install nvidia-container-toolkit
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# Restart Docker
sudo systemctl restart docker

CentOS/RHEL

distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | \
sudo tee /etc/yum.repos.d/nvidia-docker.repo

sudo yum install -y nvidia-container-toolkit
sudo systemctl restart docker

Verify GPU Access

docker run --rm --gpus all nvcr.io/nvidia/tritonserver:24.10-py3 nvidia-smi

Install Client Libraries

Install the Python client library used to send requests to Triton.

Using pip

pip install tritonclient[all]

Or install specific protocols:

# HTTP only
pip install tritonclient[http]

# GRPC only
pip install tritonclient[grpc]

Verify Client Installation

import tritonclient.http as httpclient

# This should not raise any errors
print("Triton HTTP client installed successfully")
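
With a Triton server running locally (see Quick Verification below), the client can also be exercised end to end; the URL assumes the default HTTP port 8000:

python3 - <<'EOF'
import tritonclient.http as httpclient

# Connect to a locally running server and query its health endpoints
client = httpclient.InferenceServerClient(url="localhost:8000")
print("server live: ", client.is_server_live())
print("server ready:", client.is_server_ready())
EOF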

Quick Verification

Test that everything is working correctly.

1. Create a Test Directory

mkdir -p /tmp/triton-test/models

2. Start Triton Server

docker run --gpus all --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 \
-v /tmp/triton-test/models:/models \
nvcr.io/nvidia/tritonserver:24.10-py3 \
tritonserver --model-repository=/models

3. Check Server Status

In another terminal:

curl -v localhost:8000/v2/health/ready

Expected response:

HTTP/1.1 200 OK
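
Beyond the readiness probe, the other default endpoints can be spot-checked as well (ports 8000 and 8002 match the run command above):

curl localhost:8000/v2              # server metadata (name, version, extensions)
curl localhost:8002/metrics | head  # Prometheus-format metrics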

Environment Variables

Configure Triton using environment variables:

Variable                        Description                       Default
TRITON_MODEL_REPOSITORY         Path to model repository          /models
TRITON_LOG_VERBOSE              Enable verbose logging            0
TRITON_MIN_COMPUTE_CAPABILITY   Minimum GPU compute capability    6.0
TRITON_SERVER_THREAD_COUNT      Number of server threads          Auto
CUDA_VISIBLE_DEVICES            GPUs visible to Triton            All

Example:

docker run --gpus all -e TRITON_LOG_VERBOSE=1 \
-v /models:/models \
nvcr.io/nvidia/tritonserver:24.10-py3 \
tritonserver --model-repository=/models

Common Installation Issues

Issue: CUDA Driver Version Mismatch

Error:

CUDA driver version is insufficient for CUDA runtime version

Solution: Update your NVIDIA drivers:

# Ubuntu
sudo apt-get update
sudo apt-get install --reinstall nvidia-driver-535

# Verify
nvidia-smi

Issue: Permission Denied on Model Repository

Error:

failed to load model: permission denied

Solution: Fix directory permissions:

chmod -R 755 /path/to/models

Issue: Out of GPU Memory

Error:

out of memory

Solution:

  • Limit GPU memory per model in the model configuration (see the config.pbtxt sketch below)
  • Reduce the model instance count
  • Use smaller batch sizes
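
As an illustration, the instance count and maximum batch size are set in the model's config.pbtxt; the field names below are standard Triton model configuration, while the values are examples to tune rather than recommendations:

# <model-repository>/<model-name>/config.pbtxt (excerpt)
max_batch_size: 8        # smaller batches reduce activation memory
instance_group [
  {
    count: 1             # fewer execution instances per GPU
    kind: KIND_GPU
  }
]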

Next Steps

Now that Triton is installed, you can: