Quick Start Guide

Get started with Triton Inference Server by deploying your first model. This guide walks through a complete example using a simple PyTorch model.

Overview

In this quick start, you will:

  1. Create a simple PyTorch model
  2. Export it to ONNX format
  3. Set up the model repository
  4. Deploy with Triton
  5. Send inference requests

Step 1: Prepare Your Model

Create and Train a Simple Model

Create a file train_model.py:

import torch
import torch.nn as nn

# Define a simple neural network
class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc1 = nn.Linear(10, 20)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(20, 1)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

# Create and save the model
model = SimpleModel()
model.eval()

# Create dummy input for export
dummy_input = torch.randn(1, 10)

# Export to ONNX
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    export_params=True,
    opset_version=11,
    input_names=['INPUT__0'],
    output_names=['OUTPUT__0'],
    dynamic_axes={
        'INPUT__0': {0: 'batch_size'},
        'OUTPUT__0': {0: 'batch_size'}
    }
)

print("Model exported to model.onnx")

Run the script:

python train_model.py
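
Optionally, sanity-check the exported file before handing it to Triton. A minimal sketch, assuming the onnx package is installed (pip install onnx):

import onnx

# Load the exported file and run ONNX's structural checks
onnx_model = onnx.load("model.onnx")
onnx.checker.check_model(onnx_model)

# Confirm the input/output names Triton will see
print([i.name for i in onnx_model.graph.input])    # ['INPUT__0']
print([o.name for o in onnx_model.graph.output])   # ['OUTPUT__0']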

Step 2: Set Up Model Repository

Create Directory Structure

Triton requires a specific directory structure:

mkdir -p models/simple_onnx/1
mv model.onnx models/simple_onnx/1/

Directory structure:

models/
└── simple_onnx/
    ├── config.pbtxt
    └── 1/
        └── model.onnx

Create Model Configuration

Create models/simple_onnx/config.pbtxt:

name: "simple_onnx"
platform: "onnxruntime_onnx"
max_batch_size: 8

input [
  {
    name: "INPUT__0"
    data_type: TYPE_FP32
    dims: [ 10 ]
  }
]

output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [ 1 ]
  }
]

instance_group [
  {
    count: 1
    kind: KIND_GPU
  }
]

dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}

For CPU-only deployment, change kind: KIND_GPU to kind: KIND_CPU.

Step 3: Start Triton Server

Using Docker (GPU)

docker run --gpus all --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v $(pwd)/models:/models \
  nvcr.io/nvidia/tritonserver:24.10-py3 \
  tritonserver --model-repository=/models

Using Docker (CPU)

docker run --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v $(pwd)/models:/models \
  nvcr.io/nvidia/tritonserver:24.10-py3 \
  tritonserver --model-repository=/models

Expected Output

You should see output similar to:

I1019 10:30:00.123456 1 server.cc:626]
+------------------+---------+--------+
| Model            | Version | Status |
+------------------+---------+--------+
| simple_onnx      | 1       | READY  |
+------------------+---------+--------+

I1019 10:30:00.123456 1 grpc_server.cc:4819] Started GRPCInferenceService at 0.0.0.0:8001
I1019 10:30:00.123456 1 http_server.cc:3477] Started HTTPService at 0.0.0.0:8000
I1019 10:30:00.123456 1 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002

Step 4: Verify Server is Running

Check Server Health

curl -v localhost:8000/v2/health/ready

Response:

HTTP/1.1 200 OK

Check Model Status

curl localhost:8000/v2/models/simple_onnx

Response:

{
  "name": "simple_onnx",
  "versions": ["1"],
  "platform": "onnxruntime_onnx",
  "inputs": [
    {
      "name": "INPUT__0",
      "datatype": "FP32",
      "shape": [-1, 10]
    }
  ],
  "outputs": [
    {
      "name": "OUTPUT__0",
      "datatype": "FP32",
      "shape": [-1, 1]
    }
  ]
}
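
The same checks can be scripted with the Python client used in Step 5. A minimal sketch, assuming tritonclient[http] is installed:

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Server liveness and readiness
print("Server live: ", client.is_server_live())
print("Server ready:", client.is_server_ready())

# Readiness of a specific model
print("Model ready: ", client.is_model_ready("simple_onnx"))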

Step 5: Send Inference Requests

Install Client Library

pip install tritonclient[http]

Create Client Script

Create client.py:

import tritonclient.http as httpclient
import numpy as np

# Create client
triton_client = httpclient.InferenceServerClient(url="localhost:8000")

# Prepare input data
input_data = np.random.randn(1, 10).astype(np.float32)

# Create input object
inputs = httpclient.InferInput("INPUT__0", input_data.shape, "FP32")
inputs.set_data_from_numpy(input_data)

# Create output object
outputs = httpclient.InferRequestedOutput("OUTPUT__0")

# Send inference request
results = triton_client.infer(
    model_name="simple_onnx",
    inputs=[inputs],
    outputs=[outputs]
)

# Get prediction
output_data = results.as_numpy("OUTPUT__0")
print(f"Input: {input_data}")
print(f"Output: {output_data}")

Run the client:

python client.py

Expected output:

Input: [[0.1234 -0.5678 ... ]]
Output: [[0.4321]]
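
If you prefer gRPC (port 8001), the client code is nearly identical. A minimal sketch, assuming tritonclient[grpc] is installed:

import numpy as np
import tritonclient.grpc as grpcclient

# The gRPC endpoint is exposed on port 8001
triton_client = grpcclient.InferenceServerClient(url="localhost:8001")

input_data = np.random.randn(1, 10).astype(np.float32)
inputs = grpcclient.InferInput("INPUT__0", input_data.shape, "FP32")
inputs.set_data_from_numpy(input_data)
outputs = grpcclient.InferRequestedOutput("OUTPUT__0")

results = triton_client.infer(
    model_name="simple_onnx",
    inputs=[inputs],
    outputs=[outputs]
)
print(results.as_numpy("OUTPUT__0"))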

Step 6: Using cURL for HTTP Requests

Inference Request with cURL

curl -X POST localhost:8000/v2/models/simple_onnx/infer \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": [
      {
        "name": "INPUT__0",
        "shape": [1, 10],
        "datatype": "FP32",
        "data": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
      }
    ]
  }'

Response:

{
  "model_name": "simple_onnx",
  "model_version": "1",
  "outputs": [
    {
      "name": "OUTPUT__0",
      "datatype": "FP32",
      "shape": [1, 1],
      "data": [0.4321]
    }
  ]
}
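
The same request can be sent from plain Python without the Triton client library. A minimal sketch of the JSON payload shown above, assuming the requests package is installed:

import requests

payload = {
    "inputs": [
        {
            "name": "INPUT__0",
            "shape": [1, 10],
            "datatype": "FP32",
            "data": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
        }
    ]
}

# POST to the v2 inference endpoint
resp = requests.post("http://localhost:8000/v2/models/simple_onnx/infer", json=payload)
print(resp.json()["outputs"][0]["data"])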

Advanced Quick Start Examples

Example 1: TensorFlow SavedModel

# Directory structure
models/
└── tf_model/
    ├── config.pbtxt
    └── 1/
        └── model.savedmodel/

# config.pbtxt
name: "tf_model"
platform: "tensorflow_savedmodel"
max_batch_size: 8
input [...]
output [...]
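
For reference, a SavedModel in this layout can be produced roughly as follows. A minimal sketch, assuming TensorFlow 2.x with the Keras API; the layer sizes mirror the earlier ONNX example and are illustrative only:

import tensorflow as tf

# Illustrative Keras model with a 10-feature input
model = tf.keras.Sequential([
    tf.keras.layers.Dense(20, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(1),
])

# Write the SavedModel directory Triton expects at 1/model.savedmodel
tf.saved_model.save(model, "models/tf_model/1/model.savedmodel")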

Example 2: PyTorch TorchScript

# Export PyTorch model
import torch

model = SimpleModel()  # SimpleModel as defined in Step 1
model.eval()
scripted_model = torch.jit.script(model)
scripted_model.save("model.pt")

# Directory structure
models/
└── pytorch_model/
    ├── config.pbtxt
    └── 1/
        └── model.pt

# config.pbtxt
name: "pytorch_model"
platform: "pytorch_libtorch"
max_batch_size: 8
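
Before placing model.pt in the repository, you can confirm the TorchScript file loads and runs on its own:

import torch

# Reload the scripted model the same way LibTorch (and Triton) will
loaded = torch.jit.load("model.pt")
loaded.eval()

with torch.no_grad():
    print(loaded(torch.randn(1, 10)))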

Example 3: Python Backend (Custom Logic)

Create models/python_model/1/model.py (the model also needs a models/python_model/config.pbtxt that sets backend: "python" and declares the INPUT0/OUTPUT0 tensors, not shown here):

import triton_python_backend_utils as pb_utils
import numpy as np

class TritonPythonModel:
    def initialize(self, args):
        print("Initializing Python model")

    def execute(self, requests):
        responses = []
        for request in requests:
            # Get input tensor
            in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            input_data = in_0.as_numpy()

            # Process (example: multiply by 2)
            output_data = input_data * 2

            # Create output tensor
            out_tensor = pb_utils.Tensor("OUTPUT0", output_data)
            inference_response = pb_utils.InferenceResponse(
                output_tensors=[out_tensor]
            )
            responses.append(inference_response)

        return responses
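
Once the matching config.pbtxt is in place, the model is called like any other backend. A minimal client sketch, assuming a config that declares INPUT0 and OUTPUT0 as FP32 tensors with dims [ 4 ]; those names, shapes, and types are illustrative, not taken from the example above:

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Shape and dtype must match whatever the (not shown) config.pbtxt declares
data = np.array([[1.0, 2.0, 3.0, 4.0]], dtype=np.float32)

inp = httpclient.InferInput("INPUT0", data.shape, "FP32")
inp.set_data_from_numpy(data)

result = client.infer(model_name="python_model", inputs=[inp])
print(result.as_numpy("OUTPUT0"))  # the input multiplied by 2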

Performance Testing

Using perf_analyzer

The Triton SDK container ships with perf_analyzer for benchmarking:

docker run -it --rm --net=host \
  nvcr.io/nvidia/tritonserver:24.10-py3-sdk \
  perf_analyzer \
  -m simple_onnx \
  -u localhost:8000 \
  --concurrency-range 1:4 \
  --shape INPUT__0:10

Output:

Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 1234.5 infer/sec, latency 809 usec
Concurrency: 2, throughput: 2345.6 infer/sec, latency 852 usec
Concurrency: 4, throughput: 3456.7 infer/sec, latency 1157 usec

Monitoring

Check Server Metrics

curl localhost:8002/metrics

Key metrics:

  • nv_inference_request_success: Successful inference requests
  • nv_inference_request_failure: Failed inference requests
  • nv_inference_count: Total inferences performed
  • nv_inference_exec_count: Model execution count
  • nv_gpu_utilization: GPU utilization percentage
  • nv_gpu_memory_total_bytes: Total GPU memory
  • nv_gpu_memory_used_bytes: Used GPU memory
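
The endpoint returns Prometheus text format, so it is easy to filter from a script. A minimal sketch using the requests package:

import requests

# Fetch the Prometheus-format metrics exposed on port 8002
metrics = requests.get("http://localhost:8002/metrics").text

# Print only the inference counters listed above
for line in metrics.splitlines():
    if line.startswith("nv_inference_"):
        print(line)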

Troubleshooting

Model Not Loading

Issue: Model shows as "UNAVAILABLE"

Check logs:

docker logs <container_id>

Common causes:

  • Incorrect config.pbtxt syntax
  • Missing model files
  • Incompatible model version
  • Insufficient GPU memory

Cannot Connect to Server

Issue: Connection refused on port 8000

Solutions:

  1. Check if container is running: docker ps
  2. Verify port mapping: -p 8000:8000
  3. Check firewall rules
  4. Ensure no other service uses the port

Performance Issues

Issue: Low throughput or high latency

Optimizations:

  1. Enable dynamic batching
  2. Increase instance count
  3. Use GPU instead of CPU
  4. Optimize model (quantization, pruning)

Next Steps

Congratulations! You've deployed your first model with Triton. Next, explore: