Quick Start Guide
Get started with Triton Inference Server by deploying your first model. This guide walks through a complete example using a simple PyTorch model.
Overview
In this quick start, you will:
- Create a simple PyTorch model
- Export it to ONNX format
- Set up the model repository
- Deploy with Triton
- Send inference requests
Step 1: Prepare Your Model
Create and Train a Simple Model
Create a file train_model.py:
import torch
import torch.nn as nn

# Define a simple neural network
class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc1 = nn.Linear(10, 20)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(20, 1)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

# Create and save the model
model = SimpleModel()
model.eval()

# Create dummy input for export
dummy_input = torch.randn(1, 10)

# Export to ONNX
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    export_params=True,
    opset_version=11,
    input_names=['INPUT__0'],
    output_names=['OUTPUT__0'],
    dynamic_axes={
        'INPUT__0': {0: 'batch_size'},
        'OUTPUT__0': {0: 'batch_size'}
    }
)

print("Model exported to model.onnx")
Run the script:
python train_model.py
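Optionally, you can sanity-check the exported file before handing it to Triton by running it once with ONNX Runtime. This is a minimal sketch, not part of the original steps, and it assumes onnxruntime is installed (for example via pip install onnxruntime):

import numpy as np
import onnxruntime as ort

# Load the exported model and run a single local inference
session = ort.InferenceSession("model.onnx")
input_name = session.get_inputs()[0].name  # expected to be 'INPUT__0'
dummy = np.random.randn(1, 10).astype(np.float32)
result = session.run(None, {input_name: dummy})
print("Local ONNX Runtime output shape:", result[0].shape)  # expected (1, 1)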
Step 2: Set Up Model Repository
Create Directory Structure
Triton requires a specific directory structure:
mkdir -p models/simple_onnx/1
mv model.onnx models/simple_onnx/1/
Directory structure:
models/
└── simple_onnx/
    ├── config.pbtxt
    └── 1/
        └── model.onnx
Create Model Configuration
Create models/simple_onnx/config.pbtxt:
name: "simple_onnx"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
{
name: "INPUT__0"
data_type: TYPE_FP32
dims: [ 10 ]
}
]
output [
{
name: "OUTPUT__0"
data_type: TYPE_FP32
dims: [ 1 ]
}
]
instance_group [
{
count: 1
kind: KIND_GPU
}
]
dynamic_batching {
preferred_batch_size: [ 4, 8 ]
max_queue_delay_microseconds: 100
}
For CPU-only deployment, change kind: KIND_GPU to kind: KIND_CPU.
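For example, a CPU-only instance_group could look like the following sketch (the count of 2 is only an illustrative value, not a requirement):

instance_group [
  {
    count: 2
    kind: KIND_CPU
  }
]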
Step 3: Start Triton Server
Using Docker (GPU)
docker run --gpus all --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v $(pwd)/models:/models \
  nvcr.io/nvidia/tritonserver:24.10-py3 \
  tritonserver --model-repository=/models
Using Docker (CPU)
docker run --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v $(pwd)/models:/models \
  nvcr.io/nvidia/tritonserver:24.10-py3 \
  tritonserver --model-repository=/models
Expected Output
You should see output similar to:
I1019 10:30:00.123456 1 server.cc:626]
+------------------+---------+--------+
| Model | Version | Status |
+------------------+---------+--------+
| simple_onnx | 1 | READY |
+------------------+---------+--------+
I1019 10:30:00.123456 1 grpc_server.cc:4819] Started GRPCInferenceService at 0.0.0.0:8001
I1019 10:30:00.123456 1 http_server.cc:3477] Started HTTPService at 0.0.0.0:8000
I1019 10:30:00.123456 1 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002
Step 4: Verify Server is Running
Check Server Health
curl -v localhost:8000/v2/health/ready
Response:
HTTP/1.1 200 OK
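The server also exposes a liveness endpoint and a server metadata endpoint that you can query the same way:

curl -v localhost:8000/v2/health/live
curl localhost:8000/v2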
Check Model Status
curl localhost:8000/v2/models/simple_onnx
Response:
{
  "name": "simple_onnx",
  "versions": ["1"],
  "platform": "onnxruntime_onnx",
  "inputs": [
    {
      "name": "INPUT__0",
      "datatype": "FP32",
      "shape": [-1, 10]
    }
  ],
  "outputs": [
    {
      "name": "OUTPUT__0",
      "datatype": "FP32",
      "shape": [-1, 1]
    }
  ]
}
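To inspect the full configuration Triton loaded (including instance groups and dynamic batching settings), query the model config endpoint:

curl localhost:8000/v2/models/simple_onnx/config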
Step 5: Send Inference Requests
Install Client Library
pip install tritonclient[http]
Create Client Script
Create client.py:
import tritonclient.http as httpclient
import numpy as np
# Create client
triton_client = httpclient.InferenceServerClient(url="localhost:8000")
# Prepare input data
input_data = np.random.randn(1, 10).astype(np.float32)
# Create input object
inputs = httpclient.InferInput("INPUT__0", input_data.shape, "FP32")
inputs.set_data_from_numpy(input_data)
# Create output object
outputs = httpclient.InferRequestedOutput("OUTPUT__0")
# Send inference request
results = triton_client.infer(
    model_name="simple_onnx",
    inputs=[inputs],
    outputs=[outputs]
)
# Get prediction
output_data = results.as_numpy("OUTPUT__0")
print(f"Input: {input_data}")
print(f"Output: {output_data}")
Run the client:
python client.py
Expected output:
Input: [[0.1234 -0.5678 ... ]]
Output: [[0.4321]]
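The same request can also be sent over gRPC (port 8001) using the gRPC flavor of the client. This is a minimal sketch that assumes you installed the client with pip install tritonclient[grpc]:

import numpy as np
import tritonclient.grpc as grpcclient

# Connect to Triton's gRPC endpoint
triton_client = grpcclient.InferenceServerClient(url="localhost:8001")

# Build the request exactly as in the HTTP example
input_data = np.random.randn(1, 10).astype(np.float32)
inputs = grpcclient.InferInput("INPUT__0", list(input_data.shape), "FP32")
inputs.set_data_from_numpy(input_data)
outputs = grpcclient.InferRequestedOutput("OUTPUT__0")

results = triton_client.infer(
    model_name="simple_onnx",
    inputs=[inputs],
    outputs=[outputs]
)
print(results.as_numpy("OUTPUT__0"))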
Step 6: Using cURL for HTTP Requests
Inference Request with cURL
curl -X POST localhost:8000/v2/models/simple_onnx/infer \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": [
      {
        "name": "INPUT__0",
        "shape": [1, 10],
        "datatype": "FP32",
        "data": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
      }
    ]
  }'
Response:
{
  "model_name": "simple_onnx",
  "model_version": "1",
  "outputs": [
    {
      "name": "OUTPUT__0",
      "datatype": "FP32",
      "shape": [1, 1],
      "data": [0.4321]
    }
  ]
}
Advanced Quick Start Examples
Example 1: TensorFlow SavedModel
# Directory structure
models/
└── tf_model/
    ├── config.pbtxt
    └── 1/
        └── model.savedmodel/
# config.pbtxt
name: "tf_model"
platform: "tensorflow_savedmodel"
max_batch_size: 8
input [...]
output [...]
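The input and output names in a TensorFlow config.pbtxt must match the tensor names in the SavedModel signature. If you are unsure what they are, TensorFlow's saved_model_cli can list them (assuming TensorFlow is installed):

saved_model_cli show --dir models/tf_model/1/model.savedmodel --tag_set serve --signature_def serving_default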
Example 2: PyTorch TorchScript
# Export PyTorch model
import torch
model = SimpleModel()
model.eval()
scripted_model = torch.jit.script(model)
scripted_model.save("model.pt")
# Directory structure
models/
└── pytorch_model/
    ├── config.pbtxt
    └── 1/
        └── model.pt
# config.pbtxt
name: "pytorch_model"
platform: "pytorch_libtorch"
max_batch_size: 8
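TorchScript files do not carry tensor names, so the PyTorch backend relies on the INPUT__0/OUTPUT__0 naming convention. A sketch of the remaining input/output sections, mirroring the ONNX example above:

input [
  {
    name: "INPUT__0"
    data_type: TYPE_FP32
    dims: [ 10 ]
  }
]
output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [ 1 ]
  }
]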
Example 3: Python Backend (Custom Logic)
Create models/python_model/1/model.py:
import triton_python_backend_utils as pb_utils
import numpy as np

class TritonPythonModel:
    def initialize(self, args):
        print("Initializing Python model")

    def execute(self, requests):
        responses = []
        for request in requests:
            # Get input tensor
            in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            input_data = in_0.as_numpy()
            # Process (example: multiply by 2)
            output_data = input_data * 2
            # Create output tensor
            out_tensor = pb_utils.Tensor("OUTPUT0", output_data)
            inference_response = pb_utils.InferenceResponse(
                output_tensors=[out_tensor]
            )
            responses.append(inference_response)
        return responses
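The Python backend also needs a models/python_model/config.pbtxt. A minimal sketch matching the INPUT0/OUTPUT0 names used above (the dims value of 10 is only an illustrative choice):

name: "python_model"
backend: "python"
max_batch_size: 8
input [
  {
    name: "INPUT0"
    data_type: TYPE_FP32
    dims: [ 10 ]
  }
]
output [
  {
    name: "OUTPUT0"
    data_type: TYPE_FP32
    dims: [ 10 ]
  }
]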
Performance Testing
Using perf_analyzer
Triton provides perf_analyzer for benchmarking:
docker run -it --rm --net=host \
  nvcr.io/nvidia/tritonserver:24.10-py3-sdk \
  perf_analyzer \
    -m simple_onnx \
    -u localhost:8000 \
    --concurrency-range 1:4 \
    --shape INPUT__0:10
Because the model has max_batch_size set, the --shape value excludes the batch dimension, so the fixed [10] input is given as INPUT__0:10.
Output:
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 1234.5 infer/sec, latency 809 usec
Concurrency: 2, throughput: 2345.6 infer/sec, latency 852 usec
Concurrency: 4, throughput: 3456.7 infer/sec, latency 1157 usec
Monitoring
Check Server Metrics
curl localhost:8002/metrics
Key metrics:
- nv_inference_request_success: Successful inference requests
- nv_inference_request_failure: Failed inference requests
- nv_inference_count: Total inferences performed
- nv_inference_exec_count: Model execution count
- nv_gpu_utilization: GPU utilization percentage
- nv_gpu_memory_total_bytes: Total GPU memory
- nv_gpu_memory_used_bytes: Used GPU memory
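Metrics are exposed in Prometheus text format, so you can filter for a single metric on the command line:

curl -s localhost:8002/metrics | grep nv_inference_request_success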
Troubleshooting
Model Not Loading
Issue: Model shows as "UNAVAILABLE"
Check logs:
docker logs <container_id>
Common causes:
- Incorrect config.pbtxt syntax
- Missing model files
- Incompatible model version
- Insufficient GPU memory
Cannot Connect to Server
Issue: Connection refused on port 8000
Solutions:
- Check if the container is running: docker ps
- Verify the port mapping: -p 8000:8000
- Check firewall rules
- Ensure no other service uses the port
Performance Issues
Issue: Low throughput or high latency
Optimizations:
- Enable dynamic batching (see the sketch after this list)
- Increase the instance count
- Use GPU instead of CPU
- Optimize model (quantization, pruning)
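The first two optimizations map directly onto config.pbtxt. A sketch that modifies the earlier configuration; the instance count and batch sizes here are illustrative starting points, not tuned values:

instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}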
Next Steps
Congratulations! You've deployed your first model with Triton. Next, explore:
- Model Repository - Advanced model configuration
- Deployment - Production deployment strategies
- Performance Optimization - Improve throughput and latency
- Best Practices - Production-ready patterns