Quick Start Guide
Get started with Triton Inference Server by deploying your first model. This guide walks through a complete example using a simple PyTorch model.
Overview
In this quick start, you will:
- Create a simple PyTorch model
- Export it to ONNX format
- Set up the model repository
- Deploy with Triton
- Send inference requests
Step 1: Prepare Your Model
Create and Train a Simple Model
Create a file train_model.py:
import torch
import torch.nn as nn

# Define a simple neural network
class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.fc1 = nn.Linear(10, 20)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(20, 1)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

# Create and save the model
model = SimpleModel()
model.eval()

# Create dummy input for export
dummy_input = torch.randn(1, 10)

# Export to ONNX
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    export_params=True,
    opset_version=11,
    input_names=['INPUT__0'],
    output_names=['OUTPUT__0'],
    dynamic_axes={
        'INPUT__0': {0: 'batch_size'},
        'OUTPUT__0': {0: 'batch_size'}
    }
)

print("Model exported to model.onnx")
Run the script:
python train_model.py
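Optionally, you can sanity-check the exported file before handing it to Triton by running it once with ONNX Runtime. This is a minimal sketch, not part of the original steps, and it assumes onnxruntime is installed (for example via pip install onnxruntime):

import numpy as np
import onnxruntime as ort

# Load the exported model and run a single local inference
session = ort.InferenceSession("model.onnx")
input_name = session.get_inputs()[0].name  # expected to be 'INPUT__0'
dummy = np.random.randn(1, 10).astype(np.float32)
result = session.run(None, {input_name: dummy})
print("Local ONNX Runtime output shape:", result[0].shape)  # expected (1, 1)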
Step 2: Set Up Model Repository
Create Directory Structure
Triton requires a specific directory structure:
mkdir -p models/simple_onnx/1
mv model.onnx models/simple_onnx/1/
Directory structure:
models/
└── simple_onnx/
    ├── config.pbtxt
    └── 1/
        └── model.onnx
Create Model Configuration
Create models/simple_onnx/config.pbtxt:
name: "simple_onnx"
platform: "onnxruntime_onnx"
max_batch_size: 8
input [
{
name: "INPUT__0"
data_type: TYPE_FP32
dims: [ 10 ]
}
]
output [
{
name: "OUTPUT__0"
data_type: TYPE_FP32
dims: [ 1 ]
}
]
instance_group [
{
count: 1
kind: KIND_GPU
}
]
dynamic_batching {
preferred_batch_size: [ 4, 8 ]
max_queue_delay_microseconds: 100
}
For CPU-only deployment, change kind: KIND_GPU to kind: KIND_CPU.
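For example, a CPU-only instance_group could look like the following sketch (the count of 2 is only an illustrative value, not a requirement):

instance_group [
  {
    count: 2
    kind: KIND_CPU
  }
]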
Step 3: Start Triton Server
Using Docker (GPU)
docker run --gpus all --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v $(pwd)/models:/models \
  nvcr.io/nvidia/tritonserver:24.10-py3 \
  tritonserver --model-repository=/models
Using Docker (CPU)
docker run --rm -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v $(pwd)/models:/models \
  nvcr.io/nvidia/tritonserver:24.10-py3 \
  tritonserver --model-repository=/models
Expected Output
You should see output similar to:
I1019 10:30:00.123456 1 server.cc:626]
+------------------+---------+--------+
| Model | Version | Status |
+------------------+---------+--------+
| simple_onnx | 1 | READY |
+------------------+---------+--------+
I1019 10:30:00.123456 1 grpc_server.cc:4819] Started GRPCInferenceService at 0.0.0.0:8001
I1019 10:30:00.123456 1 http_server.cc:3477] Started HTTPService at 0.0.0.0:8000
I1019 10:30:00.123456 1 http_server.cc:187] Started Metrics Service at 0.0.0.0:8002
Step 4: Verify Server is Running
Check Server Health
curl -v localhost:8000/v2/health/ready
Response:
HTTP/1.1 200 OK
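The server also exposes a liveness endpoint and a server metadata endpoint that you can query the same way:

curl -v localhost:8000/v2/health/live
curl localhost:8000/v2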
Check Model Status
curl localhost:8000/v2/models/simple_onnx
Response:
{
  "name": "simple_onnx",
  "versions": ["1"],
  "platform": "onnxruntime_onnx",
  "inputs": [
    {
      "name": "INPUT__0",
      "datatype": "FP32",
      "shape": [-1, 10]
    }
  ],
  "outputs": [
    {
      "name": "OUTPUT__0",
      "datatype": "FP32",
      "shape": [-1, 1]
    }
  ]
}
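To inspect the full configuration Triton loaded (including instance groups and dynamic batching settings), query the model config endpoint:

curl localhost:8000/v2/models/simple_onnx/config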
Step 5: Send Inference Requests
Install Client Library
pip install tritonclient[http]
Create Client Script
Create client.py:
import tritonclient.http as httpclient
import numpy as np
# Create client
triton_client = httpclient.InferenceServerClient(url="localhost:8000")
# Prepare input data
input_data = np.random.randn(1, 10).astype(np.float32)
# Create input object
inputs = httpclient.InferInput("INPUT__0", input_data.shape, "FP32")
inputs.set_data_from_numpy(input_data)
# Create output object
outputs = httpclient.InferRequestedOutput("OUTPUT__0")
# Send inference request
results = triton_client.infer(
    model_name="simple_onnx",
    inputs=[inputs],
    outputs=[outputs]
)
# Get prediction
output_data = results.as_numpy("OUTPUT__0")
print(f"Input: {input_data}")
print(f"Output: {output_data}")
Run the client:
python client.py
Expected output:
Input: [[0.1234 -0.5678 ... ]]
Output: [[0.4321]]
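The same request can also be sent over gRPC (port 8001) using the gRPC flavor of the client. This is a minimal sketch that assumes you installed the client with pip install tritonclient[grpc]:

import numpy as np
import tritonclient.grpc as grpcclient

# Connect to Triton's gRPC endpoint
triton_client = grpcclient.InferenceServerClient(url="localhost:8001")

# Build the request exactly as in the HTTP example
input_data = np.random.randn(1, 10).astype(np.float32)
inputs = grpcclient.InferInput("INPUT__0", list(input_data.shape), "FP32")
inputs.set_data_from_numpy(input_data)
outputs = grpcclient.InferRequestedOutput("OUTPUT__0")

results = triton_client.infer(
    model_name="simple_onnx",
    inputs=[inputs],
    outputs=[outputs]
)
print(results.as_numpy("OUTPUT__0"))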
Step 6: Using cURL for HTTP Requests
Inference Request with cURL
curl -X POST localhost:8000/v2/models/simple_onnx/infer \
  -H "Content-Type: application/json" \
  -d '{
    "inputs": [
      {
        "name": "INPUT__0",
        "shape": [1, 10],
        "datatype": "FP32",
        "data": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
      }
    ]
  }'
Response:
{
  "model_name": "simple_onnx",
  "model_version": "1",
  "outputs": [
    {
      "name": "OUTPUT__0",
      "datatype": "FP32",
      "shape": [1, 1],
      "data": [0.4321]
    }
  ]
}
Advanced Quick Start Examples
Example 1: TensorFlow SavedModel
# Directory structure
models/
└── tf_model/
    ├── config.pbtxt
    └── 1/
        └── model.savedmodel/
# config.pbtxt
name: "tf_model"
platform: "tensorflow_savedmodel"
max_batch_size: 8
input [...]
output [...]
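The input and output names in a TensorFlow config.pbtxt must match the tensor names in the SavedModel signature. If you are unsure what they are, TensorFlow's saved_model_cli can list them (assuming TensorFlow is installed):

saved_model_cli show --dir models/tf_model/1/model.savedmodel --tag_set serve --signature_def serving_default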
Example 2: PyTorch TorchScript
# Export PyTorch model
import torch
model = SimpleModel()
model.eval()
scripted_model = torch.jit.script(model)
scripted_model.save("model.pt")
# Directory structure
models/
└── pytorch_model/
    ├── config.pbtxt
    └── 1/
        └── model.pt
# config.pbtxt
name: "pytorch_model"
platform: "pytorch_libtorch"
max_batch_size: 8
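TorchScript files do not carry tensor names, so the PyTorch backend relies on the INPUT__0/OUTPUT__0 naming convention. A sketch of the remaining input/output sections, mirroring the ONNX example above:

input [
  {
    name: "INPUT__0"
    data_type: TYPE_FP32
    dims: [ 10 ]
  }
]
output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [ 1 ]
  }
]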
Example 3: Python Backend (Custom Logic)
Create models/python_model/1/model.py:
import triton_python_backend_utils as pb_utils
import numpy as np

class TritonPythonModel:
    def initialize(self, args):
        print("Initializing Python model")

    def execute(self, requests):
        responses = []
        for request in requests:
            # Get input tensor
            in_0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            input_data = in_0.as_numpy()
            # Process (example: multiply by 2)
            output_data = input_data * 2
            # Create output tensor
            out_tensor = pb_utils.Tensor("OUTPUT0", output_data)
            inference_response = pb_utils.InferenceResponse(
                output_tensors=[out_tensor]
            )
            responses.append(inference_response)
        return responses
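The Python backend also needs a models/python_model/config.pbtxt. A minimal sketch matching the INPUT0/OUTPUT0 names used above (the dims value of 10 is only an illustrative choice):

name: "python_model"
backend: "python"
max_batch_size: 8
input [
  {
    name: "INPUT0"
    data_type: TYPE_FP32
    dims: [ 10 ]
  }
]
output [
  {
    name: "OUTPUT0"
    data_type: TYPE_FP32
    dims: [ 10 ]
  }
]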
Performance Testing
Using perf_analyzer
Triton provides perf_analyzer for benchmarking:
docker run -it --rm --net=host \
  nvcr.io/nvidia/tritonserver:24.10-py3-sdk \
  perf_analyzer \
    -m simple_onnx \
    -u localhost:8000 \
    --concurrency-range 1:4 \
    --shape INPUT__0:10
Because the model has max_batch_size set, the --shape value excludes the batch dimension, so the fixed [10] input is given as INPUT__0:10.
Output:
Inferences/Second vs. Client Average Batch Latency
Concurrency: 1, throughput: 1234.5 infer/sec, latency 809 usec
Concurrency: 2, throughput: 2345.6 infer/sec, latency 852 usec
Concurrency: 4, throughput: 3456.7 infer/sec, latency 1157 usec
Monitoring
Check Server Metrics
curl localhost:8002/metrics
Key metrics:
- nv_inference_request_success: Successful inference requests
- nv_inference_request_failure: Failed inference requests
- nv_inference_count: Total inferences performed
- nv_inference_exec_count: Model execution count
- nv_gpu_utilization: GPU utilization percentage
- nv_gpu_memory_total_bytes: Total GPU memory
- nv_gpu_memory_used_bytes: Used GPU memory
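Metrics are exposed in Prometheus text format, so you can filter for a single metric on the command line:

curl -s localhost:8002/metrics | grep nv_inference_request_success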
Troubleshooting
Model Not Loading
Issue: Model shows as "UNAVAILABLE"
Check logs:
docker logs <container_id>
Common causes:
- Incorrect config.pbtxt syntax
- Missing model files
- Incompatible model version
- Insufficient GPU memory
Cannot Connect to Server
Issue: Connection refused on port 8000
Solutions:
- Check if the container is running: docker ps
- Verify the port mapping: -p 8000:8000
- Check firewall rules
- Ensure no other service uses the port
Performance Issues
Issue: Low throughput or high latency
Optimizations:
- Enable dynamic batching (see the sketch after this list)
- Increase the instance count
- Use GPU instead of CPU
- Optimize model (quantization, pruning)
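The first two optimizations map directly onto config.pbtxt. A sketch that modifies the earlier configuration; the instance count and batch sizes here are illustrative starting points, not tuned values:

instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}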
Next Steps
Congratulations! You've deployed your first model with Triton. Next, explore:
- Model Repository - Advanced model configuration
- Deployment - Production deployment strategies
- Performance Optimization - Improve throughput and latency
- Best Practices - Production-ready patterns