Model Repository
The model repository is a file-system based repository of the models that Triton will make available for inferencing. This guide covers how to organize, configure, and manage models in the repository.
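For example, the server is pointed at the repository root at startup; the path below is illustrative, and the flag may be repeated to serve models from several repositories:
tritonserver --model-repository=/path/to/model_repository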
Repository Structure
Basic Structure
model_repository/
├── model_1/
│   ├── config.pbtxt
│   ├── 1/
│   │   └── model.onnx
│   └── 2/
│       └── model.onnx
├── model_2/
│   ├── config.pbtxt
│   └── 1/
│       └── model.savedmodel/
└── model_3/
    ├── config.pbtxt
    └── 1/
        └── model.pt
Key Components
- Model Directory: Named after the model (e.g., resnet50)
- Version Directories: Numeric directories (1, 2, 3, etc.)
- Model Files: The actual model artifacts
- Configuration File: config.pbtxt (optional but recommended)
Model Configuration
Minimal Configuration
For backends that support auto-complete (for example TensorRT, TensorFlow SavedModel, and ONNX Runtime), Triton can derive most settings from the model itself, so a minimal config.pbtxt is often enough:
name: "my_model"
platform: "onnxruntime_onnx"
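Once the model is loaded, the configuration Triton actually derived (including auto-completed fields) can be inspected over HTTP; the model name below is illustrative:
curl localhost:8000/v2/models/my_model/config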
Complete Configuration Example
name: "resnet50"
platform: "onnxruntime_onnx"
backend: "onnxruntime"
max_batch_size: 8
default_model_filename: "model.onnx"
# Input specification
input [
  {
    name: "input_image"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [ 3, 224, 224 ]
  }
]
# Output specification
output [
  {
    name: "predictions"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
# Instance configuration
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]
# Batching configuration
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
# Version policy
version_policy: { latest { num_versions: 2 }}
Platform Specifications
ONNX Runtime
name: "onnx_model"
platform: "onnxruntime_onnx"
default_model_filename: "model.onnx"
optimization {
  execution_accelerators {
    gpu_execution_accelerator : [
      {
        name : "tensorrt"
        parameters { key: "precision_mode" value: "FP16" }
      }
    ]
  }
}
TensorFlow SavedModel
name: "tf_model"
platform: "tensorflow_savedmodel"
# GPU memory fraction
parameters: {
key: "gpu_memory_fraction"
value: { string_value: "0.5" }
}
PyTorch TorchScript
name: "pytorch_model"
platform: "pytorch_libtorch"
max_batch_size: 0 # 0 disables Triton-managed batching; dims must include the batch dimension
input [
{
name: "input"
data_type: TYPE_FP32
dims: [ 1, 3, 224, 224 ]
}
]
TensorRT Plan
name: "trt_model"
platform: "tensorrt_plan"
max_batch_size: 8
input [
{
name: "input"
data_type: TYPE_FP32
dims: [ 3, 224, 224 ]
}
]
optimization {
cuda {
graphs: true
}
}
Python Backend
name: "python_model"
backend: "python"
max_batch_size: 8
input [
{
name: "INPUT0"
data_type: TYPE_FP32
dims: [ -1 ]
}
]
output [
{
name: "OUTPUT0"
data_type: TYPE_FP32
dims: [ -1 ]
}
]
instance_group [
{
count: 2
kind: KIND_CPU
}
]
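The config above assumes a 1/model.py implementing the Python backend interface. A minimal sketch is shown below; the identity transform is purely illustrative, and error handling is omitted:
import numpy as np
import triton_python_backend_utils as pb_utils

class TritonPythonModel:
    def execute(self, requests):
        # Each call receives a batch of requests; return one response per request.
        responses = []
        for request in requests:
            input0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            # Illustrative "computation": pass the input through unchanged.
            output0 = pb_utils.Tensor("OUTPUT0", input0.as_numpy().astype(np.float32))
            responses.append(pb_utils.InferenceResponse(output_tensors=[output0]))
        return responses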
Input and Output Configuration
Data Types
# Common data types
TYPE_BOOL
TYPE_UINT8
TYPE_UINT16
TYPE_UINT32
TYPE_UINT64
TYPE_INT8
TYPE_INT16
TYPE_INT32
TYPE_INT64
TYPE_FP16
TYPE_FP32
TYPE_FP64
TYPE_STRING
Dynamic Shapes
input [
{
name: "input"
data_type: TYPE_FP32
dims: [ -1, 3, -1, -1 ] # -1 indicates dynamic dimension
}
]
Reshape
input [
{
name: "flat_input"
data_type: TYPE_FP32
dims: [ 784 ]
reshape: { shape: [ 1, 28, 28 ] }
}
]
Instance Configuration
Multiple GPU Instances
instance_group [
{
count: 2
kind: KIND_GPU
gpus: [ 0, 1 ] # Use GPUs 0 and 1
}
]
CPU and GPU Mix
instance_group [
{
count: 1
kind: KIND_GPU
gpus: [ 0 ]
},
{
count: 2
kind: KIND_CPU
}
]
Rate Limiting
instance_group [
  {
    count: 1
    kind: KIND_GPU
    rate_limiter {
      resources [
        {
          name: "R1"
          count: 4
        }
      ]
    }
  }
]
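Rate limiting only takes effect when it is enabled on the server; as far as I can tell this is done with the --rate-limit flag (treat the exact flag and value as an assumption to verify against your Triton version):
tritonserver --model-repository=/models --rate-limit=execution_count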
Batching Strategies
Dynamic Batching
Automatically combines individual requests into batches:
dynamic_batching {
  # Preferred batch sizes (Triton will try to create these)
  preferred_batch_size: [ 4, 8 ]
  # Maximum time to wait before sending a batch
  max_queue_delay_microseconds: 100
  # Preserve ordering of requests
  preserve_ordering: false
  # Priority levels
  priority_levels: 3
  default_priority_level: 1
  # Queue policy
  default_queue_policy {
    timeout_action: REJECT
    default_timeout_microseconds: 10000
    allow_timeout_override: true
    max_queue_size: 10
  }
}
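Dynamic batching is transparent to clients; the only requirement is that several requests are in flight at once. A sketch using the tritonclient Python package (the model and tensor names reuse the resnet50 example above and are illustrative):
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
# Issue several requests concurrently so Triton has something to batch.
pending = []
for _ in range(8):
    inp = httpclient.InferInput("input_image", [1, 3, 224, 224], "FP32")
    inp.set_data_from_numpy(np.random.rand(1, 3, 224, 224).astype(np.float32))
    pending.append(client.async_infer("resnet50", inputs=[inp]))
results = [p.get_result().as_numpy("predictions") for p in pending]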
Sequence Batching
For stateful models (RNNs, transformers with state):
sequence_batching {
  max_sequence_idle_microseconds: 5000000
  control_input [
    {
      name: "START"
      control [
        {
          kind: CONTROL_SEQUENCE_START
          fp32_false_true: [ 0, 1 ]
        }
      ]
    },
    {
      name: "READY"
      control [
        {
          kind: CONTROL_SEQUENCE_READY
          fp32_false_true: [ 0, 1 ]
        }
      ]
    }
  ]
  state [
    {
      input_name: "state_in"
      output_name: "state_out"
      data_type: TYPE_FP32
      dims: [ 512 ]
      initial_state {
        data_type: TYPE_FP32
        dims: [ 512 ]
        zero_data: true
      }
    }
  ]
}
Model Versioning
Version Policies
Latest N Versions:
version_policy: { latest { num_versions: 2 }}
All Versions:
version_policy: { all { }}
Specific Versions:
version_policy: {
specific {
versions: [1, 3, 5]
}
}
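Rolling out a new version is just adding another numeric directory; Triton then serves it according to the configured policy (the default is latest with one version). The paths below are illustrative:
mkdir model_repository/resnet50/4
cp /tmp/new_model.onnx model_repository/resnet50/4/model.onnx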
Version Labels
Create labels.txt in model directory:
stable 2
canary 3
latest 3
Model Ensembles
Create pipelines by chaining multiple models:
name: "ensemble_model"
platform: "ensemble"
max_batch_size: 8
input [
{
name: "IMAGE"
data_type: TYPE_UINT8
dims: [ -1, -1, 3 ]
}
]
output [
{
name: "CLASSIFICATION"
data_type: TYPE_FP32
dims: [ 1000 ]
}
]
ensemble_scheduling {
  step [
    {
      model_name: "preprocessing"
      model_version: -1  # -1 selects the latest available version
      input_map {
        key: "RAW_IMAGE"
        value: "IMAGE"
      }
      output_map {
        key: "PREPROCESSED_IMAGE"
        value: "preprocessed_image"
      }
    },
    {
      model_name: "resnet50"
      model_version: -1
      input_map {
        key: "input"
        value: "preprocessed_image"
      }
      output_map {
        key: "output"
        value: "CLASSIFICATION"
      }
    }
  ]
}
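An ensemble is itself an entry in the repository: it needs its own config.pbtxt and a numeric version directory, which is typically empty because there is no model artifact to load. A sketch:
model_repository/
├── ensemble_model/
│   ├── config.pbtxt
│   └── 1/            # empty; the ensemble has no model file
├── preprocessing/
│   ├── config.pbtxt
│   └── 1/
└── resnet50/
    ├── config.pbtxt
    └── 1/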
Model Warmup
Run sample requests when the model loads so that one-time initialization costs (memory allocation, lazy kernel compilation) are not paid by the first real inference requests:
model_warmup [
  {
    name: "warmup_sample"
    batch_size: 8
    inputs {
      key: "input"
      value: {
        data_type: TYPE_FP32
        dims: [ 3, 224, 224 ]
        zero_data: true
      }
    }
  }
]
Cloud Storage
AWS S3
tritonserver --model-repository=s3://bucket-name/model-repository
Set credentials:
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_DEFAULT_REGION=us-west-2
Google Cloud Storage
tritonserver --model-repository=gs://bucket-name/model-repository
Set credentials:
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
Azure Blob Storage
tritonserver --model-repository=as://account-name/container/model-repository
Set credentials:
export AZURE_STORAGE_ACCOUNT=account_name
export AZURE_STORAGE_KEY=account_key
Model Control API
The load and unload endpoints below are available only when the server is started with --model-control-mode=explicit.
Load Model
curl -X POST localhost:8000/v2/repository/models/my_model/load
Unload Model
curl -X POST localhost:8000/v2/repository/models/my_model/unload
List Repository Index
Returns every model in the repository and its current state:
curl -X POST localhost:8000/v2/repository/index
Get Model Metadata
curl localhost:8000/v2/models/my_model
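The same operations are available through the tritonclient Python package (pip install tritonclient[http]); the sketch below assumes the server runs on localhost:8000 in explicit model control mode, and the model name is illustrative:
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
client.load_model("my_model")             # POST /v2/repository/models/my_model/load
print(client.is_model_ready("my_model"))  # True once the model is serving
client.unload_model("my_model")           # POST /v2/repository/models/my_model/unload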
Model Repository Polling
Enable automatic model updates:
tritonserver \
--model-repository=/models \
--model-control-mode=poll \
--repository-poll-secs=30
Best Practices
Organization
- Version Deliberately: Version directories must be numeric, so record what each number means (stable, canary) using labels or release notes
- Keep Old Versions: Retain at least 2-3 versions for rollback
- Document Changes: Include version notes in a README
- Test Before Deploy: Validate new versions in staging
Configuration
- Start Simple: Use auto-configuration first, then optimize
- Profile Your Model: Use perf_analyzer before production
- Monitor Resources: Track GPU memory and utilization
- Set Limits: Configure max_batch_size appropriately
Performance
- Enable Dynamic Batching: For most workloads
- Choose Right Instance Count: Balance throughput and latency
- Use TensorRT: For NVIDIA GPUs when possible
- Optimize Input Pipeline: Pre-process data efficiently
Security
- Validate Inputs: Check input shapes and types
- Resource Limits: Set memory and compute limits
- Access Control: Restrict repository access
- Audit Logs: Monitor model loading/unloading
Troubleshooting
Model Won't Load
Check logs:
docker logs <container_id> 2>&1 | grep -i error
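If the error is not obvious from the standard log, verbose logging usually shows why a model or version directory was skipped:
tritonserver --model-repository=/models --log-verbose=1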
Common issues:
- Invalid config.pbtxt syntax
- Missing model files
- Incompatible platform/backend
- Insufficient GPU memory
Version Not Found
Verify version policy in config.pbtxt and ensure version directory exists.
Performance Issues
- Check instance count
- Verify dynamic batching settings
- Review GPU utilization metrics
- Profile with perf_analyzer
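As a starting point for profiling, perf_analyzer (shipped in the Triton SDK container) can sweep request concurrency against a served model; the model name and ranges below are illustrative:
perf_analyzer -m resnet50 -b 8 --concurrency-range 1:8 --percentile=95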
Next Steps
- Deployment - Deploy Triton in production
- Performance Optimization - Tune for best performance
- Best Practices - Production-ready configurations