
Model Repository

The model repository is a file-system-based store of the models that Triton makes available for inferencing. This guide covers how to organize, configure, and manage models in the repository.

Repository Structure

Basic Structure

model_repository/
├── model_1/
│   ├── config.pbtxt
│   ├── 1/
│   │   └── model.onnx
│   └── 2/
│       └── model.onnx
├── model_2/
│   ├── config.pbtxt
│   └── 1/
│       └── model.savedmodel/
└── model_3/
    ├── config.pbtxt
    └── 1/
        └── model.pt

Key Components

  1. Model Directory: Named after the model (e.g., resnet50)
  2. Version Directories: Numeric directories (1, 2, 3, etc.)
  3. Model Files: The actual model artifacts
  4. Configuration File: config.pbtxt (optional but recommended)
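
To make the layout concrete, the short script below creates this structure on disk for a single ONNX model. It is a minimal sketch: the repository path, source file, and model name are placeholders for your own artifacts.

from pathlib import Path
import shutil

# Hypothetical paths; substitute your own.
repo = Path("/models")                      # passed to tritonserver --model-repository
src = Path("/tmp/exported/resnet50.onnx")   # model artifact exported elsewhere

model_dir = repo / "resnet50"               # model directory, named after the model
version_dir = model_dir / "1"               # version directories are numeric
version_dir.mkdir(parents=True, exist_ok=True)

# Each backend expects a default filename (model.onnx for ONNX Runtime).
shutil.copy(src, version_dir / "model.onnx")

# config.pbtxt sits next to the version directories; it is optional for ONNX
# models because Triton can auto-complete the configuration.
(model_dir / "config.pbtxt").write_text('name: "resnet50"\nplatform: "onnxruntime_onnx"\n')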

Model Configuration

Minimal Configuration

For many models (ONNX, TensorFlow SavedModel, TensorRT), Triton can auto-complete the configuration, so a minimal config.pbtxt is often enough:

name: "my_model"
platform: "onnxruntime_onnx"

Complete Configuration Example

name: "resnet50"
platform: "onnxruntime_onnx"
backend: "onnxruntime"
max_batch_size: 8
default_model_filename: "model.onnx"

# Input specification
input [
  {
    name: "input_image"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [ 3, 224, 224 ]
  }
]

# Output specification
output [
  {
    name: "predictions"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]

# Instance configuration
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

# Batching configuration
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}

# Version policy
version_policy: { latest { num_versions: 2 }}

Platform Specifications

ONNX Runtime

name: "onnx_model"
platform: "onnxruntime_onnx"
default_model_filename: "model.onnx"

optimization {
  execution_accelerators {
    gpu_execution_accelerator : [ {
      name : "tensorrt"
      parameters { key: "precision_mode" value: "FP16" }
    }]
  }
}

TensorFlow SavedModel

name: "tf_model"
platform: "tensorflow_savedmodel"

# GPU memory fraction
parameters: {
  key: "gpu_memory_fraction"
  value: { string_value: "0.5" }
}

PyTorch TorchScript

name: "pytorch_model"
platform: "pytorch_libtorch"
max_batch_size: 0 # 0 disables Triton batching; the batch dimension is part of dims below

input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 1, 3, 224, 224 ]
  }
]

TensorRT Plan

name: "trt_model"
platform: "tensorrt_plan"
max_batch_size: 8

input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]

optimization {
  cuda {
    graphs: true
  }
}

Python Backend

name: "python_model"
backend: "python"
max_batch_size: 8

input [
  {
    name: "INPUT0"
    data_type: TYPE_FP32
    dims: [ -1 ]
  }
]

output [
  {
    name: "OUTPUT0"
    data_type: TYPE_FP32
    dims: [ -1 ]
  }
]

instance_group [
  {
    count: 2
    kind: KIND_CPU
  }
]
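
The configuration above assumes a model.py in the version directory, e.g. python_model/1/model.py. A minimal sketch of that file; the identity logic is purely illustrative:

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # args contains the serialized model config, instance kind, etc.
        pass

    def execute(self, requests):
        responses = []
        for request in requests:
            # Read INPUT0 as a numpy array and echo it back as OUTPUT0.
            in0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            out0 = pb_utils.Tensor("OUTPUT0", in0.as_numpy())
            responses.append(pb_utils.InferenceResponse(output_tensors=[out0]))
        return responses

    def finalize(self):
        pass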

Input and Output Configuration

Data Types

# Common data types
TYPE_BOOL
TYPE_UINT8
TYPE_UINT16
TYPE_UINT32
TYPE_UINT64
TYPE_INT8
TYPE_INT16
TYPE_INT32
TYPE_INT64
TYPE_FP16
TYPE_FP32
TYPE_FP64
TYPE_STRING

Dynamic Shapes

input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ -1, 3, -1, -1 ] # -1 indicates dynamic dimension
  }
]

Reshape

input [
  {
    name: "flat_input"
    data_type: TYPE_FP32
    dims: [ 784 ]
    reshape: { shape: [ 1, 28, 28 ] }
  }
]

Instance Configuration

Multiple GPU Instances

instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0, 1 ] # Use GPUs 0 and 1
  }
]

CPU and GPU Mix

instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  },
  {
    count: 2
    kind: KIND_CPU
  }
]

Rate Limiting

instance_group [
  {
    count: 1
    kind: KIND_GPU
    rate_limiter {
      resources [
        {
          name: "R1"
          count: 4
        }
      ]
    }
  }
]

Batching Strategies

Dynamic Batching

Automatically combines individual requests into batches:

dynamic_batching {
  # Preferred batch sizes (Triton will try to create these)
  preferred_batch_size: [ 4, 8 ]

  # Maximum time to wait before sending a batch
  max_queue_delay_microseconds: 100

  # Preserve ordering of requests
  preserve_ordering: false

  # Priority levels
  priority_levels: 3
  default_priority_level: 1

  # Queue policy
  default_queue_policy {
    timeout_action: REJECT
    default_timeout_microseconds: 10000
    allow_timeout_override: true
    max_queue_size: 10
  }
}
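
Dynamic batching only pays off when requests actually arrive concurrently. A rough client-side sketch (the model and tensor names come from the resnet50 example above and are assumptions) that issues several batch-1 requests in parallel so the server has something to merge:

import numpy as np
import tritonclient.http as httpclient

# concurrency controls how many HTTP connections async_infer may use in parallel.
client = httpclient.InferenceServerClient(url="localhost:8000", concurrency=8)

def make_inputs():
    data = np.random.rand(1, 3, 224, 224).astype(np.float32)
    inp = httpclient.InferInput("input_image", list(data.shape), "FP32")
    inp.set_data_from_numpy(data)
    return [inp]

# Fire several batch-1 requests at once; the server-side dynamic batcher can
# combine them into its preferred batch sizes (4 or 8 in the config above).
pending = [client.async_infer("resnet50", inputs=make_inputs()) for _ in range(32)]
results = [p.get_result() for p in pending]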

Sequence Batching

For stateful models (RNNs, transformers with state):

sequence_batching {
  max_sequence_idle_microseconds: 5000000

  control_input [
    {
      name: "START"
      control [
        {
          kind: CONTROL_SEQUENCE_START
          fp32_false_true: [ 0, 1 ]
        }
      ]
    },
    {
      name: "READY"
      control [
        {
          kind: CONTROL_SEQUENCE_READY
          fp32_false_true: [ 0, 1 ]
        }
      ]
    }
  ]

  state [
    {
      input_name: "state_in"
      output_name: "state_out"
      data_type: TYPE_FP32
      dims: [ 512 ]
      initial_state {
        data_type: TYPE_FP32
        dims: [ 512 ]
        zero_data: true
      }
    }
  ]
}
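
On the client side, the sequence batcher is driven by a sequence ID plus start and end flags; Triton injects the START and READY control tensors declared above and keeps the state tensors on the server between calls. A hedged sketch using the Python HTTP client, where the model name, input name, and shape are assumptions:

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

def send(value, seq_id, start=False, end=False):
    inp = httpclient.InferInput("input", [1, 512], "FP32")
    inp.set_data_from_numpy(np.full((1, 512), value, dtype=np.float32))
    # sequence_id and the start/end flags map onto the CONTROL_SEQUENCE_START
    # and CONTROL_SEQUENCE_READY control inputs declared in the config.
    return client.infer("stateful_model", inputs=[inp],
                        sequence_id=seq_id, sequence_start=start, sequence_end=end)

send(0.1, 42, start=True)   # first request of sequence 42
send(0.2, 42)               # intermediate request; server-side state is reused
send(0.3, 42, end=True)     # last request; the sequence slot is released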

Model Versioning

Version Policies

Latest N Versions:

version_policy: { latest { num_versions: 2 }}

All Versions:

version_policy: { all { }}

Specific Versions:

version_policy: {
  specific {
    versions: [1, 3, 5]
  }
}

Version Labels

Create labels.txt in model directory:

stable 2
canary 3
latest 3

Model Ensembles

Create pipelines by chaining multiple models:

name: "ensemble_model"
platform: "ensemble"
max_batch_size: 8

input [
  {
    name: "IMAGE"
    data_type: TYPE_UINT8
    dims: [ -1, -1, 3 ]
  }
]

output [
  {
    name: "CLASSIFICATION"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]

ensemble_scheduling {
  step [
    {
      model_name: "preprocessing"
      model_version: -1
      input_map {
        key: "RAW_IMAGE"
        value: "IMAGE"
      }
      output_map {
        key: "PREPROCESSED_IMAGE"
        value: "preprocessed_image"
      }
    },
    {
      model_name: "resnet50"
      model_version: -1
      input_map {
        key: "input"
        value: "preprocessed_image"
      }
      output_map {
        key: "output"
        value: "CLASSIFICATION"
      }
    }
  ]
}

Model Warmup

Warmup runs sample inference requests when a model loads, so that memory allocation and other lazy initialization happen before real traffic arrives:

model_warmup [
  {
    name: "warmup_sample"
    batch_size: 8
    inputs {
      key: "input"
      value: {
        data_type: TYPE_FP32
        dims: [ 3, 224, 224 ]
        zero_data: true
      }
    }
  }
]

Cloud Storage

AWS S3

tritonserver --model-repository=s3://bucket-name/model-repository

Set credentials:

export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_DEFAULT_REGION=us-west-2

Google Cloud Storage

tritonserver --model-repository=gs://bucket-name/model-repository

Set credentials:

export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json

Azure Blob Storage

tritonserver --model-repository=as://account-name/container/model-repository

Set credentials:

export AZURE_STORAGE_ACCOUNT=account_name
export AZURE_STORAGE_KEY=account_key

Model Control API

These endpoints manage models at runtime. The load and unload calls below require the server to be started with --model-control-mode=explicit.

Load Model

curl -X POST localhost:8000/v2/repository/models/my_model/load

Unload Model

curl -X POST localhost:8000/v2/repository/models/my_model/unload

List Repository Index

The index endpoint returns every model in the repository together with its current state and version:

curl -X POST localhost:8000/v2/repository/index

Get Model Metadata and Status

curl localhost:8000/v2/models/my_model
curl localhost:8000/v2/models/my_model/ready
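
The same operations are available through the Python client, which can be more convenient in deployment scripts. A brief sketch (assumes the server runs in explicit model control mode on localhost:8000):

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

client.load_model("my_model")                   # POST .../my_model/load
print(client.is_model_ready("my_model"))        # True once loading finishes
print(client.get_model_metadata("my_model"))    # inputs, outputs, versions
print(client.get_model_repository_index())      # every model and its state
client.unload_model("my_model")                 # POST .../my_model/unload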

Model Repository Polling

Enable automatic model updates:

tritonserver \
  --model-repository=/models \
  --model-control-mode=poll \
  --repository-poll-secs=30

Best Practices

Organization

  1. Version Deliberately: Version directories must be numeric, so track what each number corresponds to (use labels or release notes)
  2. Keep Old Versions: Retain at least 2-3 versions for rollback
  3. Document Changes: Include version notes in a README
  4. Test Before Deploy: Validate new versions in staging

Configuration

  1. Start Simple: Use auto-configuration first, then optimize
  2. Profile Your Model: Use perf_analyzer before production
  3. Monitor Resources: Track GPU memory and utilization
  4. Set Limits: Configure max_batch_size appropriately

Performance

  1. Enable Dynamic Batching: For most workloads
  2. Choose Right Instance Count: Balance throughput and latency
  3. Use TensorRT: For NVIDIA GPUs when possible
  4. Optimize Input Pipeline: Pre-process data efficiently

Security

  1. Validate Inputs: Check input shapes and types
  2. Resource Limits: Set memory and compute limits
  3. Access Control: Restrict repository access
  4. Audit Logs: Monitor model loading/unloading

Troubleshooting

Model Won't Load

Check logs:

docker logs <container_id> 2>&1 | grep -i error

Common issues:

  • Invalid config.pbtxt syntax
  • Missing model files
  • Incompatible platform/backend
  • Insufficient GPU memory

Version Not Found

Verify version policy in config.pbtxt and ensure version directory exists.

Performance Issues

  1. Check instance count
  2. Verify dynamic batching settings
  3. Review GPU utilization metrics
  4. Profile with perf_analyzer

Next Steps