
Model Repository

The model repository is a file-system-based store of the models that Triton makes available for inferencing. This guide covers how to organize, configure, and manage models in the repository.

Repository Structure

Basic Structure

model_repository/
├── model_1/
│   ├── config.pbtxt
│   ├── 1/
│   │   └── model.onnx
│   └── 2/
│       └── model.onnx
├── model_2/
│   ├── config.pbtxt
│   └── 1/
│       └── model.savedmodel/
└── model_3/
    ├── config.pbtxt
    └── 1/
        └── model.pt

Key Components

  1. Model Directory: Named after the model (e.g., resnet50)
  2. Version Directories: Numeric directories (1, 2, 3, etc.)
  3. Model Files: The actual model artifacts
  4. Configuration File: config.pbtxt (optional but recommended)
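
To make the layout concrete, the short script below creates this structure on disk for a single ONNX model. It is a minimal sketch: the repository path, source file, and model name are placeholders for your own artifacts.

from pathlib import Path
import shutil

# Hypothetical paths; substitute your own.
repo = Path("/models")                      # passed to tritonserver --model-repository
src = Path("/tmp/exported/resnet50.onnx")   # model artifact exported elsewhere

model_dir = repo / "resnet50"               # model directory, named after the model
version_dir = model_dir / "1"               # version directories are numeric
version_dir.mkdir(parents=True, exist_ok=True)

# Each backend expects a default filename (model.onnx for ONNX Runtime).
shutil.copy(src, version_dir / "model.onnx")

# config.pbtxt sits next to the version directories; it is optional for ONNX
# models because Triton can auto-complete the configuration.
(model_dir / "config.pbtxt").write_text('name: "resnet50"\nplatform: "onnxruntime_onnx"\n')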

Model Configuration

Minimal Configuration

For many models (ONNX, TensorFlow SavedModel, TensorRT), Triton can auto-complete the configuration, so a minimal config.pbtxt is often enough:

name: "my_model"
platform: "onnxruntime_onnx"

Complete Configuration Example

name: "resnet50"
platform: "onnxruntime_onnx"
backend: "onnxruntime"
max_batch_size: 8
default_model_filename: "model.onnx"

# Input specification
input [
  {
    name: "input_image"
    data_type: TYPE_FP32
    format: FORMAT_NCHW
    dims: [ 3, 224, 224 ]
  }
]

# Output specification
output [
  {
    name: "predictions"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]

# Instance configuration
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  }
]

# Batching configuration
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}

# Version policy
version_policy: { latest { num_versions: 2 }}

Platform Specifications

ONNX Runtime

name: "onnx_model"
platform: "onnxruntime_onnx"
default_model_filename: "model.onnx"

optimization {
  execution_accelerators {
    gpu_execution_accelerator : [ {
      name : "tensorrt"
      parameters { key: "precision_mode" value: "FP16" }
    }]
  }
}

TensorFlow SavedModel

name: "tf_model"
platform: "tensorflow_savedmodel"

# GPU memory fraction
parameters: {
  key: "gpu_memory_fraction"
  value: { string_value: "0.5" }
}

PyTorch TorchScript

name: "pytorch_model"
platform: "pytorch_libtorch"
max_batch_size: 0 # 0 disables Triton batching; the batch dimension is part of dims below

input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 1, 3, 224, 224 ]
  }
]

TensorRT Plan

name: "trt_model"
platform: "tensorrt_plan"
max_batch_size: 8

input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]

optimization {
  cuda {
    graphs: true
  }
}

Python Backend

name: "python_model"
backend: "python"
max_batch_size: 8

input [
  {
    name: "INPUT0"
    data_type: TYPE_FP32
    dims: [ -1 ]
  }
]

output [
  {
    name: "OUTPUT0"
    data_type: TYPE_FP32
    dims: [ -1 ]
  }
]

instance_group [
  {
    count: 2
    kind: KIND_CPU
  }
]
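
The configuration above assumes a model.py in the version directory, e.g. python_model/1/model.py. A minimal sketch of that file; the identity logic is purely illustrative:

import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # args contains the serialized model config, instance kind, etc.
        pass

    def execute(self, requests):
        responses = []
        for request in requests:
            # Read INPUT0 as a numpy array and echo it back as OUTPUT0.
            in0 = pb_utils.get_input_tensor_by_name(request, "INPUT0")
            out0 = pb_utils.Tensor("OUTPUT0", in0.as_numpy())
            responses.append(pb_utils.InferenceResponse(output_tensors=[out0]))
        return responses

    def finalize(self):
        pass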

Input and Output Configuration

Data Types

# Common data types
TYPE_BOOL
TYPE_UINT8
TYPE_UINT16
TYPE_UINT32
TYPE_UINT64
TYPE_INT8
TYPE_INT16
TYPE_INT32
TYPE_INT64
TYPE_FP16
TYPE_FP32
TYPE_FP64
TYPE_STRING

Dynamic Shapes

input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ -1, 3, -1, -1 ] # -1 indicates dynamic dimension
  }
]

Reshape

input [
  {
    name: "flat_input"
    data_type: TYPE_FP32
    dims: [ 784 ]
    reshape: { shape: [ 1, 28, 28 ] }
  }
]

Instance Configuration

Multiple GPU Instances

instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0, 1 ] # Use GPUs 0 and 1
  }
]

CPU and GPU Mix

instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  },
  {
    count: 2
    kind: KIND_CPU
  }
]

Rate Limiting

instance_group [
  {
    count: 1
    kind: KIND_GPU
    rate_limiter {
      resources [
        {
          name: "R1"
          count: 4
        }
      ]
    }
  }
]

Batching Strategies

Dynamic Batching

Automatically combines individual requests into batches:

dynamic_batching {
  # Preferred batch sizes (Triton will try to create these)
  preferred_batch_size: [ 4, 8 ]

  # Maximum time to wait before sending a batch
  max_queue_delay_microseconds: 100

  # Preserve ordering of requests
  preserve_ordering: false

  # Priority levels
  priority_levels: 3
  default_priority_level: 1

  # Queue policy
  default_queue_policy {
    timeout_action: REJECT
    default_timeout_microseconds: 10000
    allow_timeout_override: true
    max_queue_size: 10
  }
}
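
Dynamic batching only pays off when requests actually arrive concurrently. A rough client-side sketch (the model and tensor names come from the resnet50 example above and are assumptions) that issues several batch-1 requests in parallel so the server has something to merge:

import numpy as np
import tritonclient.http as httpclient

# concurrency controls how many HTTP connections async_infer may use in parallel.
client = httpclient.InferenceServerClient(url="localhost:8000", concurrency=8)

def make_inputs():
    data = np.random.rand(1, 3, 224, 224).astype(np.float32)
    inp = httpclient.InferInput("input_image", list(data.shape), "FP32")
    inp.set_data_from_numpy(data)
    return [inp]

# Fire several batch-1 requests at once; the server-side dynamic batcher can
# combine them into its preferred batch sizes (4 or 8 in the config above).
pending = [client.async_infer("resnet50", inputs=make_inputs()) for _ in range(32)]
results = [p.get_result() for p in pending]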

Sequence Batching

For stateful models (RNNs, transformers with state):

sequence_batching {
  max_sequence_idle_microseconds: 5000000

  control_input [
    {
      name: "START"
      control [
        {
          kind: CONTROL_SEQUENCE_START
          fp32_false_true: [ 0, 1 ]
        }
      ]
    },
    {
      name: "READY"
      control [
        {
          kind: CONTROL_SEQUENCE_READY
          fp32_false_true: [ 0, 1 ]
        }
      ]
    }
  ]

  state [
    {
      input_name: "state_in"
      output_name: "state_out"
      data_type: TYPE_FP32
      dims: [ 512 ]
      initial_state {
        data_type: TYPE_FP32
        dims: [ 512 ]
        zero_data: true
      }
    }
  ]
}
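
On the client side, the sequence batcher is driven by a sequence ID plus start and end flags; Triton injects the START and READY control tensors declared above and keeps the state tensors on the server between calls. A hedged sketch using the Python HTTP client, where the model name, input name, and shape are assumptions:

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

def send(value, seq_id, start=False, end=False):
    inp = httpclient.InferInput("input", [1, 512], "FP32")
    inp.set_data_from_numpy(np.full((1, 512), value, dtype=np.float32))
    # sequence_id and the start/end flags map onto the CONTROL_SEQUENCE_START
    # and CONTROL_SEQUENCE_READY control inputs declared in the config.
    return client.infer("stateful_model", inputs=[inp],
                        sequence_id=seq_id, sequence_start=start, sequence_end=end)

send(0.1, 42, start=True)   # first request of sequence 42
send(0.2, 42)               # intermediate request; server-side state is reused
send(0.3, 42, end=True)     # last request; the sequence slot is released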

Model Versioning

Version Policies

Latest N Versions:

version_policy: { latest { num_versions: 2 }}

All Versions:

version_policy: { all { }}

Specific Versions:

version_policy: {
  specific {
    versions: [1, 3, 5]
  }
}

Version Labels

Create labels.txt in model directory:

stable 2
canary 3
latest 3

Model Ensembles

Create pipelines by chaining multiple models:

name: "ensemble_model"
platform: "ensemble"
max_batch_size: 8

input [
  {
    name: "IMAGE"
    data_type: TYPE_UINT8
    dims: [ -1, -1, 3 ]
  }
]

output [
  {
    name: "CLASSIFICATION"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]

ensemble_scheduling {
  step [
    {
      model_name: "preprocessing"
      model_version: -1
      input_map {
        key: "RAW_IMAGE"
        value: "IMAGE"
      }
      output_map {
        key: "PREPROCESSED_IMAGE"
        value: "preprocessed_image"
      }
    },
    {
      model_name: "resnet50"
      model_version: -1
      input_map {
        key: "input"
        value: "preprocessed_image"
      }
      output_map {
        key: "output"
        value: "CLASSIFICATION"
      }
    }
  ]
}

Model Warmup

Warmup runs sample inference requests when a model loads, so that memory allocation and other lazy initialization happen before real traffic arrives:

model_warmup [
  {
    name: "warmup_sample"
    batch_size: 8
    inputs {
      key: "input"
      value: {
        data_type: TYPE_FP32
        dims: [ 3, 224, 224 ]
        zero_data: true
      }
    }
  }
]

Cloud Storage

AWS S3

tritonserver --model-repository=s3://bucket-name/model-repository

Set credentials:

export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_DEFAULT_REGION=us-west-2

Google Cloud Storage

tritonserver --model-repository=gs://bucket-name/model-repository

Set credentials:

export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json

Azure Blob Storage

tritonserver --model-repository=as://account-name/container/model-repository

Set credentials:

export AZURE_STORAGE_ACCOUNT=account_name
export AZURE_STORAGE_KEY=account_key

Model Control API

These endpoints manage models at runtime. The load and unload calls below require the server to be started with --model-control-mode=explicit.

Load Model

curl -X POST localhost:8000/v2/repository/models/my_model/load

Unload Model

curl -X POST localhost:8000/v2/repository/models/my_model/unload

List Repository Index

The index endpoint returns every model in the repository together with its current state and version:

curl -X POST localhost:8000/v2/repository/index

Get Model Metadata and Status

curl localhost:8000/v2/models/my_model
curl localhost:8000/v2/models/my_model/ready
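
The same operations are available through the Python client, which can be more convenient in deployment scripts. A brief sketch (assumes the server runs in explicit model control mode on localhost:8000):

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

client.load_model("my_model")                   # POST .../my_model/load
print(client.is_model_ready("my_model"))        # True once loading finishes
print(client.get_model_metadata("my_model"))    # inputs, outputs, versions
print(client.get_model_repository_index())      # every model and its state
client.unload_model("my_model")                 # POST .../my_model/unload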

Model Repository Polling

Enable automatic model updates:

tritonserver \
  --model-repository=/models \
  --model-control-mode=poll \
  --repository-poll-secs=30

Best Practices

Organization

  1. Version Deliberately: Version directories must be numeric, so track what each number corresponds to (use labels or release notes)
  2. Keep Old Versions: Retain at least 2-3 versions for rollback
  3. Document Changes: Include version notes in a README
  4. Test Before Deploy: Validate new versions in staging

Configuration

  1. Start Simple: Use auto-configuration first, then optimize
  2. Profile Your Model: Use perf_analyzer before production
  3. Monitor Resources: Track GPU memory and utilization
  4. Set Limits: Configure max_batch_size appropriately

Performance

  1. Enable Dynamic Batching: For most workloads
  2. Choose Right Instance Count: Balance throughput and latency
  3. Use TensorRT: For NVIDIA GPUs when possible
  4. Optimize Input Pipeline: Pre-process data efficiently

Security

  1. Validate Inputs: Check input shapes and types
  2. Resource Limits: Set memory and compute limits
  3. Access Control: Restrict repository access
  4. Audit Logs: Monitor model loading/unloading

Troubleshooting

Model Won't Load

Check logs:

docker logs <container_id> 2>&1 | grep -i error

Common issues:

  • Invalid config.pbtxt syntax
  • Missing model files
  • Incompatible platform/backend
  • Insufficient GPU memory

Version Not Found

Verify version policy in config.pbtxt and ensure version directory exists.

Performance Issues

  1. Check instance count
  2. Verify dynamic batching settings
  3. Review GPU utilization metrics
  4. Profile with perf_analyzer

Next Steps