
Best Practices

Follow these best practices to build robust and scalable ML workflows with Kubeflow.

1. Pipeline Organization

Modular Components

  • Keep components small and focused on single responsibilities
  • Make components reusable across different pipelines
  • Use typed inputs and outputs for better validation (see the sketch after this list)
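
For example, a small, single-responsibility component with typed inputs and outputs might look like the following sketch; the component name, the pandas-based scaling logic, and the file names are illustrative, not part of any specific Kubeflow pipeline:

# typed_component.py (illustrative sketch)
from kfp.v2.dsl import component, Input, Output, Dataset

@component(base_image='python:3.9', packages_to_install=['pandas'])
def normalize_features(
    raw_data: Input[Dataset],          # typed input: an upstream dataset artifact
    normalized_data: Output[Dataset],  # typed output: type-checked when the pipeline is compiled
):
    """Single-responsibility component: scale numeric columns to [0, 1]."""
    import pandas as pd

    df = pd.read_csv(raw_data.path)
    numeric = df.select_dtypes('number')
    df[numeric.columns] = (numeric - numeric.min()) / (numeric.max() - numeric.min())
    df.to_csv(normalized_data.path, index=False)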

Version Control

  • Store pipeline definitions in Git
  • Tag pipeline versions for reproducibility
  • Document pipeline changes in commit messages

Testing

  • Test components individually before integration (see the sketch after this list)
  • Use small datasets for pipeline testing
  • Validate outputs at each stage
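
One way to test components individually, for example, is to keep the transformation logic in a plain Python function and unit-test it against a tiny, hand-written dataset before wiring it into a pipeline component; the function name and test values below are illustrative:

# test_scaling.py (illustrative sketch, run with pytest)
import pandas as pd

def scale_to_unit_range(df: pd.DataFrame) -> pd.DataFrame:
    """Core transformation kept as a plain function so it can be tested directly."""
    numeric = df.select_dtypes('number')
    out = df.copy()
    out[numeric.columns] = (numeric - numeric.min()) / (numeric.max() - numeric.min())
    return out

def test_scale_to_unit_range_on_small_dataset():
    # A tiny hand-written dataset keeps the test fast and deterministic
    df = pd.DataFrame({'age': [18, 40, 100], 'name': ['a', 'b', 'c']})
    scaled = scale_to_unit_range(df)
    assert scaled['age'].min() == 0.0
    assert scaled['age'].max() == 1.0
    assert list(scaled['name']) == ['a', 'b', 'c']  # non-numeric columns untouched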

2. Resource Management

GPU Allocation

# training-job-gpu.yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: mnist-training
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: tensorflow
              image: tensorflow/tensorflow:latest-gpu
              resources:
                limits:
                  nvidia.com/gpu: 1
                requests:
                  nvidia.com/gpu: 1

Resource Quotas

# resource-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ml-workloads-quota
  namespace: ml-training
spec:
  hard:
    requests.cpu: "50"
    requests.memory: 100Gi
    requests.nvidia.com/gpu: "8"
    limits.cpu: "100"
    limits.memory: 200Gi
    limits.nvidia.com/gpu: "8"

3. Security Best Practices

Use RBAC

# ml-engineer-role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ml-engineer
  namespace: ml-platform
rules:
  - apiGroups: ["kubeflow.org"]
    resources: ["notebooks", "experiments", "trials"]
    verbs: ["get", "list", "create", "update", "delete"]
  - apiGroups: ["serving.kserve.io"]
    resources: ["inferenceservices"]
    verbs: ["get", "list", "create", "update"]

Secure Secrets

# database-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: db-credentials
  namespace: ml-platform
type: Opaque
stringData:
  username: ml-user
  password: secure-password
  connection-string: postgresql://ml-user:secure-password@postgres:5432/mldata

4. Model Versioning

Track model versions and lineage:

# model_versioning.py
from kfp.v2 import dsl
from kfp.v2.dsl import Input, Model

@dsl.component(packages_to_install=['mlflow'])
def register_model_version(
    model: Input[Model],
    model_name: str,
    version: str,
    metadata: dict,
):
    """Register a model version in the model registry."""
    # Imports inside the function so they are available when the
    # component runs in its own container
    from datetime import datetime

    import mlflow

    mlflow.set_tracking_uri("http://mlflow-server:5000")

    # Register the model artifact; tags capture version metadata for lineage tracking
    mlflow.register_model(
        model.uri,
        model_name,
        tags={
            'version': version,
            'timestamp': datetime.now().isoformat(),
            **metadata,
        },
    )

    print(f"Model registered: {model_name} version {version}")

5. Data Validation

Implement data validation in pipelines:

# data_validation.py
from kfp.v2.dsl import component, Input, Output, Dataset

@component(
    base_image='python:3.9',
    # Pin a release that still ships the legacy from_pandas / PandasDataset API used below
    packages_to_install=['great-expectations<0.18', 'pandas'],
)
def validate_data(
    input_dataset: Input[Dataset],
    validation_result: Output[Dataset],
):
    """Validate data quality using Great Expectations."""
    import json

    import great_expectations as ge
    import pandas as pd

    # Load data
    df = pd.read_csv(input_dataset.path)

    # Wrap the DataFrame so expectations can be attached to it
    ge_df = ge.from_pandas(df)

    # Define expectations
    ge_df.expect_column_values_to_not_be_null('customer_id')
    ge_df.expect_column_values_to_be_unique('customer_id')
    ge_df.expect_column_values_to_be_between('age', min_value=18, max_value=100)
    ge_df.expect_column_values_to_be_in_set(
        'subscription_type', ['basic', 'premium', 'enterprise']
    )

    # Run all expectations against the data
    validation_results = ge_df.validate()

    # Fail the pipeline step if any expectation was not met
    if not validation_results['success']:
        failed_expectations = [
            exp for exp in validation_results['results']
            if not exp['success']
        ]
        raise ValueError(f"Data validation failed: {failed_expectations}")

    # Save validation results as JSON for downstream steps
    with open(validation_result.path, 'w') as f:
        json.dump(validation_results.to_json_dict(), f)

    print("Data validation passed successfully")

Next Steps

Apply these best practices as you build and operate your own Kubeflow pipelines and workloads.