Best Practices
Follow these best practices to build robust and scalable ML workflows with Kubeflow.
1. Pipeline Organization
Modular Components
- Keep components small and focused on single responsibilities
- Make components reusable across different pipelines
- Use typed inputs and outputs for better validation
Version Control
- Store pipeline definitions in Git
- Tag pipeline versions for reproducibility
- Document pipeline changes in commit messages
Testing
- Test components individually before integration
- Use small datasets for pipeline testing
- Validate outputs at each stage
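As a sketch of those testing points: keeping a component's core logic in a plain Python function lets you unit-test it on a tiny in-memory dataset before wrapping it as a pipeline component. The function and data below are hypothetical, not part of any Kubeflow API:

```python
# Hypothetical transform kept as a plain function so it can be tested
# without running a pipeline.
def clip_ages(rows, low=18, high=100):
    """Clip each record's age into the [low, high] range."""
    return [dict(r, age=min(max(r["age"], low), high)) for r in rows]

# Exercise the component logic on a small dataset first
sample = [
    {"customer_id": 1, "age": 16},
    {"customer_id": 2, "age": 45},
    {"customer_id": 3, "age": 130},
]
cleaned = clip_ages(sample)
assert [r["age"] for r in cleaned] == [18, 45, 100]
```

Once the function passes its tests, it can be wrapped with a component decorator and given typed inputs and outputs.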
2. Resource Management
GPU Allocation
# training-job-gpu.yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: mnist-training
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - name: tensorflow
            # Pin an exact image version in production for reproducibility
            image: tensorflow/tensorflow:latest-gpu
            resources:
              limits:
                nvidia.com/gpu: 1
              requests:
                nvidia.com/gpu: 1
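Many clusters taint their GPU nodes so that only GPU workloads schedule there. If yours does, the worker pod template also needs a matching toleration. A sketch, assuming the common `nvidia.com/gpu` taint key and a hypothetical node label (adjust both to your cluster):

```yaml
# Fragment of the Worker pod template spec, alongside `containers:`
tolerations:
- key: nvidia.com/gpu
  operator: Exists
  effect: NoSchedule
nodeSelector:
  accelerator: nvidia-gpu   # hypothetical label on your GPU node pool
```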
Resource Quotas
# resource-quota.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ml-workloads-quota
  namespace: ml-training
spec:
  hard:
    requests.cpu: "50"
    requests.memory: 100Gi
    requests.nvidia.com/gpu: "8"
    limits.cpu: "100"
    limits.memory: 200Gi
    limits.nvidia.com/gpu: "8"
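Once a ResourceQuota covers CPU and memory, pods in the namespace that omit requests or limits for those resources are rejected. A LimitRange can supply per-container defaults so such pods still schedule; a minimal sketch with illustrative values:

```yaml
# limit-range.yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: ml-training-defaults
  namespace: ml-training
spec:
  limits:
  - type: Container
    default:            # applied as limits when a container sets none
      cpu: "2"
      memory: 4Gi
    defaultRequest:     # applied as requests when a container sets none
      cpu: "1"
      memory: 2Gi
```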
3. Security Best Practices
Use RBAC
# ml-engineer-role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ml-engineer
  namespace: ml-platform
rules:
- apiGroups: ["kubeflow.org"]
  resources: ["notebooks", "experiments", "trials"]
  verbs: ["get", "list", "create", "update", "delete"]
- apiGroups: ["serving.kserve.io"]
  resources: ["inferenceservices"]
  verbs: ["get", "list", "create", "update"]
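A Role grants nothing by itself; it must be bound to a subject. A sketch binding the Role above to a hypothetical `ml-engineers` group from your identity provider:

```yaml
# ml-engineer-binding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ml-engineer-binding
  namespace: ml-platform
subjects:
- kind: Group
  name: ml-engineers   # hypothetical group; substitute your own
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: ml-engineer
  apiGroup: rbac.authorization.k8s.io
```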
Secure Secrets
# database-secret.yaml
apiVersion: v1
kind: Secret
metadata:
  name: db-credentials
  namespace: ml-platform
type: Opaque
stringData:
  username: ml-user
  password: secure-password
  connection-string: postgresql://ml-user:secure-password@postgres:5432/mldata
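Rather than baking credentials into images or pipeline code, containers can read them from the Secret at runtime. A sketch of a container spec fragment referencing the keys above (the container name and image are illustrative):

```yaml
# Fragment of a pod/container spec; names match the Secret above
containers:
- name: trainer
  image: my-training-image:1.0   # hypothetical image
  env:
  - name: DB_CONNECTION_STRING
    valueFrom:
      secretKeyRef:
        name: db-credentials
        key: connection-string
```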
4. Model Versioning
Track model versions and lineage:
# model_versioning.py
from kfp.v2 import dsl
from kfp.v2.dsl import Input, Model

@dsl.component
def register_model_version(
    model: Input[Model],
    model_name: str,
    version: str,
    metadata: dict,
):
    """Register a model version in the model registry."""
    # Imports live inside the function so they run in the component container
    from datetime import datetime

    import mlflow

    mlflow.set_tracking_uri("http://mlflow-server:5000")

    # Register the trained model artifact under a named, versioned entry
    mlflow.register_model(
        model.uri,
        model_name,
        tags={
            'version': version,
            'timestamp': datetime.now().isoformat(),
            **metadata,
        },
    )
    print(f"Model registered: {model_name} version {version}")
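The `metadata` dict passed to the component above is a natural place to record lineage. One hedged sketch of assembling it from run context; the helper and field names are illustrative, not part of the KFP or MLflow APIs:

```python
from datetime import datetime, timezone

def build_model_metadata(git_sha, dataset_version, metrics):
    """Collect lineage fields to attach as registry tags.

    Registry tag values are typically strings, so metric values
    are stringified here.
    """
    return {
        "git_sha": git_sha,
        "dataset_version": dataset_version,
        "registered_at": datetime.now(timezone.utc).isoformat(),
        **{f"metric_{name}": str(value) for name, value in metrics.items()},
    }

meta = build_model_metadata("a1b2c3d", "customers-2024-05", {"auc": 0.91})
assert meta["metric_auc"] == "0.91"
```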
5. Data Validation
Implement data validation in pipelines:
# data_validation.py
from kfp.v2.dsl import component, Input, Output, Dataset

@component(
    base_image='python:3.9',
    packages_to_install=['great-expectations', 'pandas']
)
def validate_data(
    input_dataset: Input[Dataset],
    validation_result: Output[Dataset],
):
    """Validate data quality using Great Expectations."""
    import json

    import great_expectations as ge
    import pandas as pd

    # Load data
    df = pd.read_csv(input_dataset.path)

    # Wrap in a Great Expectations DataFrame
    ge_df = ge.from_pandas(df)

    # Define expectations
    ge_df.expect_column_values_to_not_be_null('customer_id')
    ge_df.expect_column_values_to_be_unique('customer_id')
    ge_df.expect_column_values_to_be_between('age', min_value=18, max_value=100)
    ge_df.expect_column_values_to_be_in_set(
        'subscription_type', ['basic', 'premium', 'enterprise']
    )

    # Run validation
    validation_results = ge_df.validate()

    # Fail the pipeline step if any expectation was not met
    if not validation_results.success:
        failed_expectations = [
            exp for exp in validation_results.results
            if not exp.success
        ]
        raise ValueError(f"Data validation failed: {failed_expectations}")

    # Persist validation results for downstream inspection
    with open(validation_result.path, 'w') as f:
        json.dump(validation_results.to_json_dict(), f)

    print("Data validation passed successfully")
Next Steps
Apply these best practices:
- Troubleshooting - Debug common issues
- Introduction - Review Kubeflow concepts