Monitoring and Observability
Learn how to monitor pipeline execution and model performance in Kubeflow.
1. Pipeline Monitoring
Monitor pipeline execution in the Kubeflow Pipelines UI, which the ml-pipeline-ui service exposes:
# Access the dashboard
kubectl port-forward -n kubeflow svc/ml-pipeline-ui 8080:80
# View at http://localhost:8080
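Run status can also be checked from a script through the same forwarded port. The sketch below is a minimal example, not a definitive client: it assumes the Kubeflow Pipelines v2 REST API is reachable at `http://localhost:8080/apis/v2beta1/runs` and that each run object carries `display_name` and `state` fields. The function names `fetch_runs`, `summarize_runs`, and `failed_runs` are illustrative. Older (v1) deployments use a different path and field names, and multi-user installations may also require authentication headers.

```python
import json
import urllib.request
from collections import Counter

# Assumption: KFP v2 REST endpoint behind the port-forward shown above.
RUNS_URL = "http://localhost:8080/apis/v2beta1/runs"


def fetch_runs(url: str = RUNS_URL, timeout: float = 10.0) -> list:
    """Return the run objects from the Pipelines API (first page only)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    return payload.get("runs", [])


def summarize_runs(runs: list) -> dict:
    """Count runs per state, treating a missing state as UNKNOWN."""
    return dict(Counter(run.get("state", "UNKNOWN") for run in runs))


def failed_runs(runs: list) -> list:
    """Return the display names of runs whose state is FAILED."""
    return [
        run.get("display_name", "<unnamed>")
        for run in runs
        if run.get("state") == "FAILED"
    ]
```

With the port-forward active, `summarize_runs(fetch_runs())` gives a per-state count, and `failed_runs(fetch_runs())` lists the runs that need attention.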
2. Model Performance Monitoring
Deploy Prometheus to scrape metrics from the KServe inference pods. Grafana can then use this Prometheus instance as a data source for dashboards:
# model-monitoring.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: model-monitoring-config
  namespace: ml-serving
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'kserve-metrics'
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names:
                - ml-serving
        relabel_configs:
          # Keep only pods that belong to the churn-predictor InferenceService.
          - source_labels: [__meta_kubernetes_pod_label_serving_kserve_io_inferenceservice]
            action: keep
            regex: churn-predictor
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: ml-serving
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      # Pod discovery needs a ServiceAccount allowed to get, list, and watch
      # pods in ml-serving (RBAC objects are not shown here).
      containers:
        - name: prometheus
          # Pin a specific version tag instead of latest for reproducible deployments.
          image: prom/prometheus:latest
          ports:
            - containerPort: 9090
          volumeMounts:
            - name: config
              mountPath: /etc/prometheus
      volumes:
        - name: config
          configMap:
            name: model-monitoring-config
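Once Prometheus is running, its HTTP API offers a quick way to confirm that the scrape job is collecting data. The following is a minimal sketch that assumes Prometheus has been made reachable at `http://localhost:9090` (for example with `kubectl port-forward -n ml-serving deploy/prometheus 9090:9090`). The helper names are illustrative; the `/api/v1/query` endpoint and response shape are those of the standard Prometheus instant-query API.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Prometheus is port-forwarded to this local address.
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(expr: str, base_url: str = PROMETHEUS_URL) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": expr})
    return f"{base_url}/api/v1/query?{params}"


def parse_instant_vector(payload: dict) -> list:
    """Extract (labels, value) pairs from an instant-query response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample looks like {"metric": {...}, "value": [timestamp, "value"]}.
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]


def query(expr: str, base_url: str = PROMETHEUS_URL) -> list:
    """Run an instant query and return the parsed samples."""
    url = build_query_url(expr, base_url)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_instant_vector(json.load(resp))
```

For instance, `query('up{job="kserve-metrics"}')` returns one sample per discovered pod, with a value of 1.0 for targets that Prometheus scraped successfully.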
3. Logging and Alerting
Configure Prometheus alerting rules that fire on high prediction latency, high error rate, and prediction drift:
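# alerting-rules.yaml
# To take effect, mount this ConfigMap into the Prometheus pod and list
# alerts.yml under rule_files in prometheus.yml.
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-alerts
  namespace: ml-serving
data:
  alerts.yml: |
    # Confirm the metric names against your serving runtime's /metrics endpoint.
    groups:
      - name: model_performance
        interval: 30s
        rules:
          - alert: HighPredictionLatency
            # Aggregate buckets by le so the quantile covers all scraped pods.
            expr: histogram_quantile(0.99, sum by (le) (rate(kserve_request_duration_seconds_bucket[5m]))) > 1
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High prediction latency detected"
              description: "99th percentile latency is above 1 second"
          - alert: HighErrorRate
            # Sum both sides so differing label sets cannot break the division.
            expr: sum(rate(kserve_request_error_total[5m])) / sum(rate(kserve_request_total[5m])) > 0.05
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "High error rate detected"
              description: "Error rate is above 5%"
          - alert: ModelDriftDetected
            # model_prediction_distribution_drift is a custom metric that your
            # own monitoring code must export.
            expr: model_prediction_distribution_drift > 0.3
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Model drift detected"
              description: "Prediction distribution has drifted significantly"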
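The `ModelDriftDetected` rule depends on a `model_prediction_distribution_drift` metric, and this guide does not specify how that value is computed. As one possible approach, the sketch below uses the Population Stability Index (PSI) to compare a baseline prediction distribution with a recent one. Both the choice of PSI and the helper names `histogram` and `psi` are assumptions for illustration, not part of KServe or Kubeflow.

```python
import math


def histogram(values: list, edges: list) -> list:
    """Return the fraction of values in each bin defined by edges.

    Bins are [edges[i], edges[i + 1]); the last bin also includes its
    upper edge. Values outside the edges are ignored.
    """
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


def psi(baseline: list, current: list, eps: float = 1e-6) -> float:
    """Population Stability Index between two binned distributions."""
    score = 0.0
    for b, c in zip(baseline, current):
        # Clamp empty bins to eps to avoid log(0) and division by zero.
        b, c = max(b, eps), max(c, eps)
        score += (c - b) * math.log(c / b)
    return score
```

A scheduled job could bin recent prediction scores with `histogram`, compute `psi` against the training-time baseline, and export the result as a gauge named `model_prediction_distribution_drift` (for example with the `prometheus_client` library) so that the alert rule can evaluate it.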
Next Steps
With monitoring in place:
- Best Practices - Implement monitoring best practices
- Troubleshooting - Debug monitoring issues