Monitoring and Observability

Learn how to monitor pipeline execution and model performance in Kubeflow.

1. Pipeline Monitoring

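Monitor pipeline runs in the Kubeflow Pipelines UI by port-forwarding its service to your machine:

# Forward local port 8080 to port 80 of the ml-pipeline-ui service
kubectl port-forward -n kubeflow svc/ml-pipeline-ui 8080:80

# While the port-forward is running, open http://localhost:8080 in a browser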
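
To confirm from a script that the port-forward is up, a minimal sketch like the following can probe the forwarded address. The function name `dashboard_is_reachable` is hypothetical and not part of Kubeflow. The check only confirms that something answers HTTP on the URL, not that any pipeline is healthy.

```python
import urllib.error
import urllib.request


def dashboard_is_reachable(url="http://localhost:8080", timeout=3.0):
    """Return True if any HTTP response comes back from the URL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just with an error status (for example 404).
        return True
    except OSError:
        # Connection refused, timeout, or DNS failure:
        # the port-forward is probably not running.
        return False


if __name__ == "__main__":
    state = "reachable" if dashboard_is_reachable() else "not reachable"
    print(f"The pipelines UI is {state} at http://localhost:8080")
```

If this reports "not reachable", check that the `kubectl port-forward` command is still running in another terminal.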

2. Model Performance Monitoring

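Set up Prometheus to scrape metrics from the KServe inference pods; Grafana can then use it as a data source for dashboards. Pod discovery through kubernetes_sd_configs requires that the Prometheus pod's service account can list and watch pods in the ml-serving namespace; the manifest below does not create that RBAC and covers Prometheus only: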

# model-monitoring.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: model-monitoring-config
  namespace: ml-serving
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'kserve-metrics'
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names:
                - ml-serving
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_serving_kserve_io_inferenceservice]
            action: keep
            regex: churn-predictor
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: ml-serving
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
      - name: prometheus
        image: prom/prometheus:latest  # consider pinning a specific version tag for reproducible deployments
        ports:
        - containerPort: 9090
        volumeMounts:
        - name: config
          mountPath: /etc/prometheus
      volumes:
      - name: config
        configMap:
          name: model-monitoring-config
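
The relabel rule in the scrape config decides which discovered pods are actually scraped. As a rough illustration (not Prometheus's own code), the sketch below mimics how an `action: keep` rule behaves: the source label values are joined with `;`, and the target is kept only if the regex matches the whole joined string. The pod names are made up for the example, and Python's `re` stands in for the RE2 engine Prometheus uses, which is an approximation that holds for a plain literal like this one.

```python
import re

SEPARATOR = ";"  # Prometheus' default separator for joined source label values

# Kubernetes label serving.kserve.io/inferenceservice as exposed by pod discovery
LABEL = "__meta_kubernetes_pod_label_serving_kserve_io_inferenceservice"


def keep(target_labels, source_labels, regex):
    """Return True if a 'keep' relabel rule would retain this target."""
    # A label the target does not have is treated as an empty string.
    joined = SEPARATOR.join(target_labels.get(name, "") for name in source_labels)
    # Relabel regexes are anchored at both ends, hence fullmatch.
    return re.fullmatch(regex, joined) is not None


# Hypothetical discovered pods in the ml-serving namespace.
pods = [
    {"__meta_kubernetes_pod_name": "churn-predictor-abc", LABEL: "churn-predictor"},
    {"__meta_kubernetes_pod_name": "churn-predictor-v2-xyz", LABEL: "churn-predictor-v2"},
    {"__meta_kubernetes_pod_name": "prometheus-123"},  # no InferenceService label
]

kept = [
    pod["__meta_kubernetes_pod_name"]
    for pod in pods
    if keep(pod, [LABEL], "churn-predictor")
]
print(kept)
```

Because the regex is anchored, a pod labelled `churn-predictor-v2` is dropped; monitoring several InferenceServices would need a broader pattern such as `churn-predictor.*`, or one scrape job per service.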

3. Logging and Alerting

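Configure alerts for model performance degradation. Prometheus only evaluates these rules if the ConfigMap is mounted into the Prometheus pod and the rules file is listed under rule_files in prometheus.yml, which the Deployment above does not yet do. The rules: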

# alerting-rules.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-alerts
  namespace: ml-serving
data:
  alerts.yml: |
    groups:
      - name: model_performance
        interval: 30s
        rules:
          - alert: HighPredictionLatency
            expr: histogram_quantile(0.99, rate(kserve_request_duration_seconds_bucket[5m])) > 1
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High prediction latency detected"
              description: "99th percentile latency is above 1 second"
          
          - alert: HighErrorRate
            expr: rate(kserve_request_error_total[5m]) / rate(kserve_request_total[5m]) > 0.05
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "High error rate detected"
              description: "Error rate is above 5%"
          
          - alert: ModelDriftDetected
            expr: model_prediction_distribution_drift > 0.3
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: "Model drift detected"
              description: "Prediction distribution has drifted significantly"
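
The `ModelDriftDetected` rule relies on a `model_prediction_distribution_drift` metric that something has to compute and export; the configuration above does not say how. As one possible approach, and purely as an assumption, the sketch below bins predicted probabilities, measures the total variation distance between a reference window and a current window, and renders the result as a line in the Prometheus text exposition format. The function names, the bin count, and the sample data are all hypothetical.

```python
from collections import Counter


def bin_fractions(predictions, n_bins=10):
    """Fraction of predictions (probabilities in [0, 1]) in each equal-width bin."""
    if not predictions:
        raise ValueError("predictions must not be empty")
    # Clamp so that 1.0 lands in the last bin and negatives land in the first.
    counts = Counter(
        max(0, min(int(p * n_bins), n_bins - 1)) for p in predictions
    )
    total = len(predictions)
    return [counts.get(i, 0) / total for i in range(n_bins)]


def total_variation_distance(reference, current, n_bins=10):
    """Drift score in [0, 1]: 0 means identical binned distributions."""
    ref = bin_fractions(reference, n_bins)
    cur = bin_fractions(current, n_bins)
    return 0.5 * sum(abs(r - c) for r, c in zip(ref, cur))


def exposition_line(value, metric="model_prediction_distribution_drift"):
    """Format one gauge sample in the Prometheus text exposition format."""
    return f"{metric} {value:.4f}"


# Hypothetical prediction windows: training-time scores vs. recent traffic.
reference = [0.12, 0.25, 0.27, 0.31, 0.85, 0.93]
current = [0.72, 0.81, 0.84, 0.91, 0.95, 0.97]

drift = total_variation_distance(reference, current)
print(exposition_line(drift))
```

With a score defined this way, the rule's `> 0.3` threshold would mean that at least 30% of the probability mass has moved between bins. A different drift measure would need its own threshold.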

Next Steps

With monitoring in place, continue with:

Best Practices - Implement monitoring best practices
Troubleshooting - Debug monitoring issues
