Model Deployment and Serving
Learn how to deploy and serve ML models using KServe (formerly KFServing) in Kubeflow.
Deploy Model with KServe
KServe provides a simple way to deploy models for inference:
1. Create InferenceService
# inference-service.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: churn-predictor
  namespace: ml-serving
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 10
    scaleTarget: 50
    scaleMetric: concurrency
    sklearn:
      storageUri: gs://my-bucket/models/churn-model
      resources:
        requests:
          cpu: "100m"
          memory: "256Mi"
        limits:
          cpu: "1"
          memory: "1Gi"
Apply the InferenceService:
kubectl apply -f inference-service.yaml
# Wait for service to be ready
kubectl wait --for=condition=Ready inferenceservice/churn-predictor -n ml-serving --timeout=300s
# Get service URL
kubectl get inferenceservice churn-predictor -n ml-serving
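To call the model from outside the cluster, send the request through the ingress gateway and set the Host header to the hostname assigned to the InferenceService. The commands below are a minimal sketch that assumes the default istio-ingressgateway service in the istio-system namespace and a file named input.json containing the same payload used in the next step; adjust the names and ports for your installation.
# Resolve the ingress gateway address (assumes istio-ingressgateway in istio-system)
INGRESS_HOST=$(kubectl get svc istio-ingressgateway -n istio-system -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
INGRESS_PORT=$(kubectl get svc istio-ingressgateway -n istio-system -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')
# Hostname assigned to the InferenceService
SERVICE_HOSTNAME=$(kubectl get inferenceservice churn-predictor -n ml-serving -o jsonpath='{.status.url}' | cut -d / -f 3)
# Send a prediction request (input.json is a hypothetical file holding the request body)
curl -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" \
  "http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/churn-predictor:predict" \
  -d @input.json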
2. Test the Deployed Model
# test_inference.py
import requests
import json

# Get the inference service URL
SERVICE_URL = "http://churn-predictor.ml-serving.svc.cluster.local/v1/models/churn-predictor:predict"

# Prepare input data
input_data = {
    "instances": [
        {
            "account_length": 128,
            "international_plan": 0,
            "voice_mail_plan": 1,
            "number_vmail_messages": 25,
            "total_day_minutes": 265.1,
            "total_day_calls": 110,
            "total_eve_minutes": 197.4,
            "total_eve_calls": 99,
            "total_night_minutes": 244.7,
            "total_night_calls": 91,
            "total_intl_minutes": 10.0,
            "total_intl_calls": 3
        }
    ]
}

# Make prediction request
response = requests.post(
    SERVICE_URL,
    headers={'Content-Type': 'application/json'},
    data=json.dumps(input_data)
)

# Print prediction
if response.status_code == 200:
    prediction = response.json()
    print(f"Prediction: {prediction}")
else:
    print(f"Error: {response.status_code} - {response.text}")
3. Canary Deployment for A/B Testing
Roll out a new model version alongside the existing one. With KServe v1beta1 you do this by updating the existing InferenceService to point at the new model and setting canaryTrafficPercent; KServe keeps the previously rolled-out revision serving the rest of the traffic:
# canary-deployment.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: churn-predictor
  namespace: ml-serving
spec:
  predictor:
    minReplicas: 1
    canaryTrafficPercent: 20
    sklearn:
      storageUri: gs://my-bucket/models/churn-model-v2
      resources:
        requests:
          cpu: "100m"
          memory: "256Mi"
Apply this update with kubectl apply -f canary-deployment.yaml. KServe then routes 20% of traffic to the new model version (churn-model-v2), while the previously rolled-out revision continues to serve the remaining 80%.
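To watch how traffic is split and to promote the canary once it looks healthy, you can use commands like the sketch below: the PREV and LATEST columns in the InferenceService output show the percentage routed to each revision, and promotion raises canaryTrafficPercent to 100 (or removes the field).
# The PREV and LATEST columns show the traffic percentage on each revision
kubectl get inferenceservice churn-predictor -n ml-serving
# Promote the canary by sending all traffic to the latest revision
kubectl patch inferenceservice churn-predictor -n ml-serving \
  --type merge -p '{"spec": {"predictor": {"canaryTrafficPercent": 100}}}'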
Next Steps
After deploying your models:
- Monitoring - Set up monitoring for deployed models
- Best Practices - Follow deployment best practices
- Troubleshooting - Resolve common deployment issues