Model Deployment and Serving
Learn how to deploy and serve ML models using KServe (formerly KFServing) in Kubeflow.
Deploy Model with KServe
KServe provides a simple way to deploy models for inference:
1. Create InferenceService
# inference-service.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: churn-predictor
  namespace: ml-serving
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 10
    scaleTarget: 50
    scaleMetric: concurrency
    sklearn:
      storageUri: gs://my-bucket/models/churn-model
      resources:
        requests:
          cpu: "100m"
          memory: "256Mi"
        limits:
          cpu: "1"
          memory: "1Gi"
Apply the InferenceService:
kubectl apply -f inference-service.yaml
# Wait for service to be ready
kubectl wait --for=condition=Ready inferenceservice/churn-predictor -n ml-serving --timeout=300s
# Get service URL
kubectl get inferenceservice churn-predictor -n ml-serving
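To call the model from outside the cluster, send the request through the ingress gateway and set the Host header to the hostname assigned to the InferenceService. The commands below are a minimal sketch that assumes the default istio-ingressgateway service in the istio-system namespace and a file named input.json containing the same payload used in the next step; adjust the names and ports for your installation.
# Resolve the ingress gateway address (assumes istio-ingressgateway in istio-system)
INGRESS_HOST=$(kubectl get svc istio-ingressgateway -n istio-system -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
INGRESS_PORT=$(kubectl get svc istio-ingressgateway -n istio-system -o jsonpath='{.spec.ports[?(@.name=="http2")].port}')
# Hostname assigned to the InferenceService
SERVICE_HOSTNAME=$(kubectl get inferenceservice churn-predictor -n ml-serving -o jsonpath='{.status.url}' | cut -d / -f 3)
# Send a prediction request (input.json is a hypothetical file holding the request body)
curl -H "Host: ${SERVICE_HOSTNAME}" -H "Content-Type: application/json" \
  "http://${INGRESS_HOST}:${INGRESS_PORT}/v1/models/churn-predictor:predict" \
  -d @input.json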
2. Test the Deployed Model
# test_inference.py
import requests
import json

# Get the inference service URL
SERVICE_URL = "http://churn-predictor.ml-serving.svc.cluster.local/v1/models/churn-predictor:predict"

# Prepare input data
input_data = {
    "instances": [
        {
            "account_length": 128,
            "international_plan": 0,
            "voice_mail_plan": 1,
            "number_vmail_messages": 25,
            "total_day_minutes": 265.1,
            "total_day_calls": 110,
            "total_eve_minutes": 197.4,
            "total_eve_calls": 99,
            "total_night_minutes": 244.7,
            "total_night_calls": 91,
            "total_intl_minutes": 10.0,
            "total_intl_calls": 3
        }
    ]
}

# Make prediction request
response = requests.post(
    SERVICE_URL,
    headers={'Content-Type': 'application/json'},
    data=json.dumps(input_data)
)

# Print prediction
if response.status_code == 200:
    prediction = response.json()
    print(f"Prediction: {prediction}")
else:
    print(f"Error: {response.status_code} - {response.text}")
3. Canary Deployment for A/B Testing
Roll out a new model version alongside the existing one. With KServe v1beta1 you do this by updating the existing InferenceService to point at the new model and setting canaryTrafficPercent; KServe keeps the previously rolled-out revision serving the rest of the traffic:
# canary-deployment.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: churn-predictor
  namespace: ml-serving
spec:
  predictor:
    minReplicas: 1
    canaryTrafficPercent: 20
    sklearn:
      storageUri: gs://my-bucket/models/churn-model-v2
      resources:
        requests:
          cpu: "100m"
          memory: "256Mi"
Apply this update with kubectl apply -f canary-deployment.yaml. KServe then routes 20% of traffic to the new model version (churn-model-v2), while the previously rolled-out revision continues to serve the remaining 80%.
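To watch how traffic is split and to promote the canary once it looks healthy, you can use commands like the sketch below: the PREV and LATEST columns in the InferenceService output show the percentage routed to each revision, and promotion raises canaryTrafficPercent to 100 (or removes the field).
# The PREV and LATEST columns show the traffic percentage on each revision
kubectl get inferenceservice churn-predictor -n ml-serving
# Promote the canary by sending all traffic to the latest revision
kubectl patch inferenceservice churn-predictor -n ml-serving \
  --type merge -p '{"spec": {"predictor": {"canaryTrafficPercent": 100}}}'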
Next Steps
After deploying your models:
- Monitoring - Set up monitoring for deployed models
- Best Practices - Follow deployment best practices
- Troubleshooting - Resolve common deployment issues