Monitoring, Drift Detection and Zero-Downtime Model Releases
Part 3 of 3: Production-grade monitoring, prediction logging, and safe deployment workflows.
Introduction
In Part 1 and Part 2, we built the core of the system: reproducible training, a proper model registry, and Kubernetes-backed deployments.
Now the focus shifts to what happens after a model goes live.
This post covers the production-side essentials:
- Logging predictions for operational visibility
- Detecting model drift
- Canary deployments and safe rollout workflows
- Automated model promotion
- Real-world performance improvements
1. Logging Predictions for Monitoring
To understand how the system behaves in production, every prediction is logged as a lightweight, structured record tied back to model versions via MLflow.
Listing 1: Prediction logging to MLflow
import mlflow
import time

def log_prediction(text, latency, confidence):
    with mlflow.start_run(nested=True):
        mlflow.log_param("input_length", len(text))
        mlflow.log_metric("latency_ms", latency)
        mlflow.log_metric("confidence", confidence)
        mlflow.log_metric("timestamp", time.time())
This gives you enough data to build dashboards showing:
- Latency trends
- Throughput
- Confidence drift
- Input distribution changes
- Model performance over time
Even simple plots can reveal warning signs long before a problem becomes user-visible.
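As a rough illustration of how little glue that takes, the sketch below pulls the logged runs back out of MLflow and plots the latency trend. It is a minimal sketch, assuming the default experiment ID ("0"), the metric names from Listing 1, and matplotlib for plotting; a real dashboard would normally live in whatever visualisation tool you already run.

import matplotlib.pyplot as plt
from mlflow.tracking import MlflowClient

# Pull the logged prediction runs back out of MLflow (experiment "0" is an assumption)
client = MlflowClient()
runs = client.search_runs(
    experiment_ids=["0"],
    filter_string="metrics.latency_ms > 0",
    order_by=["attributes.start_time ASC"],
    max_results=1000,
)

# Each run was created by log_prediction() in Listing 1
timestamps = [r.info.start_time / 1000 for r in runs]   # epoch seconds
latencies = [r.data.metrics["latency_ms"] for r in runs]

plt.plot(timestamps, latencies)
plt.xlabel("time (epoch seconds)")
plt.ylabel("latency (ms)")
plt.title("Prediction latency over time")
plt.show()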
2. Drift Detection Script
A basic example of analysing logged metrics for unusual changes:
Listing 2: Model drift detection
import numpy as np
from mlflow.tracking import MlflowClient

def detect_drift():
    client = MlflowClient()
    runs = client.search_runs(
        experiment_ids=["0"],
        filter_string="metrics.latency_ms > 0",
        max_results=500
    )
    latencies = [r.data.metrics["latency_ms"] for r in runs]
    confs = [r.data.metrics["confidence"] for r in runs]
    # alert() is whatever notification hook you use (Slack, email, PagerDuty, ...)
    if np.mean(latencies) > 120:
        alert("Latency drift detected")
    if np.mean(confs) < 0.75:
        alert("Confidence drift detected")
You can plug in more advanced statistical tests later (KL divergence, embedding space drift, or decayed moving averages).
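As a sketch of what one of those more advanced tests could look like, the snippet below compares recent confidence scores against a stored baseline using KL divergence over histogram bins. The bin count, baseline window and threshold are illustrative assumptions, not values from this system.

import numpy as np

def kl_divergence(p, q, eps=1e-9):
    # KL(p || q) between two histograms treated as discrete distributions
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

def confidence_drift(baseline_confs, recent_confs, threshold=0.1):
    # Bin both samples over [0, 1] and compare the resulting distributions
    bins = np.linspace(0.0, 1.0, 21)          # 20 equal-width bins (an arbitrary choice)
    baseline_hist, _ = np.histogram(baseline_confs, bins=bins)
    recent_hist, _ = np.histogram(recent_confs, bins=bins)
    return kl_divergence(recent_hist, baseline_hist) > threshold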
3. Canary Deployment (10% Traffic)
A canary deployment lets you test the new model under real load before promoting it fully.
Versioned pods:
Listing 3: Canary deployment configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: carhunch-api-v2
spec:
  replicas: 1
  selector:               # required by apps/v1; must match the pod labels below
    matchLabels:
      app: carhunch-api
      version: "v2"
  template:
    metadata:
      labels:
        app: carhunch-api
        version: "v2"
    # pod spec (containers, GPU resources, etc.) omitted for brevity
The service routes traffic to both versions:
selector:
  app: carhunch-api
With 1 replica of v2 and (for example) 9 replicas of v1, the canary receives roughly 10% of requests.
Kubernetes handles the balancing naturally.
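To compare the canary against the stable version later, it helps if every logged prediction records which model version served it. Below is a minimal sketch of that, extending the helper from Listing 1 and assuming (hypothetically) that each deployment injects its version label into the container as a MODEL_VERSION environment variable.

import os
import time
import mlflow

# Hypothetical: the deployment sets MODEL_VERSION (e.g. "v1" or "v2") on the
# container, so canary and stable traffic can be separated when querying MLflow.
MODEL_VERSION = os.environ.get("MODEL_VERSION", "unknown")

def log_prediction(text, latency, confidence):
    with mlflow.start_run(nested=True):
        mlflow.set_tag("model_version", MODEL_VERSION)
        mlflow.log_param("input_length", len(text))
        mlflow.log_metric("latency_ms", latency)
        mlflow.log_metric("confidence", confidence)
        mlflow.log_metric("timestamp", time.time())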
4. Automated Promotion Script
A simple automated workflow to move models through Staging → Canary → Production:
Listing 4: Automated model promotion workflow
import subprocess
import time
from mlflow.tracking import MlflowClient

client = MlflowClient()

def promote_model(version):
    # 1. Move to staging
    client.transition_model_version_stage(
        "MiniLM-Defect-Predictor",
        version,
        "Staging"
    )
    # 2. Deploy canary
    subprocess.run(["kubectl", "scale", "deployment/carhunch-api-v2", "--replicas=1"])
    # 3. Wait and collect metrics
    time.sleep(3600)
    # ...evaluate metrics here...
    # 4. Promote to production if everything looks good
    client.transition_model_version_stage(
        "MiniLM-Defect-Predictor",
        version,
        "Production"
    )
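The evaluation step is deliberately left open above. One possible shape for it, as a sketch that assumes predictions are tagged with the serving version (see the MODEL_VERSION sketch earlier) and reuses the thresholds from Listing 2:

import numpy as np
from mlflow.tracking import MlflowClient

def canary_healthy(version_tag="v2", max_latency_ms=120, min_confidence=0.75):
    # One possible canary check (an assumption, not the production logic):
    # pull the canary's recent runs and apply the drift-detector thresholds.
    client = MlflowClient()
    runs = client.search_runs(
        experiment_ids=["0"],
        filter_string=f"tags.model_version = '{version_tag}'",
        max_results=500,
    )
    if not runs:
        return False  # no traffic has reached the canary yet
    latencies = [r.data.metrics["latency_ms"] for r in runs]
    confs = [r.data.metrics["confidence"] for r in runs]
    return np.mean(latencies) <= max_latency_ms and np.mean(confs) >= min_confidence

If the check fails, the script can simply scale the canary back down and leave the Production stage untouched.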
This keeps the deployment pipeline simple but still safe:
- No big-bang releases
- Measurable confidence before promotion
- Fully automated transitions if desired
5. Performance Gains
| Metric | Before | After | Improvement |
|---|---|---|---|
| Deployment downtime | 15–30 min | 0 min | 100% |
| Inference latency | ~120ms | ~85ms | ~29% faster |
| Prediction cost | £500/mo | £5/mo | 99% cheaper |
| GPU stability | Frequent leaks | Stable | Fully fixed |
| Traceability | None | Full MLflow registry | 100% |
These improvements came primarily from:
- Moving off external API calls
- Running inference locally on a small GPU
- Using MLflow for proper version tracking
- Cleaner model lifecycle management
Closing: What's Next
With this final part complete, the full workflow now covers:
- MLflow model registry and experiment tracking
- FastAPI model serving
- GPU-backed Kubernetes deployments
- Prediction monitoring and drift detection
- Canary releases and safe rollouts
- Zero-downtime updates
There's one major topic left that deserves its own article:
Deep GPU + Kubernetes Optimisation
Memory fragmentation, batching strategies, GPU sharing, node feature discovery, device plugin tuning - the stuff that affects real-world performance far more than most people expect.
That full technical deep-dive is coming next.