Monitoring, Drift Detection and Zero-Downtime Model Releases

Part 3 of 3: Production-grade monitoring, prediction logging, and safe deployment workflows.

Introduction

In Part 1 and Part 2, we built the core of the system: reproducible training, a proper model registry, and Kubernetes-backed deployments.

Now the focus shifts to what happens after a model goes live.

This post covers the production-side essentials:

  • Logging predictions for operational visibility
  • Detecting model drift
  • Canary deployments and safe rollout workflows
  • Automated model promotion
  • The real-world performance improvements

1. Logging Predictions for Monitoring

To understand how the system behaves in production, every prediction is logged – lightweight, structured, and tied back to model versions via MLflow.

Listing 1: Prediction logging to MLflow

import mlflow
import time

def log_prediction(text, latency, confidence):
    # Each prediction becomes its own (nested) MLflow run, so it stays tied to
    # the parent model run and version. Latency is expected in milliseconds.
    with mlflow.start_run(nested=True):
        mlflow.log_param("input_length", len(text))
        mlflow.log_metric("latency_ms", latency)
        mlflow.log_metric("confidence", confidence)
        mlflow.log_metric("timestamp", time.time())

This gives you enough data to build dashboards showing:

  • Latency trends
  • Throughput
  • Confidence drift
  • Input distribution changes
  • Model performance over time

Even simple plots can reveal early warning signs long before they become user-visible issues.
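
As a quick sketch of what "simple plots" means here, the following pulls the logged runs back out of MLflow and charts latency over time. It assumes the same default experiment "0" used in the drift script below, and that matplotlib is installed:

import matplotlib.pyplot as plt
from mlflow.tracking import MlflowClient

client = MlflowClient()
runs = client.search_runs(experiment_ids=["0"], max_results=500)

# Collect (timestamp, latency) pairs from runs that logged both metrics
points = sorted(
    (r.data.metrics["timestamp"], r.data.metrics["latency_ms"])
    for r in runs
    if "timestamp" in r.data.metrics and "latency_ms" in r.data.metrics
)

timestamps = [t for t, _ in points]
latencies = [l for _, l in points]

plt.plot(timestamps, latencies)
plt.xlabel("Unix timestamp")
plt.ylabel("Latency (ms)")
plt.title("Prediction latency over time")
plt.show()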

2. Drift Detection Script

A basic example of analysing logged metrics for unusual changes:

Listing 2: Model drift detection

import numpy as np
from mlflow.tracking import MlflowClient

def detect_drift():
    client = MlflowClient()

    # Pull the most recent prediction runs that logged a latency metric
    runs = client.search_runs(
        experiment_ids=["0"],
        filter_string="metrics.latency_ms > 0",
        max_results=500
    )

    latencies = [r.data.metrics["latency_ms"] for r in runs]
    confs = [r.data.metrics["confidence"] for r in runs if "confidence" in r.data.metrics]

    # Deliberately simple thresholds; alert() is whatever notification
    # hook you already use (Slack, email, PagerDuty, ...)
    if np.mean(latencies) > 120:
        alert("Latency drift detected")

    if np.mean(confs) < 0.75:
        alert("Confidence drift detected")

You can plug in more advanced statistical tests later (KL divergence, embedding space drift, or decayed moving averages).
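
As one concrete option, here's a minimal histogram-based KL divergence check you could bolt onto the same logged confidences. The reference window, bin count and threshold are illustrative, not values from this system:

import numpy as np

def kl_divergence(p, q, eps=1e-9):
    # KL(P || Q) between two histograms, normalised to sum to 1
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

def confidence_drift(reference_confs, recent_confs, threshold=0.1, bins=20):
    # Bin both samples on the same edges so the histograms are comparable
    edges = np.linspace(0.0, 1.0, bins + 1)
    ref_hist, _ = np.histogram(reference_confs, bins=edges)
    new_hist, _ = np.histogram(recent_confs, bins=edges)
    return kl_divergence(new_hist, ref_hist) > threshold

Calling confidence_drift(reference_confs, confs) inside detect_drift() would then give you a distribution-aware signal alongside the simple mean checks.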

3. Canary Deployment (10% Traffic)

A canary deployment lets you test the new model under real load before promoting it fully.

Versioned pods:

Listing 3: Canary deployment configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: carhunch-api-v2
spec:
  replicas: 1
  selector:
    matchLabels:
      app: carhunch-api
      version: "v2"
  template:
    metadata:
      labels:
        app: carhunch-api
        version: "v2"
    # spec.containers omitted for brevity

The service routes traffic to both versions:

selector:
  app: carhunch-api

With 1 replica of v2 and (for example) 9 replicas of v1, the canary receives roughly 10% of requests.

Kubernetes handles the balancing naturally.
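
For completeness, the Service that sits in front of both Deployments looks roughly like this. The Service name and port numbers are placeholders for whatever the API actually uses:

apiVersion: v1
kind: Service
metadata:
  name: carhunch-api
spec:
  selector:
    app: carhunch-api      # matches v1 and v2 pods, so traffic splits by replica count
  ports:
    - port: 80             # placeholder: external service port
      targetPort: 8000     # placeholder: container port the API listens on

Because the selector only keys on app, scaling either Deployment changes the traffic ratio without touching the Service.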

4. Automated Promotion Script

A simple automated workflow to move models through Staging → Canary → Production:

Listing 4: Automated model promotion workflow

import subprocess
import time

from mlflow.tracking import MlflowClient

client = MlflowClient()

def promote_model(version):
    # 1. Move to staging
    client.transition_model_version_stage(
        "MiniLM-Defect-Predictor",
        version,
        "Staging"
    )

    # 2. Deploy canary
    subprocess.run(["kubectl", "scale", "deployment/carhunch-api-v2", "--replicas=1"])

    # 3. Wait and collect metrics
    time.sleep(3600)

    # ...evaluate metrics here...

    # 4. Promote to production if everything looks good
    client.transition_model_version_stage(
        "MiniLM-Defect-Predictor",
        version,
        "Production"
    )
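
The evaluation step is left open above on purpose. One possible shape for it, reusing the thresholds from Listing 2, is sketched here; the canary_passed helper and its defaults are illustrative, not part of the actual pipeline:

import numpy as np
from mlflow.tracking import MlflowClient

def canary_passed(max_latency_ms=120, min_confidence=0.75):
    # Look at the most recent prediction runs and apply the same simple
    # thresholds used by the drift checks in Listing 2
    client = MlflowClient()
    runs = client.search_runs(
        experiment_ids=["0"],
        filter_string="metrics.latency_ms > 0",
        max_results=200,
        order_by=["attributes.start_time DESC"],
    )
    if not runs:
        return False

    latencies = [r.data.metrics["latency_ms"] for r in runs]
    confs = [r.data.metrics["confidence"] for r in runs if "confidence" in r.data.metrics]

    latency_ok = np.mean(latencies) <= max_latency_ms
    confidence_ok = bool(confs) and np.mean(confs) >= min_confidence
    return latency_ok and confidence_ok

Dropping a check like this between steps 3 and 4, and scaling the canary back down to zero if it fails, turns the hour-long sleep into an actual gate rather than a pause.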

This keeps the deployment pipeline simple but still safe:

  • No big-bang releases
  • Measurable confidence before promotion
  • Fully automated transitions if desired

5. Performance Gains

Metric               | Before          | After                 | Improvement
---------------------|-----------------|-----------------------|-------------
Deployment downtime  | 15–30 min       | 0 min                 | 100%
Inference latency    | ~120 ms         | ~85 ms                | ~29% faster
Prediction cost      | £500/mo         | £5/mo                 | 99% cheaper
GPU stability        | Frequent leaks  | Stable                | Fully fixed
Traceability         | None            | Full MLflow registry  | 100%

These improvements came primarily from:

  • Moving off external API calls
  • Running inference locally on a small GPU
  • Using MLflow for proper version tracking
  • Cleaner model lifecycle management

Wrapping Up: What's Next

With this final part complete, the full workflow now covers:

  • MLflow model registry and experiment tracking
  • FastAPI model serving
  • GPU-backed Kubernetes deployments
  • Prediction monitoring and drift detection
  • Canary releases and safe rollouts
  • Zero-downtime updates

There's one major topic left that deserves its own article:

Deep GPU + Kubernetes Optimisation

Memory fragmentation, batching strategies, GPU sharing, node feature discovery, device plugin tuning - the stuff that affects real-world performance far more than most people expect.

That full technical deep-dive is coming next.


