MLflow + Kubernetes: Production-Grade Model Serving for Sentence Transformers

Part 2 of 3: A practical walk-through of model versioning, registry management, API serving, and GPU-backed Kubernetes deployment.

Introduction

In Part 1, I covered the motivations behind moving to a more structured MLOps setup.

This post focuses on how everything fits together: MLflow, the model registry, FastAPI, and Kubernetes.

The goal is simple: a predictable, reproducible way to train models, log them, promote them, and deploy them – all without downtime.

Everything shown here is based on the system I run in production.

1. Setting Up MLflow Tracking

MLflow acts as the central source of truth. Every experiment, configuration, and model version is logged there.

Python: Logging a training run

Listing 1: MLflow experiment tracking

import mlflow
import mlflow.sentence_transformers
from sentence_transformers import SentenceTransformer

mlflow.set_tracking_uri("http://mlflow:5000")
mlflow.set_experiment("vehicle-defect-prediction")

with mlflow.start_run():
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

    mlflow.log_param("embedding_dim", 384)
    mlflow.log_param("model_name", "MiniLM-L6-v2")

    # Log with the sentence-transformers flavor so the pyfunc wrapper used
    # later (Listing 3) accepts raw text and returns embeddings.
    mlflow.sentence_transformers.log_model(
        model,
        "model",
        registered_model_name="MiniLM-Defect-Predictor"
    )

    mlflow.log_metric("inference_latency_ms", 85.3)
    mlflow.log_metric("gpu_memory_mb", 2048)

This gives you a full record of what was trained, how it was configured, and the resulting performance.
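
If you want to inspect those records programmatically rather than through the UI, a minimal sketch (assuming the tracking URI and experiment name from Listing 1) looks like this:

import mlflow

mlflow.set_tracking_uri("http://mlflow:5000")

# One row per run, with params.* and metrics.* columns, returned as a pandas DataFrame
runs = mlflow.search_runs(experiment_names=["vehicle-defect-prediction"])
print(runs[["run_id", "params.model_name", "metrics.inference_latency_ms"]])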

2. Model Registry and Versioning

Once the run is logged, you can register the model and promote versions through stages like Staging and Production.

Listing 2: Model versioning and stage transitions

from mlflow.tracking import MlflowClient

client = MlflowClient()

# run_id identifies the training run from Listing 1
# (take it from the MLflow UI or from the run object returned by mlflow.start_run()).
run_id = "<run-id-from-listing-1>"

version = client.create_model_version(
    name="MiniLM-Defect-Predictor",
    source=f"runs:/{run_id}/model",
    description="MiniLM model for defect prediction"
)

client.transition_model_version_stage(
    name="MiniLM-Defect-Predictor",
    version=version.version,
    stage="Staging"
)

Promoting to production is just another stage transition:

client.transition_model_version_stage(
    name="MiniLM-Defect-Predictor",
    version=version.version,
    stage="Production"
)

Once that happens, everything downstream – FastAPI, Kubernetes, monitoring – will pull the correct production version.
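
If you want to confirm which version the Production stage currently resolves to, a short sketch using the same client:

from mlflow.tracking import MlflowClient

client = MlflowClient()

# Latest version currently sitting in the Production stage
prod = client.get_latest_versions("MiniLM-Defect-Predictor", stages=["Production"])[0]
print(prod.version, prod.current_stage, prod.source)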

3. FastAPI: Loading the Production Model

FastAPI is the interface layer. Instead of bundling the model with the app, it loads the current production version directly from MLflow.

Listing 3: FastAPI model loading from MLflow registry

import mlflow.pyfunc
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
MODEL_URI = "models:/MiniLM-Defect-Predictor/Production"

class PredictRequest(BaseModel):
    text: str

class ModelCache:
    _model = None

    @classmethod
    def get(cls):
        # Load the current Production version from the registry once per process
        if cls._model is None:
            cls._model = mlflow.pyfunc.load_model(MODEL_URI)
        return cls._model

@app.post("/predict")
def predict(request: PredictRequest):
    model = ModelCache.get()
    embedding = model.predict([request.text])  # NumPy array of shape (1, embedding_dim)
    return {"embedding": embedding.tolist()}

The model is loaded once per process and reused, which avoids repeated GPU initialisation.
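
The Kubernetes manifest in the next section probes a /ready path before sending traffic to a pod, which assumes the API exposes such an endpoint. A minimal sketch, reusing the ModelCache above:

from fastapi import HTTPException

@app.get("/ready")
def ready():
    # Report ready only once the Production model can be loaded;
    # returning 503 keeps the pod out of rotation until then.
    try:
        ModelCache.get()
    except Exception:
        raise HTTPException(status_code=503, detail="model not loaded")
    return {"status": "ready"}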

4. Kubernetes Deployment (GPU + MLflow)

Below is a simplified version of what runs in production. This demonstrates GPU scheduling, environment injection, and readiness checks.

Inference Pod (FastAPI + GPU)

Listing 4: Kubernetes deployment for GPU-backed inference

apiVersion: apps/v1
kind: Deployment
metadata:
  name: carhunch-api
spec:
  replicas: 2
  selector:
    matchLabels: { app: carhunch-api }
  template:
    metadata:
      labels: { app: carhunch-api }
    spec:
      containers:
      - name: api
        image: ghcr.io/yourrepo/carhunch-api:latest
        env:
        - name: MLFLOW_MODEL_URI
          value: "models:/MiniLM-Defect-Predictor/Production"
        resources:
          requests:
            cpu: "1"
            memory: "4Gi"
            nvidia.com/gpu: "1"
          limits:
            cpu: "4"
            memory: "16Gi"
            nvidia.com/gpu: "1"
        ports:
        - containerPort: 8001
        readinessProbe:
          httpGet:
            path: /ready
            port: 8001

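One detail worth noting: the manifest injects MLFLOW_MODEL_URI into the container, while Listing 3 hard-codes the model URI. To actually honour the injected value, the constant can be read from the environment instead, for example:

import os

# Fall back to the hard-coded URI when the environment variable is not set
MODEL_URI = os.getenv("MLFLOW_MODEL_URI", "models:/MiniLM-Defect-Predictor/Production")
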
MLflow Tracking Server Deployment

For simplicity, this uses SQLite as the backend store; in practice you can switch to PostgreSQL or MySQL by pointing --backend-store-uri at that database instead.

Listing 5: MLflow tracking server deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow-tracking
spec:
  replicas: 1
  selector:
    matchLabels: { app: mlflow-tracking }
  template:
    metadata:
      labels: { app: mlflow-tracking }
    spec:
      containers:
      - name: mlflow
        image: ghcr.io/mlflow/mlflow:latest
        # Bind to 0.0.0.0 so the tracking server is reachable from other pods
        args: ["mlflow", "server", "--backend-store-uri", "sqlite:///mlflow.db", "--host", "0.0.0.0"]
        ports:
        - containerPort: 5000

5. Zero-Downtime Updates (Rolling Strategy)

Kubernetes’ rolling update strategy ensures upgrades happen gradually:

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0

When a new image is released, or a new model is promoted in MLflow and you trigger a rollout (for example with kubectl rollout restart deployment/carhunch-api), pods are replaced one at a time while the service stays fully available.

Wrapping Up Part 2

At this point, the core pipeline is in place:

  • MLflow tracking server
  • Experiment and model logging
  • A consistent model registry
  • FastAPI loading production models automatically
  • GPU-backed Kubernetes deployment
  • Zero-downtime updates via rolling releases

In Part 3, we’ll cover:

  • Monitoring and prediction logging
  • Drift detection
  • Canary deployments
  • Rolling updates with model-aware routing
  • Automated model promotion

Part 3 completes the end-to-end workflow. After that, I’ll publish the separate GPU deep-dive.


