MLflow + Kubernetes: Production-Grade Model Serving for Sentence Transformers
Part 2 of 3: A practical walk-through of model versioning, registry management, API serving, and GPU-backed Kubernetes deployment.
Introduction
In Part 1, I covered the motivations behind moving to a more structured MLOps setup.
This post focuses on how everything fits together: MLflow, the model registry, FastAPI, and Kubernetes.
The goal is simple: a predictable, reproducible way to train models, log them, promote them, and deploy them – all without downtime.
Everything shown here is based on the system I run in production.
1. Setting Up MLflow Tracking
MLflow acts as the central source of truth. Every experiment, configuration, and model version is logged there.
Python: Logging a training run
Listing 1: MLflow experiment tracking
import mlflow
import mlflow.pytorch
from sentence_transformers import SentenceTransformer

mlflow.set_tracking_uri("http://mlflow:5000")
mlflow.set_experiment("vehicle-defect-prediction")

with mlflow.start_run():
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

    mlflow.log_param("embedding_dim", 384)
    mlflow.log_param("model_name", "MiniLM-L6-v2")

    mlflow.pytorch.log_model(
        model,
        "model",
        registered_model_name="MiniLM-Defect-Predictor"
    )

    mlflow.log_metric("inference_latency_ms", 85.3)
    mlflow.log_metric("gpu_memory_mb", 2048)
This gives you a full record of what was trained, how it was configured, and the resulting performance.
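If you want to confirm what was recorded, you can query the experiment back out of the tracking server. A minimal sketch, assuming the same tracking URI and experiment name as Listing 1 (the selected columns are standard MLflow run fields):

import mlflow

mlflow.set_tracking_uri("http://mlflow:5000")

# Fetch the experiment's runs as a DataFrame and inspect the logged values
runs = mlflow.search_runs(experiment_names=["vehicle-defect-prediction"])
print(runs[["run_id", "params.model_name", "metrics.inference_latency_ms"]])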
2. Model Registry and Versioning
Once the run is logged, you can register the model and promote versions through stages like Staging and Production.
Listing 2: Model versioning and stage transitions
from mlflow.tracking import MlflowClient

client = MlflowClient()

version = client.create_model_version(
    name="MiniLM-Defect-Predictor",
    source="runs:/<run_id>/model",  # <run_id> is the ID of the training run from Listing 1
    description="MiniLM model for defect prediction"
)

client.transition_model_version_stage(
    name="MiniLM-Defect-Predictor",
    version=version.version,
    stage="Staging"
)
Promoting to production is just another stage transition:
client.transition_model_version_stage(
    name="MiniLM-Defect-Predictor",
    version=version.version,
    stage="Production"
)
Once that happens, everything downstream – FastAPI, Kubernetes, monitoring – will pull the correct production version.
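Downstream consumers never reference a version number directly; they resolve the stage by name. As a minimal sketch of what that lookup could look like (the tracking URI matches Listing 1; get_latest_versions is the standard client call for stage lookups):

from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="http://mlflow:5000")

# Resolve whichever version currently holds the Production stage
prod = client.get_latest_versions("MiniLM-Defect-Predictor", stages=["Production"])[0]
print(f"Production model: version {prod.version} (run {prod.run_id})")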
3. FastAPI: Loading the Production Model
FastAPI is the interface layer. Instead of bundling the model with the app, it loads the current production version directly from MLflow.
Listing 3: FastAPI model loading from MLflow registry
import mlflow.pyfunc
from fastapi import FastAPI

app = FastAPI()

MODEL_URI = "models:/MiniLM-Defect-Predictor/Production"

class ModelCache:
    _model = None

    @classmethod
    def get(cls):
        if cls._model is None:
            cls._model = mlflow.pyfunc.load_model(MODEL_URI)
        return cls._model

@app.post("/predict")
def predict(text: str):
    model = ModelCache.get()
    embedding = model.predict([text])
    return {"embedding": embedding.tolist()}
The model is loaded once per process and reused, which avoids repeated GPU initialisation.
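If you'd rather not pay that load cost on the first request, the same cache can be primed when the process starts. A minimal sketch using a FastAPI startup hook, added on top of Listing 3 (the hook itself is not part of the original app):

@app.on_event("startup")
def warm_model() -> None:
    # Prime the cache at startup so the first /predict request
    # doesn't wait for the MLflow download and GPU initialisation.
    ModelCache.get()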
4. Kubernetes Deployment (GPU + MLflow)
Below is a simplified version of what runs in production. This demonstrates GPU scheduling, environment injection, and readiness checks.
Inference Pod (FastAPI + GPU)
Listing 4: Kubernetes deployment for GPU-backed inference
apiVersion: apps/v1
kind: Deployment
metadata:
  name: carhunch-api
spec:
  replicas: 2
  selector:
    matchLabels: { app: carhunch-api }
  template:
    metadata:
      labels: { app: carhunch-api }
    spec:
      containers:
        - name: api
          image: ghcr.io/yourrepo/carhunch-api:latest
          env:
            - name: MLFLOW_MODEL_URI
              value: "models:/MiniLM-Defect-Predictor/Production"
          resources:
            requests:
              cpu: "1"
              memory: "4Gi"
              nvidia.com/gpu: "1"
            limits:
              cpu: "4"
              memory: "16Gi"
              nvidia.com/gpu: "1"
          ports:
            - containerPort: 8001
          readinessProbe:
            httpGet:
              path: /ready
              port: 8001
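The readiness probe points at a /ready endpoint that Listing 3 doesn't define. A minimal sketch of what it could look like, assuming it lives in the same FastAPI app and simply reports whether the model cache has been populated:

from fastapi import Response

@app.get("/ready")
def ready(response: Response):
    # Kubernetes keeps the pod out of the Service until this returns 200,
    # so report ready only once the production model has been loaded.
    if ModelCache._model is None:
        response.status_code = 503
        return {"status": "loading"}
    return {"status": "ready"}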
MLflow Tracking Server Deployment
For simplicity, this uses SQLite as the backend store; in practice you can point --backend-store-uri at PostgreSQL or MySQL instead.
Listing 5: MLflow tracking server deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow-tracking
spec:
  replicas: 1
  selector:
    matchLabels: { app: mlflow-tracking }
  template:
    metadata:
      labels: { app: mlflow-tracking }
    spec:
      containers:
        - name: mlflow
          image: ghcr.io/mlflow/mlflow:latest
          # --host 0.0.0.0 so the server is reachable from other pods (the default binds to 127.0.0.1)
          args: ["mlflow", "server", "--backend-store-uri", "sqlite:///mlflow.db", "--host", "0.0.0.0"]
          ports:
            - containerPort: 5000
5. Zero-Downtime Updates (Rolling Strategy)
Kubernetes’ rolling update strategy ensures upgrades happen gradually:
strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0
When a new model is promoted in MLflow (or a new image is released), pods are updated one at a time while keeping the service fully available.
Closing Out Part 2
At this point, the core pipeline is in place:
- MLflow tracking server
- Experiment and model logging
- A consistent model registry
- FastAPI loading production models automatically
- GPU-backed Kubernetes deployment
- Zero-downtime updates via rolling releases
In Part 3, we’ll cover:
- Monitoring and prediction logging
- Drift detection
- Canary deployments
- Rolling updates with model-aware routing
- Automated model promotion
Part 3 completes the end-to-end workflow. After that, I’ll publish the separate GPU deep-dive.