Advanced Risk Management for Freqtrade: Integrating Real-Time Market Awareness

Freqtrade is a popular open-source cryptocurrency trading bot framework. It gives you solid tools for strategy development, backtesting, and live automated trading – and it does a very good job of evaluating individual trade entries and exits.

I’ve used Freqtrade extensively, both for testing ideas and for running strategies live.

One thing I kept running into, though, was that while Freqtrade is very good at answering one question:

“Is this a valid entry signal right now?”

…it doesn’t really answer a different, higher-level one:

“Is this a market worth trading at all right now?”

If you’ve run Freqtrade strategies with real money, you’ve probably seen the same pattern: strategies that look perfectly reasonable in backtests, with sensible entry logic and risk controls, can still bleed during periods of high volatility, regime shifts, or market-wide panic – even when individual entries look fine in isolation.

That gap is what led me to experiment with a separate market-level risk layer, which eventually became Remora.

Rather than changing strategy logic or adding yet another indicator, Remora sits outside the strategy and provides a market-level risk assessment – answering whether current conditions are historically safe or risky to trade, regardless of what your entry signals are doing.

Importantly, this is an additive layer – your strategy logic and entry signals remain unchanged.

This article walks through how that works, how to integrate it into Freqtrade safely, and how to validate its impact using reproducible backtests.

TL;DR: This article shows how to add real-time market risk filtering to Freqtrade using Remora, a small standalone microservice that aggregates volatility, regime, sentiment, and macro signals. Integration is fail-safe, transparent, and requires only a minimal change to your strategy code.

What This Article Covers

  • Why market regime risk matters for Freqtrade strategies
  • What Remora does (at a high level)
  • How to integrate it safely without breaking your strategy
  • How to validate its impact using reproducible backtests

Who This Is For (And Who It Isn’t)

This is likely useful if you:

  • Run live Freqtrade bots with real capital
  • Care about drawdowns and regime risk, not just backtest curves
  • Want a fail-safe, auditable risk layer
  • Prefer transparent systems over black-box signals

This is probably not useful if you:

  • Want a plug-and-play “buy/sell” signal
  • Optimise single backtests rather than live behaviour
  • Expect risk filters to magically fix bad strategies

Part 1: The Missing Layer in Most Freqtrade Strategies

Market conditions aren’t always tradable. Periods of extreme volatility, panic regimes, bear markets, and negative sentiment cascades can turn otherwise solid strategies into consistent losers – even when individual entries look fine in isolation.

Typical Freqtrade risk controls (position sizing, stop-losses, portfolio exposure) protect individual trades, but they don’t address market regime risk – the question of whether current market conditions are fundamentally safe to trade in at all.

Part 2: Remora – Market-Wide Risk as a Service

Remora is a standalone market-risk engine designed to sit outside your strategy logic.

Instead of changing how your strategy finds entries, Remora answers one question:

“Are current market conditions safe to trade?”

Results at a Glance (Why This Matters)

Before diving into implementation details, it’s useful to see what this approach looks like in practice.

Across 6 years of data (2020-2025), 4 different strategies, and 20 backtests:

  • 90% of tests improved performance (18 out of 20)
  • +1.54% average profit improvement
  • +1.55% average drawdown reduction
  • 4.3% of trades filtered (adaptive – increases to 16-19% during bear markets)
  • Strongest impact during bear markets (2022 saw 16-19% filtering during crashes)

All results are fully reproducible using an open-source backtesting framework (details later).

Core Design Principles

  • Fail-open by default: If Remora is unavailable, your bot continues trading normally.
  • Transparent decisions: Every response includes human-readable reasoning.
  • Multi-source aggregation: Dozens of signals with redundancy and failover.
  • Low-latency: Designed for synchronous use inside live trading loops.
  • No lock-in: Simple HTTP API. Remove it by deleting a few lines of code.

Data Aggregation Strategy (High-Level)

Rather than relying on a single indicator, Remora combines multiple signal classes:

Technical & Market Structure:

  • Volatility metrics (realised, model-based)
  • Momentum indicators
  • Regime classification (bull / bear / choppy / panic)
  • Volume and market structure signals

Sentiment & Macro:

  • News sentiment (multi-source)
  • Fear & Greed Index
  • Funding rates and liquidations
  • BTC dominance
  • Macro correlations (e.g. VIX, DXY)

Each signal type has multiple providers. If one source fails or becomes stale, others continue supplying data.

The output is:

  • safe_to_trade (boolean)
  • risk_score (0-1)
  • market regime
  • volatility metrics
  • clear textual reasoning

Part 3: Freqtrade Integration (Minimal & Reversible)

Integration uses Freqtrade’s confirm_trade_entry hook.

You do not modify your strategy’s entry logic – you simply gate entries at the final step.

Step-by-Step Integration

Here’s exactly what to add to your existing Freqtrade strategy. Every new line is marked with a # REMORA: comment so you can tell the integration code apart from your existing strategy code.

Step 0: Set Your API Key

Before running your strategy, set the environment variable:

export REMORA_API_KEY="your-api-key-here"

Get your free API key at remora-ai.com/signup.php

Step 1: Add Remora to Your Strategy

Insert the lines marked with # REMORA: into your existing strategy file exactly as shown:

class MyStrategy(IStrategy):
    # ----- EXISTING STRATEGY LOGIC -----
    def populate_entry_trend(self, dataframe: DataFrame, metadata: dict) -> DataFrame:
        pair = metadata['pair']
        
        # Your existing entry conditions...
        # dataframe.loc[:, 'enter_long'] = 1  # example existing logic

        # ----- REMORA CHECK -----
        if not self.confirm_trade_entry(pair):
            dataframe.loc[:, 'enter_long'] = 0  # REMORA: Skip high-risk trades

        return dataframe

    # ----- ADD THIS NEW METHOD -----
    def confirm_trade_entry(self, pair: str, **kwargs) -> bool:
        import os
        import requests
        api_key = os.getenv("REMORA_API_KEY")
        headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}
        
        try:
            r = requests.get(
                "https://remora-ai.com/api/v1/risk",
                params={"pair": pair},
                headers=headers,
                timeout=2.0
            )
            return r.json().get("safe_to_trade", True)  # REMORA: Block entry if market is high-risk
        except Exception:
            return True  # REMORA: Fail-open

Integration Notes:

  • Inside your existing populate_entry_trend(), insert the Remora check just before return dataframe.
  • After that, add the confirm_trade_entry() method at the same indentation level as your other strategy methods.
  • All Remora comments are prefixed with # REMORA: so you can easily identify or remove them later.
  • Everything else in your strategy stays unchanged.

Removing Remora is as simple as deleting these lines. No lock-in, fully transparent.

Pair-Specific vs Market-Wide Risk

You can query Remora in two modes:

Pair-specific:

params={"pair": "BTC/USDT"}

Market-wide (global trade gating):

# No pair parameter

Many users start with market-wide gating to reduce API calls and complexity.
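
If you want a market-wide check, the call is almost identical – same endpoint, just no pair parameter. Here's a minimal sketch (endpoint, field names, and fail-open behaviour taken from the integration example above):

import os
import requests

def market_is_safe() -> bool:
    # Market-wide risk check: same endpoint as above, just no "pair" parameter.
    api_key = os.getenv("REMORA_API_KEY")
    headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}
    try:
        r = requests.get(
            "https://remora-ai.com/api/v1/risk",
            headers=headers,
            timeout=2.0,
        )
        return r.json().get("safe_to_trade", True)
    except Exception:
        return True  # fail-open, exactly as in the pair-specific version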

What the API Returns

{
  "safe_to_trade": false,
  "risk_score": 0.75,
  "regime": "bear",
  "volatility": 0.68,
  "reasoning": [
    "High volatility detected",
    "Bear market regime identified",
    "Fear & Greed Index: Extreme Fear",
    "Negative news sentiment"
  ]
}

This allows debugging blocked trades, auditing decisions, custom logic layered on top, and strategy-specific thresholds.
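
For example, because risk_score and regime come back alongside safe_to_trade, you can layer your own, stricter rules on top of the boolean. A rough sketch – the field names match the response above, while the 0.6 threshold and the panic-regime rule are arbitrary examples:

def entry_allowed(risk: dict, max_risk: float = 0.6) -> bool:
    # `risk` is the parsed JSON response shown above.
    if not risk.get("safe_to_trade", True):
        return False  # always respect the global gate
    if risk.get("regime") == "panic":
        return False  # example rule: never enter during panic regimes
    return risk.get("risk_score", 0.0) <= max_risk  # stricter personal threshold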

Part 4: Backtesting & Validation (Reproducible)

Live APIs don’t work in historical backtests – so Remora includes an open-source backtesting framework that reconstructs historical risk signals using the same logic as production.

Repository: github.com/DonaldSimpson/remora-backtests

What It Provides

  • Historical signal reconstruction
  • Baseline vs Remora-filtered comparisons
  • Multiple strategy types
  • Consistent metrics and visualisations

What It Shows

  • Improvements are not strategy-specific
  • Filtering increases during crashes
  • Small trade suppression can meaningfully reduce drawdowns
  • Performance gains come from avoiding bad periods, not over-trading

You’re encouraged to run this yourself and independently verify the impact on your own strategies.

Here’s what comprehensive backtesting across 6 years (2020-2025), 4 different strategies, and 20 test cases shows:

Overall Performance Improvements

  • Average Profit Improvement: +1.54% (18 out of 20 tests improved – 90% success rate)
  • Average Drawdown Reduction: +1.55% (18 out of 20 tests improved)
  • Trades Filtered: 4.3% (2,239 out of 51,941 total trades)
  • Best Strategy Improvement: +3.20% (BollingerBreakout strategy)
  • Most Effective Period: 2022 bear market (16-19% filtering during crashes)

Financial Impact by Account Size

Based on average improvements, here’s the financial benefit on different account sizes:

  • $10,000 account: +$154.25 additional profit, +$154.70 reduced losses – $308.95 total benefit
  • $50,000 account: +$771.25 additional profit, +$773.50 reduced losses – $1,544.75 total benefit
  • $100,000 account: +$1,542.50 additional profit, +$1,547.00 reduced losses – $3,089.50 total benefit
  • $500,000 account: +$7,712.50 additional profit, +$7,735.00 reduced losses – $15,447.50 total benefit
  • $1,000,000 account: +$15,425.00 additional profit, +$15,470.00 reduced losses – $30,895.00 total benefit

What These Numbers Mean

  • 4.3% Trade Filtering: Remora prevents trades during dangerous market periods. This is adaptive – during the 2022 bear market, filtering increased to 16-19%, showing Remora becomes more protective when markets are most dangerous.
  • +1.54% Profit Improvement: By avoiding bad trades during high-risk periods, strategies show consistent profit improvements. 90% of tests (18 out of 20) showed improvement.
  • +1.55% Drawdown Reduction: Less maximum loss during unfavorable periods. This is critical for risk management and capital preservation.
  • Best During Crashes: Remora is most effective during bear markets and crashes (2022 showed 16-19% filtering), exactly when you need protection most.

Part 5: Production & Advanced Use

Always fail-open:

except requests.Timeout:
    return True

Log decisions:

logger.info(
    f"Remora: safe={safe}, risk={risk_score}, regime={regime}"
)

Reduce API load:

  • Cache responses (e.g. 30s) – see the sketch after this list
  • Use market-wide checks
  • Upgrade tier only if needed
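
For the caching point, one simple approach is a module-level cache with a short TTL inside the strategy, so repeated candle refreshes reuse the last verdict. A minimal sketch using the 30-second TTL suggested above (auth headers omitted for brevity):

import time
import requests

_cache = {"ts": 0.0, "safe": True}  # module-level cache shared across calls

def cached_market_safe(ttl: float = 30.0) -> bool:
    # Return the last Remora verdict if it is younger than `ttl` seconds.
    now = time.time()
    if now - _cache["ts"] < ttl:
        return _cache["safe"]
    try:
        r = requests.get("https://remora-ai.com/api/v1/risk", timeout=2.0)
        _cache["safe"] = r.json().get("safe_to_trade", True)
    except Exception:
        _cache["safe"] = True  # fail-open
    _cache["ts"] = now
    return _cache["safe"]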

Advanced Uses (Optional)

  • Dynamic position sizing based on risk_score (sketched after this list)
  • Strategy-specific risk thresholds
  • Regime-based strategy switching
  • Trade blocking during macro stress events

These are additive – not required to get value.
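
As an illustration of the first item, Freqtrade's custom_stake_amount callback is a natural place to scale position size by risk_score. A rough sketch, assuming a hypothetical self.get_risk() helper that returns the parsed /risk response; the trailing callback parameters are folded into **kwargs and the linear scaling rule is arbitrary:

    # Add inside your strategy class.
    def custom_stake_amount(self, pair: str, current_time, current_rate: float,
                            proposed_stake: float, min_stake, max_stake: float,
                            **kwargs) -> float:
        risk = self.get_risk()  # hypothetical helper wrapping the /risk call
        risk_score = risk.get("risk_score", 0.0) if risk else 0.0
        # Scale the stake down linearly as market risk rises (arbitrary rule).
        scaled = proposed_stake * max(0.25, 1.0 - risk_score)
        return max(min_stake or 0.0, min(scaled, max_stake))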

Part 6: Technical Implementation Details

Data Pipeline Architecture

Remora’s data pipeline follows a producer-consumer pattern:

  1. Data Collection: Multiple scheduled tasks fetch data from various sources (Binance API, CoinGecko, news APIs, etc.)
  2. Data Storage: Raw data stored in ClickHouse time-series database
  3. Materialized Views: ClickHouse materialized views pre-aggregate data for fast queries
  4. Risk Calculation: Python service calculates risk scores using aggregated data
  5. Caching: Redis caches risk assessments to reduce database load (see the sketch after this list)
  6. API Layer: FastAPI serves risk assessments via REST API
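
A sketch of what step 5 might look like: cache each computed assessment in Redis under a short TTL so repeated API requests don't hit ClickHouse. The key scheme, 30-second TTL, and compute_risk() helper are illustrative, not Remora's actual internals:

import json
import redis  # requires the `redis` package

cache = redis.Redis(host="redis", port=6379, db=0)

def get_risk_assessment(pair: str) -> dict:
    key = f"risk:{pair}"                          # illustrative key scheme
    cached = cache.get(key)
    if cached:
        return json.loads(cached)                 # serve from cache
    assessment = compute_risk(pair)               # hypothetical risk-calculation step
    cache.setex(key, 30, json.dumps(assessment))  # 30s TTL keeps results fresh
    return assessment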

ClickHouse Materialized Views

ClickHouse materialized views enable real-time aggregation without query-time computation overhead:

CREATE MATERIALIZED VIEW volatility_1h_mv
ENGINE = AggregatingMergeTree()
ORDER BY (pair, timestamp_hour)
AS SELECT
    pair,
    toStartOfHour(timestamp) as timestamp_hour,
    avgState(price) as avg_price,
    stddevSampState(price) as volatility
FROM raw_trades
GROUP BY pair, timestamp_hour;

This allows Remora to provide real-time risk assessments with minimal latency, even when processing millions of data points.
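
To consume a view like this, the aggregate states are finalised with the matching -Merge functions at query time. A sketch of how a risk service might read it using the clickhouse-driver client (host, time window, and pair are assumptions):

from clickhouse_driver import Client  # requires the clickhouse-driver package

client = Client(host="clickhouse")

rows = client.execute(
    """
    SELECT
        pair,
        timestamp_hour,
        avgMerge(avg_price)         AS avg_price,
        stddevSampMerge(volatility) AS volatility
    FROM volatility_1h_mv
    WHERE pair = %(pair)s AND timestamp_hour >= now() - INTERVAL 24 HOUR
    GROUP BY pair, timestamp_hour
    ORDER BY timestamp_hour
    """,
    {"pair": "BTC/USDT"},
)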

Failover & Redundancy

Each data source has multiple providers with automatic failover. This ensures reliable risk assessments even if individual data sources experience outages or rate limiting.

def get_fear_greed_index():
    """
    Fetch Fear & Greed Index with multi-provider failover.
    Tries multiple sources until one succeeds.
    """
    providers = [
        fetch_from_alternative_me,
        fetch_from_coinmarketcap,
        fetch_from_custom_source,
        fetch_from_backup_provider_1,
        fetch_from_backup_provider_2,
        # … additional providers for redundancy
    ]

    # Try each provider until one succeeds
    for provider in providers:
        try:
            data = provider()
            if data and is_valid(data):
                return data
        except Exception:
            continue

    # If all providers fail, return None
    # The risk calculator handles missing data gracefully
    return None

This multi-provider approach ensures:

  • High Availability: If one provider fails, others continue providing data
  • Rate Limit Resilience: Multiple providers mean you’re not dependent on a single API’s rate limits
  • Data Quality: Can validate data across providers and choose the most reliable source
  • Graceful Degradation: If all providers for one signal type fail, the risk calculator continues using other available signals (volatility, regime, sentiment, etc.)

In Remora’s implementation, each signal type (Fear & Greed, news sentiment, funding rates, etc.) has multiple providers. If one data source is unavailable, others continue providing information, ensuring the system maintains reliable risk assessments even during external API outages.
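
One common way to implement that graceful degradation is a weighted score over whichever signals are present, renormalising the weights when some are missing. A minimal sketch – the weights and signal names here are illustrative, not Remora's actual model:

def aggregate_risk(signals: dict) -> float:
    # `signals` maps signal name -> normalised value in [0, 1], or None if unavailable.
    weights = {"volatility": 0.4, "regime": 0.3, "sentiment": 0.2, "fear_greed": 0.1}
    usable = {k: v for k, v in signals.items() if k in weights and v is not None}
    if not usable:
        return 0.5  # no data at all: fall back to a neutral score
    total_weight = sum(weights[k] for k in usable)
    return sum(weights[k] * v for k, v in usable.items()) / total_weight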

Security & Best Practices

  • API Key Management: Store API keys in environment variables, never in code
  • HTTPS Only: Always use HTTPS for API calls (Remora enforces this)
  • Rate Limiting: Respect rate limits to avoid service disruption
  • Monitoring: Monitor Remora API response times and error rates
  • Fail-Open: Always implement fail-open behaviour – never let Remora block your entire trading system

API Access & Pricing

Remora offers a tiered API access structure designed to accommodate different use cases:

Unauthorized Access (Limited)

  • Rate Limit: 60 requests per minute
  • Use Case: Testing, development, low-frequency strategies
  • Cost: Free – no registration required
  • Limitations: Lower rate limits, no historical data access

Registered Users (Free Tier)

  • Rate Limit: 300 requests per minute (5x increase)
  • Use Case: Production trading, multiple strategies, higher-frequency bots
  • Cost: Free – registration required, no credit card needed
  • Benefits: Higher rate limits, faster response times, priority support

Pro Tier (Coming Soon)

  • Rate Limit: Custom limits based on needs
  • Use Case: Professional traders, institutions, high-frequency systems
  • Features:
    • Customizable risk thresholds and filtering rules
    • Advanced customization options
    • Historical data API access for backtesting
    • Dedicated support and SLA guarantees
    • White-label options
  • Status: Currently in development – contact for early access

Getting Started: Start with the free registered tier – it’s sufficient for most Freqtrade strategies. Upgrade to Pro when you need customization, higher limits, or advanced features.

Getting Started

To get started with Remora for your Freqtrade strategies:

  1. Get API Key: Sign up at remora-ai.com/signup.php (free, no credit card required). Registration gives you 5x higher rate limits (300 req/min vs 60 req/min).
  2. Set Environment Variable: export REMORA_API_KEY="your-api-key-here"
  3. Add Integration: Add the confirm_trade_entry method to your strategy (see color-coded code examples above)
  4. Test: Run a backtest or paper trade to verify integration
  5. Validate with Backtests: Use the remora-backtests repository to run your own strategy with and without Remora, independently verifying the impact
  6. Monitor: Review logs to see Remora’s risk assessments and reasoning

Conclusion

Market regime risk is one of the most common reasons profitable backtests fail live.

Remora adds a thin, transparent, fail-safe risk layer on top of Freqtrade that helps answer whether current market conditions are safe to trade in. It doesn’t replace your strategy – it protects it.

Beyond Freqtrade: While Remora is optimised for Freqtrade users, the same REST API integration pattern works with any trading bot or custom trading system that can make HTTP requests.

Ready to get started? Visit remora-ai.com to get your free API key and start protecting your Freqtrade strategies from high-risk market conditions.

About the Author: This article was written as part of building Remora, a production-grade market risk engine for algorithmic trading systems. The system is built using modern Python async frameworks (FastAPI), time-series databases (ClickHouse), and MLOps best practices for real-time data aggregation and risk assessment.

Have questions about integrating Remora with Freqtrade? Found this useful? I’d love to hear your feedback or see your integration examples. Feel free to reach out or share your experiences.



Monitoring, Drift Detection and Zero-Downtime Model Releases

Part 3 of 3: Production-grade monitoring, prediction logging, and safe deployment workflows.

Introduction

In Part 1 and Part 2, we built the core of the system: reproducible training, a proper model registry, and Kubernetes-backed deployments.

Now the focus shifts to what happens after a model goes live.

This post covers the production-side essentials:

  • Logging predictions for operational visibility
  • Detecting model drift
  • Canary deployments and safe rollout workflows
  • Automated model promotion
  • The real-world performance improvements

1. Logging Predictions for Monitoring

To understand how the system behaves in production, every prediction is logged – lightweight, structured, and tied back to model versions via MLflow.

Listing 1: Prediction logging to MLflow

import mlflow
import time

def log_prediction(text, latency, confidence):
    with mlflow.start_run(nested=True):
        mlflow.log_param("input_length", len(text))
        mlflow.log_metric("latency_ms", latency)
        mlflow.log_metric("confidence", confidence)
        mlflow.log_metric("timestamp", time.time())

This gives you enough data to build dashboards showing:

  • Latency trends
  • Throughput
  • Confidence drift
  • Input distribution changes
  • Model performance over time

Even simple plots can reveal early warning signs long before they become user-visible issues.
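
For context, here's roughly how Listing 1 plugs into the serving path from Part 2: time the prediction and log it alongside a confidence value. The /predict handler and the placeholder confidence are assumptions to show the wiring, not the production code:

import time

@app.post("/predict")
def predict(text: str):
    start = time.perf_counter()
    embedding = ModelCache.get().predict([text])   # model loading as in Part 2
    latency_ms = (time.perf_counter() - start) * 1000
    confidence = 1.0  # placeholder – derive from your downstream scorer if you have one
    log_prediction(text, latency_ms, confidence)   # Listing 1
    return {"embedding": embedding.tolist(), "latency_ms": latency_ms}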

2. Drift Detection Script

A basic example of analysing logged metrics for unusual changes:

Listing 2: Model drift detection

import numpy as np
from mlflow.tracking import MlflowClient

def detect_drift():
    client = MlflowClient()

    runs = client.search_runs(
        experiment_ids=["0"],
        filter_string="metrics.latency_ms > 0",
        max_results=500
    )

    latencies = [r.data.metrics["latency_ms"] for r in runs]
    confs = [r.data.metrics["confidence"] for r in runs]

    # Thresholds are illustrative; alert() stands in for whatever notification hook you use
    if np.mean(latencies) > 120:
        alert("Latency drift detected")

    if np.mean(confs) < 0.75:
        alert("Confidence drift detected")

You can plug in more advanced statistical tests later (KL divergence, embedding space drift, or decayed moving averages).
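
As one example of a more advanced test, you could compare a reference confidence distribution against a recent window using KL divergence via scipy. The bin edges and threshold below are arbitrary starting points:

import numpy as np
from scipy.stats import entropy

def kl_drift(reference: np.ndarray, recent: np.ndarray, threshold: float = 0.1) -> bool:
    # Flag drift when KL(recent || reference), measured over histogram bins, exceeds the threshold.
    bins = np.linspace(0.0, 1.0, 21)   # confidence values live in [0, 1]
    p, _ = np.histogram(recent, bins=bins, density=True)
    q, _ = np.histogram(reference, bins=bins, density=True)
    p, q = p + 1e-9, q + 1e-9          # avoid log(0) / division by zero
    return entropy(p, q) > threshold   # scipy's entropy(p, q) is KL divergence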

3. Canary Deployment (10% Traffic)

A canary deployment lets you test the new model under real load before promoting it fully.

Versioned pods:

Listing 3: Canary deployment configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: carhunch-api-v2
spec:
  replicas: 1
  template:
    metadata:
      labels:
        app: carhunch-api
        version: "v2"

The service routes traffic to both versions:

selector:
  app: carhunch-api

With 1 replica of v2 and (for example) 9 replicas of v1, the canary receives roughly 10% of requests.

Kubernetes handles the balancing naturally.

4. Automated Promotion Script

A simple automated workflow to move models through Staging → Canary → Production:

Listing 4: Automated model promotion workflow

import subprocess
import time

from mlflow.tracking import MlflowClient

client = MlflowClient()

def promote_model(version):
    # 1. Move to staging
    client.transition_model_version_stage(
        "MiniLM-Defect-Predictor",
        version,
        "Staging"
    )

    # 2. Deploy canary
    subprocess.run(["kubectl", "scale", "deployment/carhunch-api-v2", "--replicas=1"])

    # 3. Wait and collect metrics
    time.sleep(3600)

    # ...evaluate metrics here...

    # 4. Promote to production if everything looks good
    client.transition_model_version_stage(
        "MiniLM-Defect-Predictor",
        version,
        "Production"
    )

This keeps the deployment pipeline simple but still safe:

  • No big-bang releases
  • Measurable confidence before promotion
  • Fully automated transitions if desired
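
For the "...evaluate metrics here..." step, a gate along these lines could reuse the metrics logged in Listing 1 and the MlflowClient from Listing 4. The thresholds are illustrative, and in practice you would tag runs with the model version so canary traffic can be separated from baseline traffic:

def canary_healthy(max_latency_ms: float = 100.0, min_confidence: float = 0.8) -> bool:
    # Rough health gate for the canary window; thresholds are illustrative.
    runs = client.search_runs(
        experiment_ids=["0"],
        filter_string="metrics.latency_ms > 0",
        max_results=200,
    )
    if not runs:
        return False  # no traffic reached the canary – don't promote blindly
    latencies = [r.data.metrics["latency_ms"] for r in runs]
    confs = [r.data.metrics.get("confidence", 1.0) for r in runs]
    return (sum(latencies) / len(latencies) <= max_latency_ms
            and sum(confs) / len(confs) >= min_confidence)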

5. Performance Gains

  • Deployment downtime: 15–30 min → 0 min (100% improvement)
  • Inference latency: ~120ms → ~85ms (~29% faster)
  • Prediction cost: £500/mo → £5/mo (99% cheaper)
  • GPU stability: frequent memory leaks → stable (fully fixed)
  • Traceability: none → full MLflow registry

These improvements came primarily from:

  • Moving off external API calls
  • Running inference locally on a small GPU
  • Using MLflow for proper version tracking
  • Cleaner model lifecycle management

Final Closing: What's Next

With this final part complete, the full workflow now covers:

  • MLflow model registry and experiment tracking
  • FastAPI model serving
  • GPU-backed Kubernetes deployments
  • Prediction monitoring and drift detection
  • Canary releases and safe rollouts
  • Zero-downtime updates

There's one major topic left that deserves its own article:

Deep GPU + Kubernetes Optimisation

Memory fragmentation, batching strategies, GPU sharing, node feature discovery, device plugin tuning - the stuff that affects real-world performance far more than most people expect.

That full technical deep-dive is coming next.


Production-Grade Model Serving for Sentence Transformers

Part 2 of 3: A practical walk-through of model versioning, registry management, API serving, and GPU-backed Kubernetes deployment.

Introduction

In Part 1, I covered the motivations behind moving to a more structured MLOps setup.

This post focuses on how everything fits together: MLflow, the model registry, FastAPI, and Kubernetes.

The goal is simple: a predictable, reproducible way to train models, log them, promote them, and deploy them – all without downtime.

Everything shown here is based on the system I run in production.

1. Setting Up MLflow Tracking

MLflow acts as the central source of truth. Every experiment, configuration, and model version is logged there.

Python: Logging a training run

Listing 1: MLflow experiment tracking

import mlflow
import mlflow.pytorch
from sentence_transformers import SentenceTransformer

mlflow.set_tracking_uri("http://mlflow:5000")
mlflow.set_experiment("vehicle-defect-prediction")

with mlflow.start_run():
    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

    mlflow.log_param("embedding_dim", 384)
    mlflow.log_param("model_name", "MiniLM-L6-v2")

    mlflow.pytorch.log_model(
        model,
        "model",
        registered_model_name="MiniLM-Defect-Predictor"
    )

    mlflow.log_metric("inference_latency_ms", 85.3)
    mlflow.log_metric("gpu_memory_mb", 2048)

This gives you a full record of what was trained, how it was configured, and the resulting performance.

2. Model Registry and Versioning

Once the run is logged, you can register the model and promote versions through stages like Staging and Production.

Listing 2: Model versioning and stage transitions

from mlflow.tracking import MlflowClient

client = MlflowClient()

version = client.create_model_version(
    name="MiniLM-Defect-Predictor",
    source="runs://model",
    description="MiniLM model for defect prediction"
)

client.transition_model_version_stage(
    name="MiniLM-Defect-Predictor",
    version=version.version,
    stage="Staging"
)

Promoting to production is just another simple transition:

client.transition_model_version_stage(
    name="MiniLM-Defect-Predictor",
    version=version.version,
    stage="Production"
)

Once that happens, everything downstream – FastAPI, Kubernetes, monitoring – will pull the correct production version.

3. FastAPI: Loading the Production Model

FastAPI is the interface layer. Instead of bundling the model with the app, it loads the current production version directly from MLflow.

Listing 3: FastAPI model loading from MLflow registry

import mlflow.pyfunc
from fastapi import FastAPI

app = FastAPI()
MODEL_URI = "models:/MiniLM-Defect-Predictor/Production"

class ModelCache:
    _model = None

    @classmethod
    def get(cls):
        if cls._model is None:
            cls._model = mlflow.pyfunc.load_model(MODEL_URI)
        return cls._model

@app.post("/predict")
def predict(text: str):
    model = ModelCache.get()
    embedding = model.predict([text])
    return {"embedding": embedding.tolist()}

The model is loaded once per process and reused, which avoids repeated GPU initialisation.
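
A quick smoke test of the endpoint from Python (the host and port are assumptions – port 8001 matches the Kubernetes manifest below; text is sent as a query parameter because of how the handler above declares it):

import requests

resp = requests.post("http://localhost:8001/predict", params={"text": "worn brake pads"})
print(resp.status_code, len(resp.json()["embedding"]))  # expect 200 and the embedding payload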

4. Kubernetes Deployment (GPU + MLflow)

Below is a simplified version of what runs in production. This demonstrates GPU scheduling, environment injection, and readiness checks.

Inference Pod (FastAPI + GPU)

Listing 4: Kubernetes deployment for GPU-backed inference

apiVersion: apps/v1
kind: Deployment
metadata:
  name: carhunch-api
spec:
  replicas: 2
  selector:
    matchLabels: { app: carhunch-api }
  template:
    metadata:
      labels: { app: carhunch-api }
    spec:
      containers:
      - name: api
        image: ghcr.io/yourrepo/carhunch-api:latest
        env:
        - name: MLFLOW_MODEL_URI
          value: "models:/MiniLM-Defect-Predictor/Production"
        resources:
          requests:
            cpu: "1"
            memory: "4Gi"
            nvidia.com/gpu: "1"
          limits:
            cpu: "4"
            memory: "16Gi"
            nvidia.com/gpu: "1"
        ports:
        - containerPort: 8001
        readinessProbe:
          httpGet:
            path: /ready
            port: 8001

MLflow Tracking Server Deployment

For simplicity, this uses SQLite; in practice you can switch to PostgreSQL or MySQL easily.

Listing 5: MLflow tracking server deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow-tracking
spec:
  replicas: 1
  selector:
    matchLabels: { app: mlflow-tracking }
  template:
    metadata:
      labels: { app: mlflow-tracking }
    spec:
      containers:
      - name: mlflow
        image: ghcr.io/mlflow/mlflow:latest
        args: ["mlflow", "server", "--backend-store-uri", "sqlite:///mlflow.db"]
        ports:
        - containerPort: 5000

5. Zero-Downtime Updates (Rolling Strategy)

Kubernetes’ rolling update strategy ensures upgrades happen gradually:

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1
    maxUnavailable: 0

When a new model is promoted in MLflow (or a new image is released), pods are updated one at a time while keeping the service fully available.

Closing of Part 2

At this point, the core pipeline is in place:

  • MLflow tracking server
  • Experiment and model logging
  • A consistent model registry
  • FastAPI loading production models automatically
  • GPU-backed Kubernetes deployment
  • Zero-downtime updates via rolling releases

In Part 3, we’ll cover:

  • Monitoring and prediction logging
  • Drift detection
  • Canary deployments
  • Rolling updates with model-aware routing
  • Automated model promotion

Part 3 completes the end-to-end workflow. After that, I’ll publish the separate GPU deep-dive.


Serving Sentence Transformers in Production

Part 1 of 3 on how I moved a large-scale vehicle prediction system from “working but manual” to a clean, production-grade MLflow + Kubernetes setup.

Introduction: Converting a group of local experiments into a real service

I built a system to analyse MOT history at large scale: 1.7 billion defects and test records, 136 million vehicles, and over 800 million individual test entries.

The core of it was straightforward: generate 384-dimensional MiniLM embeddings and use them to spot patterns in vehicle defects.

Running it locally was completely fine. Running it as a long-lived service while managing GPU acceleration, reproducibility, versioning, and proper monitoring was the real challenge. Things worked ok, but it became clear that the system needed a more structured approach as traffic and data grew.

I kept notes on what I thought was going wrong and what I needed to improve:

  • I had no easy way to track which model version the API was currently serving
  • Updating the model meant downtime or manual steps
  • GPU utilisation wasn’t predictable and occasionally needed a restart
  • Monitoring and metrics were basic at best
  • There was no clean workflow for testing new models without risking disruption

All the normal growing pains you’d expect – the system worked, but it wasn’t something I wanted to maintain long-term in that shape!

That pushed me to formalise the workflow with a proper MLOps stack. This series walks through exactly how I transitioned the service to MLflow, Kubernetes, FastAPI, and GPU-backed deployments.

As a bonus, moving to local GPU inference brought my (rapidly growing) API charges down to just a few £ per month in hardware and electricity costs!

The MLOps Requirements

Before choosing tools, I wrote down what I actually needed:

1. Zero-downtime deployments

Rolling updates and safe testing of new models.

2. Real model versioning

A clear audit trail of what ran, when, and with what parameters.

3. Better visibility

Latency, throughput, GPU memory usage, embedding consistency.

4. Stable GPU serving

Avoid unnecessary fragmentation or reloading under load.

5. Performance and scale

  • 1,000+ predictions/sec
  • <100ms latency
  • Efficient single-GPU operation

6. Cost-effective inference

Run locally rather than paying per-request.

Why MLflow + Kubernetes?

MLflow gave me:

  • Experiment tracking
  • A proper model registry
  • Version transitions (Staging → Production)
  • Reproducibility
  • A single source of truth for what version is deployed

Kubernetes gave me:

  • Zero-downtime, repeatable deployments
  • GPU-aware scheduling
  • Horizontal scaling and health checks
  • Clean separation between environments
  • Automatic rollback if something misbehaves

FastAPI provided:

  • A lightweight, async inference layer
  • A clean boundary between model, API, and app logic

The Architecture (High-Level)

This post covers the initial problems, requirements, and overall direction.

Part 2 goes deep into MLflow, the registry, and Kubernetes deployment.

Part 3 focuses on monitoring, drift detection, canaries, and scaling.

I’ll also publish a dedicated GPU/Kubernetes deep-dive later – covering memory fragmentation, batching, device plugin configuration, GPU sharing, and more.

The Practical Issues I Wanted to Improve

These weren’t “critical failures”, just things that become annoying or risky at scale:

1. Knowing which model version is running

Without a registry, it was easy to lose track.

2. Manual deployment steps

Fine for experiments, less so for a live service.

3. Occasional GPU memory quirks

SentenceTransformers sometimes leaves memory allocated longer than ideal.

4. Limited monitoring

I wanted clearer insight into latency, drift, and GPU usage.

5. No safe model testing workflow

I needed a way to expose just a slice of traffic to new models.

What the Final System Achieved

  • 99.9% uptime
  • Zero-downtime model updates
  • ~50% latency improvement
  • Stable GPU utilisation
  • Full visibility into predictions
  • Drift detection and alerting
  • ClickHouse scale for billions of rows
  • Running cost around £5/month

That’s about it for Part 1

In Part 2, I’ll show the exact MLflow and Kubernetes setup:

  • How experiments are logged
  • How the model registry is structured
  • How the API automatically loads the current Production model
  • Kubernetes deployment manifests
  • GPU-backed pods and health checks
  • How rolling updates actually work

Then Part 3 covers:

  • Monitoring every prediction
  • Drift detection
  • Canary deployments
  • Rolling updates
  • Automated model promotion

And the GPU deep-dive will follow as a separate post.