Vibe Tooling: why I dropped ClickHouse for DuckDB
Over the past year, I’ve been refining how I work with vehicle data at CarHunch. My goal has been to keep things lean, version-controlled, and fully observable, without relying on a large infrastructure stack. Part of that journey involved moving from ClickHouse to DuckDB, and I wanted to share why that switch fits my workflow – and show it in action too.
Why DuckDB fits my mindset
ClickHouse is incredible at aggregating massive datasets quickly, but it has a quirk: what you read back is not always immediately what you wrote. Data can appear missing or partial until background merges and materialized views catch up. The database converges eventually, but that eventual consistency can make it tricky to be certain you've got exactly what you expect at any given moment.
To be clear: ClickHouse is a beast for massive, streaming data and high-throughput ingest. For CarHunch's specific scale it was overkill, and the operational simplicity of DuckDB won out. To answer the "how big is the dataset?" question: we're at roughly 50GB of Parquet – about 136 million vehicles, 800 million-plus MOT tests, and 1.8 billion defect records. That's serious data, but it's not "spin up a cluster" territory; DuckDB's process-local, single-machine model handles it without any custom memory tuning.
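Figures like those are easy to sanity-check, because a row count straight off a Parquet file needs nothing more than DuckDB itself. A minimal sketch, assuming one row per vehicle in the canonical file that the query example further down also uses (the path reflects my layout, not a requirement):

-- Quick scale check straight off the Parquet file; no server needed
SELECT count(*) AS vehicles
FROM read_parquet('parquet/canonical/vehicle_mot_profile.parquet');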
For me, DuckDB aligns better with a DevOps-style, reproducible workflow. Parquet files are first-class citizens: I can snapshot, roll back, or branch datasets the same way I would code. There's no DB server – that took some mental adjusting – just DuckDB and flat files, with no clusters or replicas to manage. When I query a Parquet file I know exactly what I'm reading, and I can script everything locally, test in CI, or run on a laptop against the same data as production, absolutely certain it's identical.
In practice, version control means the Parquet files live in a versioned directory (or alongside the repo); DuckDB reads them directly and the same SQL runs everywhere. For example:
Reading from versioned Parquet – same query in dev or CI
-- Parquet path is in config or env; no server, no migrations
SELECT * FROM read_parquet('parquet/canonical/vehicle_mot_profile.parquet')
WHERE registration = 'AB12CDE';
For my workflow, this simplicity and reproducibility are more valuable – and a much better fit – than ClickHouse’s massive scale.
CarHunch demo: DuckDB in action
I’ve recorded a live demo to show how CarHunch works with DuckDB; here’s a summary of what you actually see:
- Start the CarHunch backend locally. You see a message confirming the backend is ready.
- Pre-warm the cache for a few sample registrations: KY17CYS, OIG5527, GD16ZXO. Each registration lookup hits DuckDB and stores the resulting data in memory.
- Look up a registration (SM09KXX). This uses the pre-warmed cache, so the lookup is extremely fast (<15ms).
- Display the JSON response for that registration, including MOT history, defects, tax status, and metadata.
- Show backend logs with detailed timings, including per-query performance, cache hits, and DuckDB reading Parquet files (see the profiling sketch just below).
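The per-query timings in those logs come from the application, but DuckDB can profile a cold lookup on its own. A minimal sketch using the demo registration and my canonical file – the plan and timings will obviously differ on other machines:

-- Profile a cold lookup: shows the Parquet scan and per-operator timings
EXPLAIN ANALYZE
SELECT *
FROM read_parquet('parquet/canonical/vehicle_mot_profile.parquet')
WHERE registration = 'SM09KXX';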
Watch the demo here:
A note on realism: the demo uses a pre-warmed cache to show fast responses. In normal CarHunch usage, searching for a fresh registration takes roughly 1 second as DuckDB reads from Parquet and loads the vehicle's data; once that's loaded, related features – Comparisons to similar cars and the AI-generated Hunches report – use cached data and respond very quickly.
This setup demonstrates both cold-query performance and the speed benefits of caching, while remaining realistic about what users experience.
It also shows that all of the data comes from just that one Parquet file – there is no other database or cluster behind it.
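If you want to convince yourself of that, the file itself is fully inspectable from DuckDB. A small sketch, again using my canonical path (the columns you see will depend on your own schema):

-- Inspect the single Parquet file everything is served from
DESCRIBE SELECT * FROM read_parquet('parquet/canonical/vehicle_mot_profile.parquet');

-- Row-group and column statistics, straight from the file's footer
SELECT * FROM parquet_metadata('parquet/canonical/vehicle_mot_profile.parquet');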
Version-controlled analytics: my Vibe Tooling approach
Vibe Tooling isn’t about the highest theoretical throughput; it’s more about the lowest cognitive overhead. The idea is to keep your analytics tools in line with your personal workflow, not the constraints of traditional infra-heavy setups. Everything is scriptable and reproducible: DuckDB + Parquet lets me treat data like code, precompute or cache selectively, and benchmark in real time. There’s no cluster, no external database servers, no “mystery merges” like ClickHouse can produce in the background. I can experiment, roll back, and iterate quickly, which keeps development fast and friction low.
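"Precompute or cache selectively" in practice just means writing derived Parquet files that get versioned like any other artifact. Here's a minimal sketch of the pattern – the derived path and the aggregation are illustrative, not CarHunch's actual precompute step:

-- Materialise a derived snapshot as its own Parquet file;
-- the output can be committed, diffed, and rolled back like code
COPY (
    SELECT registration, count(*) AS record_count
    FROM read_parquet('parquet/canonical/vehicle_mot_profile.parquet')
    GROUP BY registration
) TO 'parquet/derived/registration_counts.parquet' (FORMAT PARQUET);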
In short, the move to DuckDB isn't really about raw performance; it's about aligning tools with my workflow and making data exploration and analytics feel natural.
Wrapping up
Moving from ClickHouse to DuckDB was a conscious trade-off: I gave up some of ClickHouse’s impressive scale and power for predictability, reproducibility, and simplicity.
CarHunch now runs entirely on Parquet + DuckDB: no DB server, no cluster, no infra team – just fast, version-controlled queries, cached computations, and reliable AI Hunches.
For anyone looking to keep their data stack lean, portable, and reproducible – where the goal is lowest cognitive overhead, not highest theoretical throughput – I think this approach is worth considering.