3 Comments
cch

Super interested in hearing more from their team in the future about the part "rethinking the separation between indexing and timeseries storage in our metrics platform" they mentioned in their blog.

Emy Rose

Keeping track of customer reviews everywhere can get overwhelming fast. Using HiFive Star helps you monitor and respond to feedback all in one place. It usually leads to smoother reputation management without the usual hassle.

Manan Chopra

Great deep dive into Monocle and Datadog's architecture! After analyzing the design further, I wanted to add some points that weren't explicitly covered but are worth discussing:

Query Flow Between Index DB and RTDB

The article mentions the separation of metadata (Index DB) and time series data (RTDB), but doesn't detail how they coordinate. The Index DB resolves human-readable tag queries (like service:api, region:us-east) into a list of tag hashes. These hashes then become direct lookups in RTDB. A single query can return thousands of hashes if you're querying with wildcards or partial tag matches — each unique tag combination is its own series with its own hash.
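The two-phase flow can be sketched with toy in-memory stand-ins. Everything here (the `IndexDB`/`RTDB` class names, dict-backed storage) is illustrative only, not Datadog's actual API; it just shows tags resolving to hashes, then hashes becoming direct lookups:

```python
class IndexDB:
    """Toy metadata store: maps series hash -> tag set."""
    def __init__(self, series):
        self.series = series  # {series_hash: {tag_key: tag_value}}

    def resolve(self, filters):
        # Return every series hash whose tags match all filters.
        # With wildcards/partial matches this list can be huge.
        return [h for h, tags in self.series.items()
                if all(tags.get(k) == v for k, v in filters.items())]

class RTDB:
    """Toy time series store: series hash is the direct key."""
    def __init__(self, points):
        self.points = points  # {series_hash: [(timestamp, value), ...]}

    def read(self, h, start, end):
        return [(t, v) for t, v in self.points.get(h, [])
                if start <= t <= end]

def query_metrics(index_db, rtdb, filters, start, end):
    hashes = index_db.resolve(filters)                    # phase 1: tags -> hashes
    return {h: rtdb.read(h, start, end) for h in hashes}  # phase 2: hash -> points
```

The point of the split: the expensive tag-matching work happens once in the Index DB, and the hot path into RTDB is pure key lookups.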

Hash-Based Worker Partitioning

The "thread-per-core" model works because partitioning isn't random. Each worker owns a deterministic set of hashes (likely hash % num_workers). Kafka partitions align with this, so each worker only reads from partitions containing data it owns. This is why there's no coordination needed — the routing is baked into the hash itself.

Kafka as WAL: Failure Scenarios

The article mentions Kafka as the WAL, but what happens during extended outages? Key considerations:

- Kafka's retention policy (time- or size-based) determines how long messages survive. If RTDB is down longer than the retention window, data loss occurs.
- Consumer offset commits likely happen after memtable writes, enabling replay on crash.
- Back-pressure can propagate to the Metrics Edge, which may need to shed load rather than overwhelm Kafka.
- Dead letter queues likely handle poison messages to prevent blocking.
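Two of those points (commit-after-write and dead-lettering) combine into one consumer-loop pattern. This is a hypothetical sketch with toy classes, not Datadog's consumer code; the key ordering is that the offset advances only after the memtable write is attempted, so a crash mid-batch replays rather than loses data:

```python
class PoisonMessage(Exception):
    pass

class Memtable:
    def __init__(self):
        self.data = {}
    def write(self, key, value):
        if value is None:               # stand-in for an unparseable payload
            raise PoisonMessage(key)
        self.data[key] = value

def consume(messages, memtable, dlq, committed_offset):
    # messages: [(offset, key, value), ...] replayed from committed_offset.
    for offset, key, value in messages:
        try:
            memtable.write(key, value)
        except PoisonMessage:
            dlq.append((offset, key, value))  # park it; don't block the partition
        committed_offset = offset + 1         # commit only after handling,
                                              # so a crash replays unhandled data
    return committed_offset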

Why Not Elasticsearch?

The choice to build custom storage makes sense when you consider Elasticsearch's overhead: inverted index updates on every write, no time-series-specific compression (delta encoding, Gorilla compression), and query patterns optimised for search/ranking rather than time-range scans. At billions of points per second, that overhead becomes untenable.
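Delta encoding, the simplest of the techniques named above, shows why time-series-native storage wins: timestamps at a fixed scrape interval collapse into a constant stream of tiny deltas (Gorilla then delta-encodes the deltas and bit-packs them). A generic inverted index gets none of this. A minimal sketch:

```python
def delta_encode(timestamps):
    # First value stored raw; the rest as differences. Regularly
    # scraped metrics yield a constant, highly compressible stream.
    return [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]

def delta_decode(deltas):
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)   # running sum restores the original values
    return out
```

A 10-second scrape interval turns, say, `[1000, 1010, 1020, 1030]` into `[1000, 10, 10, 10]`; repeated small values like these are what make the subsequent bit-packing so effective.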

Would love to hear from the Datadog team about their offset commit strategy and how they handle the Index DB + RTDB merge they mentioned in future plans!
