In this article, we will break down this migration, the challenges the engineering team faced, and how they pulled it off successfully.
Migrating petabyte-scale data while staying live is basically performing open-heart surgery while the patient is running a marathon. I've always been skeptical of "zero-downtime" claims because the reality is usually a mess of edge cases and TTL headaches. It actually reminds me of the physical version of this—trying to do a structural overhaul on a building while people are still living in it. We’ve been looking at some of the project logs over at https://qualityrenovation.com just to see how they handle the sequencing of high-stakes onsite work without killing the "uptime" for the residents. There’s a weirdly similar logic between managing a physical job site and managing a data migration; if your staging isn’t perfect, the whole thing collapses the moment you cut over. Does Reddit have a public post-mortem on the specific consistency issues they hit during the final sync?
This migration becomes much simpler with Cluster Linking if you decide to move to Confluent Kafka. A precondition is that Kafka is being used with a small retention period.
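A rough sketch of why the small-retention precondition matters: the data a cluster link has to backfill is bounded by ingest rate times the retention window. The numbers below are hypothetical, purely for illustration.

```python
# Back-of-the-envelope: upper bound on data a cluster link must backfill.
# All figures are hypothetical, purely illustrative.

def backlog_bytes(ingest_bytes_per_sec: float, retention_secs: float) -> float:
    """Everything written within the retention window is the most
    the link can be asked to mirror during the initial sync."""
    return ingest_bytes_per_sec * retention_secs

HOUR = 3600
# A cluster ingesting 100 MB/s with 6 hours of retention...
small = backlog_bytes(100e6, 6 * HOUR)       # ~2.2 TB to mirror
# ...versus the same cluster keeping 7 days of data.
large = backlog_bytes(100e6, 7 * 24 * HOUR)  # ~60 TB to mirror

print(f"6h retention: {small / 1e12:.2f} TB")
print(f"7d retention: {large / 1e12:.2f} TB")
```

Same throughput, roughly 28x less data to move before cutover, which is why a short retention window makes the link-based approach tractable.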
“Reddit’s migration is a good example of how large-scale infrastructure changes do not have to be dramatic, high-risk events.”
Really great takeaway for me. Love seeing how these dreaded events can be handled so well. Nice post, thanks!
This migration reminds me of the "strangler pattern".
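For readers unfamiliar with the pattern: the idea is to route an ever-larger share of traffic to the new system while the old one keeps serving the rest, until the old system can be retired. A minimal sketch; the key names and rollout percentages here are made up for illustration.

```python
import hashlib

def route(key: str, new_system_share: float) -> str:
    """Strangler-style router: deterministically send a fixed share of
    keys to the new backend and the rest to the legacy one. Ramping
    new_system_share from 0.0 to 1.0 gradually 'strangles' the old path."""
    # Stable hash so the same key always lands on the same backend
    # for a given rollout percentage.
    bucket = int(hashlib.sha256(key.encode()).hexdigest(), 16) % 100
    return "new" if bucket < new_system_share * 100 else "legacy"

# At 0% everything stays on the legacy system; at 100% it is fully drained.
assert route("user-42", 0.0) == "legacy"
assert route("user-42", 1.0) == "new"
```

The deterministic hash matters: a given key sticks to one backend throughout a rollout stage, which keeps per-key behavior consistent while the share is ramped.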
Would love to see how they migrated the operator from the forked one to the standard one. Did they deploy both for a period of time, or un-deploy the fork first?
Migrating with MirrorMaker would be too resource-intensive at this scale. This is a clever design.
Here’s the thing nobody tells you when you graduate from “I deploy to a VPS” to “I’m cloud-native now”:
Kubernetes is not a more reliable version of your old server. It’s a fundamentally different relationship with reliability. And if you approach it the same way, your pods will keep dying and you’ll keep losing sleep.
Let’s talk about it.
https://rakiabensassi.substack.com/p/the-kubernetes-mortality-rate-everything