How Reddit Migrated Petabyte-Scale Kafka from EC2 to Kubernetes
The Reddit Engineering Team completed one of the most demanding infrastructure migrations in the company’s history. It moved its entire Apache Kafka fleet, comprising over 500 brokers and more than a petabyte of live data, from Amazon EC2 virtual machines onto Kubernetes.
The migration was done with zero downtime and without asking a single client application to change how it connected to Kafka.
In this article, we will break down this migration: the constraints the engineering team faced and how they moved the fleet, phase by phase, without disrupting production.
Disclaimer: This post is based on publicly shared details from the Reddit Engineering Team. Please comment if you notice any inaccuracies.
The Role of Kafka at Reddit
To put things into perspective, let us first understand what exactly Apache Kafka is.
Apache Kafka is an open-source distributed event streaming platform. Applications called producers write messages to Kafka topics, which are split into partitions, and other applications called consumers read those messages back out. Kafka sits in the middle and stores those messages durably, so producers and consumers can run at completely different times. A single Kafka server is called a broker, and a collection of brokers working together forms a cluster.
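To make the decoupling concrete, here is a minimal in-memory sketch of a broker as a set of append-only logs. This is a toy model and not the real Kafka client API: the class ToyBroker, the topic name "votes", and the message strings are all hypothetical.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a single broker (not the real Kafka API)."""

    def __init__(self):
        # topic name -> partition number -> append-only list of messages
        self.logs = defaultdict(lambda: defaultdict(list))

    def produce(self, topic, partition, message):
        """Append a message and return its offset within the partition."""
        log = self.logs[topic][partition]
        log.append(message)
        return len(log) - 1

    def consume(self, topic, partition, offset):
        """Return every message stored at or after the given offset."""
        return list(self.logs[topic][partition][offset:])


broker = ToyBroker()
broker.produce("votes", 0, "upvote:post-1")
broker.produce("votes", 0, "downvote:post-2")

# A consumer that starts later still sees everything, because the broker
# stored the messages instead of handing them over directly.
messages = broker.consume("votes", 0, offset=0)
```

The consumer reads by offset, which is why it does not need to be running when the producer writes.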
At Reddit, Apache Kafka is not a peripheral tool. It sits underneath hundreds of business-critical services, processing tens of millions of messages every second. If Kafka went down, large portions of Reddit would break.
Why Reddit Wanted to Move Away from EC2
Before the migration, Reddit managed its Kafka brokers on Amazon EC2 instances using a combination of Terraform, Puppet, and custom scripts. Operators handled upgrades, configuration changes, and machine replacements by running commands directly from their laptops. This worked while the fleet was small, but as it grew, the approach became increasingly slow, error-prone, and expensive. Reddit needed a more scalable and reliable way to operate Kafka.
Kubernetes, paired with a tool called Strimzi, offered that path.
Kubernetes is an open-source platform for running and managing containerized applications. Instead of manually provisioning and maintaining individual servers, Kubernetes lets developers describe what should be running and handles deployment, scaling, and recovery automatically. Strimzi, on the other hand, is a project under the Cloud Native Computing Foundation that specifically lets you run Kafka on Kubernetes. It provides a declarative way to manage Kafka clusters. This means that developers can describe what they want in a configuration file, and Strimzi handles deployment, upgrades, and maintenance. This promised fewer manual interventions and more predictable operations.
The Four Constraints That Shaped the Migration
Reddit did not jump straight into moving brokers. Before writing a single line of migration code, Reddit identified four hard constraints that ruled out entire categories of approaches. The constraints are as follows:
Kafka had to stay up. There was no acceptable maintenance window. Downtime, data loss, or forcing client applications to change their configuration was not an option. This ruled out scheduled cutovers, dual-write strategies, and replay-based migrations.
Kafka’s metadata could not be rebuilt from scratch. Apache Kafka maintains a detailed internal state called metadata. This includes information about which brokers exist, which broker holds which data, and where replicas of that data are stored. ZooKeeper, an external service, was responsible for managing this metadata. There is no supported way to recreate this metadata on a fresh cluster while keeping the system available. New brokers had to join the existing cluster rather than replace it.
Client connectivity was tightly coupled to specific brokers. Over time, applications across Reddit had been configured to connect directly to specific broker hostnames, typically the first few brokers in a cluster, rather than using a single load-balanced endpoint. Turning off those brokers would immediately break hundreds of services. Reddit did not control the layer through which clients found and connected to Kafka.
Every step had to be reversible. No single action during the migration could leave the system in a state from which recovery was impossible. This meant Reddit had to accept a long period where EC2 brokers and Kubernetes brokers ran side by side, and it meant that riskier changes had to wait until everything else was stable.
Phase 1: Taking Control of the Naming Layer
The first phase of the migration did not touch Kafka at all.
Reddit introduced a DNS facade, which is a set of DNS records that act as an intermediate layer between client applications and the actual Kafka brokers. DNS is the system that translates human-readable names into the addresses of servers. By creating new, infrastructure-controlled DNS names that initially pointed to the same EC2 brokers, Reddit changed nothing from the perspective of client applications.
Reddit then rolled out these new connection strings across more than 250 services using automated tooling that generated batch pull requests to update configuration files. Once all clients were talking through this DNS layer, Reddit could change where those names pointed, from EC2 to Kubernetes, without modifying any client code.
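The idea behind the facade can be sketched in a few lines. This is only an illustration under stated assumptions: the DnsFacade class is a dictionary standing in for real DNS records, and every hostname below is hypothetical, not one of Reddit's.

```python
class DnsFacade:
    """Toy DNS table: stable names that clients use -> current broker address."""

    def __init__(self):
        self.records = {}

    def point(self, name, target):
        """Create or update a record so that `name` resolves to `target`."""
        self.records[name] = target

    def resolve(self, name):
        return self.records[name]


def connect_targets(dns, servers):
    """Resolve each configured name to the address a client would reach."""
    return [dns.resolve(name) for name in servers]


dns = DnsFacade()

# Step 1: the facade names initially point at the same EC2 brokers as before.
dns.point("kafka-0.facade.example", "ec2-broker-a.internal.example")
dns.point("kafka-1.facade.example", "ec2-broker-b.internal.example")

# Client configuration references only the facade names (hypothetical values).
bootstrap_servers = ["kafka-0.facade.example", "kafka-1.facade.example"]
before = connect_targets(dns, bootstrap_servers)

# Step 2: later, the infrastructure team repoints the records at Kubernetes.
dns.point("kafka-0.facade.example", "k8s-broker-0.cluster.example")
dns.point("kafka-1.facade.example", "k8s-broker-1.cluster.example")
after = connect_targets(dns, bootstrap_servers)
```

The client's bootstrap_servers list is identical before and after the switch. Only the records behind the names change, which is what makes the later cutover invisible to applications.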
Phase 2: Making Room for New Brokers
Each Kafka broker is identified by a unique numeric ID. Strimzi assigns broker IDs starting at 0 by default. However, Reddit’s existing EC2 brokers already occupied those low numbers.
To free up that ID space, Reddit doubled the cluster size by adding new EC2 brokers with higher IDs, moved all data onto them, and then terminated the original low-numbered brokers. This opened up IDs 0, 1, 2, and so on for Strimzi-managed brokers to use.
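The ID shuffle can be illustrated with a small sketch. It assumes a simple one-to-one pairing between old and new brokers, which the source does not specify, and the broker IDs and partition names are made up for the example.

```python
def free_low_broker_ids(assignment, old_ids, new_ids):
    """Move every replica from an old broker to its paired new broker.

    The one-to-one pairing is an assumption made for illustration only.
    """
    mapping = dict(zip(old_ids, new_ids))
    return {
        partition: [mapping.get(broker, broker) for broker in replicas]
        for partition, replicas in assignment.items()
    }


old_ids = [0, 1, 2]        # original EC2 brokers occupying the low IDs
new_ids = [100, 101, 102]  # hypothetical higher IDs for the added EC2 brokers

# partition name -> ordered list of broker IDs holding a replica
assignment = {
    "events-0": [0, 1, 2],
    "events-1": [1, 2, 0],
}

moved = free_low_broker_ids(assignment, old_ids, new_ids)

# Once no partition references the old brokers, they can be terminated
# and their IDs become available again.
in_use = {broker for replicas in moved.values() for broker in replicas}
freed = sorted(set(old_ids) - in_use)
```

After the move, freed contains the low IDs that a Strimzi-managed broker could claim.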
See the diagram below:
Phase 3: Running a Mixed Cluster
This was the most technically complex phase.
Reddit needed Strimzi brokers running on Kubernetes to join the same cluster as the existing EC2 brokers and communicate with them directly. Strimzi does not support this out of the box, so Reddit created a fork of the Strimzi operator. The changes Reddit made were deliberately small and targeted:
The inter-broker listener configuration was set to use plaintext listeners accessible from both EC2 and Kubernetes, ensuring brokers in different environments could talk to each other.
The ZooKeeper connection was pointed at Reddit’s existing EC2-hosted ZooKeeper, so that both old and new brokers shared the same metadata store and were part of the same logical cluster.
The Cruise Control topic was overridden to stay consistent across both broker sets, allowing Reddit to use Cruise Control to move data between EC2 and Kubernetes brokers. Cruise Control is a Kafka tool that automates the process of rebalancing data across brokers in a controlled, measured way. It was central to the actual movement of data during the migration.
Running a forked operator in production carries risk. Reddit kept the scope of changes narrow and planned from the start to switch back to the standard Strimzi operator once the migration was complete.
Phase 4: Gradually Shifting Data and Traffic
With both sets of brokers running inside the same cluster, Reddit used Cruise Control to incrementally move partition leadership and replicated data from EC2 brokers to the Kubernetes brokers.
Partition leadership determines which broker is responsible for serving reads and writes for a given partition. Kafka stores copies of each partition on multiple brokers for redundancy, and the number of copies is called the replication factor. Moving data meant reassigning both the leadership and the replicas to the new set of brokers, one partition at a time.
Reddit monitored this process continuously. Over roughly a week, partition leadership on EC2 declined steadily while leadership on Strimzi climbed in parallel, and network traffic followed the same pattern. At every point, Reddit could pause or reverse the process if something looked wrong.
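The shape of an incremental, reversible move can be sketched as follows. This is not Cruise Control's algorithm or API. It is a simplified model in which the first replica in each list is treated as the leader, and the batch size, the health check, and the broker IDs are all hypothetical.

```python
def move_replicas(replicas, mapping):
    """Swap each EC2 broker ID for its Kubernetes counterpart, keeping order."""
    return [mapping.get(broker, broker) for broker in replicas]


def leaders_on(assignment, brokers):
    """Count partitions whose first replica (the leader here) is in `brokers`."""
    return sum(1 for replicas in assignment.values() if replicas[0] in brokers)


def migrate_in_batches(assignment, mapping, batch_size, healthy):
    """Move partitions batch by batch; undo the last batch and stop if unhealthy."""
    current = dict(assignment)
    partitions = sorted(current)
    for start in range(0, len(partitions), batch_size):
        batch = partitions[start:start + batch_size]
        previous = {p: current[p] for p in batch}  # kept so the step is reversible
        for p in batch:
            current[p] = move_replicas(current[p], mapping)
        if not healthy(current):
            current.update(previous)  # roll back only the batch that failed
            break
    return current


EC2 = {100, 101}           # hypothetical EC2 broker IDs
K8S = {0, 1}               # hypothetical Kubernetes broker IDs
mapping = {100: 0, 101: 1}

assignment = {
    "events-0": [100, 101],
    "events-1": [101, 100],
    "events-2": [100, 101],
    "events-3": [101, 100],
}

# A run where every check passes moves all leadership to Kubernetes.
done = migrate_in_batches(assignment, mapping, 2, lambda a: True)

# A run whose (artificial) check fails on the second batch stops halfway.
paused = migrate_in_batches(
    assignment, mapping, 2, lambda a: leaders_on(a, K8S) <= 2
)
```

Keeping the previous assignment for each batch is what lets the process pause partway with the cluster still in a valid mixed state.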
See the dashboard view below:

Phase 5: Migrating the Control Plane
ZooKeeper had managed Kafka’s metadata throughout the entire broker migration. Reddit made a deliberate choice not to change the control plane until after the data plane was fully stable on Kubernetes. This separation of concerns reduced the risk of compounding failures.
Once all EC2 brokers were terminated and all data and traffic were running on Kubernetes, Reddit executed the migration from ZooKeeper to KRaft. KRaft is Kafka’s built-in metadata management system that eliminates the need for ZooKeeper.
See the diagram below:
Since Strimzi and Kafka both provide documented steps for this migration, and because the rest of the system had already settled, this final phase was comparatively straightforward.
Phase 6: Cleaning Up and Handing Off to Standard Strimzi
After both the data plane and the control plane were fully running on Kubernetes, Reddit removed all the configuration overrides that the forked Strimzi operator had introduced.
Control of the clusters was handed off to the standard, unmodified Strimzi operator. The EC2 infrastructure was decommissioned.
Conclusion
Reddit’s migration is a good example of how large-scale infrastructure changes do not have to be dramatic, high-risk events. By breaking the work into small, reversible, well-understood steps and by respecting the constraints the system imposed, Reddit moved a petabyte-scale platform to Kubernetes without a single moment of downtime.
Some key lessons from Reddit’s migration journey were as follows:
Introducing a controllable abstraction layer between clients and infrastructure, whether that is DNS, a proxy, or an API gateway, is one of the highest-leverage changes you can make during a migration. It decouples the two sides and lets you change the infrastructure without forcing every team to update their code.
Metadata and logical state tend to outlive the physical machines they run on. When planning any large migration, treat the logical state as the thing you are protecting, and treat the infrastructure as something you are replacing around it.
Designing each step to be undoable is not just a safety measure. It changes how confidently and quickly you can move forward, because you know you can always step back if something goes wrong.
A migration that looks messy in the middle but never breaks production is far preferable to a clean design that requires a moment where things could go wrong with no recovery path.