Disclaimer: The details in this post have been derived from the Agoda Engineering Blog. All credit for the technical details goes to the Agoda engineering team. The links to the original articles are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.
Agoda sends around 1.8 trillion events per day through Apache Kafka.
Since 2015, Kafka usage at Agoda has grown tremendously, roughly doubling year over year on average.
Kafka supports multiple use cases at Agoda, which are as follows:
Analytical data management
Feeding data into the data lake
Near real-time monitoring and alerting solutions
Building asynchronous APIs
Data replication across data centers
Serving data to and from Machine Learning pipelines
As the scale of Kafka usage grew, the engineering team ran into multiple challenges that forced them to develop their own solutions.
In this post, we’ll examine some key challenges that Agoda faced and the solutions they implemented.
Simplifying How Developers Send Data to Kafka
One of the first changes Agoda made was around sending data to Kafka.
Agoda built a 2-step logging architecture:
A client library writes events to disk. It handles file rotations and manages the write file locations.
A separate daemon process (Forwarder) reads the events and forwards them to Kafka. It is responsible for reading the files, sending the events to Kafka, tracking file offsets, and managing the deletion of completed files.
See the diagram below:
The architecture separates operational concerns from development teams, allowing the Kafka team to perform tasks like dynamic configuration, optimizations, and upgrades independently. The client library offers producers a simplified API, enforces serialization standards, and adds a layer of resiliency.
The tradeoff is increased latency in exchange for resiliency and flexibility: analytics workloads see a 99th-percentile latency of 10 seconds. For critical, time-sensitive use cases requiring sub-second latency, applications can bypass the 2-step logging architecture and write to Kafka directly.
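Agoda hasn't published the Forwarder's code, so here is a minimal Java sketch of the idea: the client library appends newline-delimited events to a local file, and the daemon below replays them into Kafka while persisting its read offset so it can resume after a restart. The file paths and topic name are made up for illustration:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

/** Toy forwarder: replays newline-delimited events from a log file into
 *  Kafka, persisting its read offset so it can resume after a restart. */
public class Forwarder {
    public static void main(String[] args) throws IOException {
        Path log = Paths.get("/var/log/app/events.log");        // written by the client library
        Path offsetFile = Paths.get("/var/log/app/events.offset");

        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Resume from the last persisted position, if any.
        long offset = Files.exists(offsetFile)
                ? Long.parseLong(Files.readString(offsetFile).trim()) : 0L;

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             RandomAccessFile file = new RandomAccessFile(log.toFile(), "r")) {
            file.seek(offset);
            String line;
            while ((line = file.readLine()) != null) {
                producer.send(new ProducerRecord<>("app-events", line));
                // For at-least-once delivery, a real forwarder would advance
                // this offset only in the producer's send callback.
                Files.writeString(offsetFile, Long.toString(file.getFilePointer()));
            }
            producer.flush();
        }
        // A real forwarder would also watch for file rotation and delete
        // files whose contents are fully acknowledged by Kafka.
    }
}
```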
Splitting Kafka Clusters Based On Use Cases
Agoda made a strategic decision early on to split their Kafka clusters based on use cases instead of having a single large Kafka cluster per data center.
This means that instead of having one massive Kafka cluster serving all kinds of workloads, they have multiple smaller Kafka clusters, each dedicated to a specific use case or set of use cases.
The main reasons for this approach are:
By having separate clusters for different use cases, any issues that arise in one cluster will be contained within that cluster and won’t affect the others.
Different use cases may have different requirements in terms of performance, reliability, and data retention.
For example, a cluster used for real-time data processing might be configured with shorter retention periods and higher network throughput to handle the high volume of data.
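The exact per-cluster settings aren't public. As a hedged illustration, here is how a short-retention topic for a real-time cluster could be created with Kafka's AdminClient (the cluster address, topic name, and values are assumptions):

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateRealtimeTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "realtime-cluster:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Short retention: real-time consumers read data quickly,
            // so there is no need to keep it around for days.
            NewTopic topic = new NewTopic("price-updates", 12, (short) 3)
                    .configs(Map.of(TopicConfig.RETENTION_MS_CONFIG,
                                    String.valueOf(6 * 60 * 60 * 1000L))); // 6 hours
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```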
In addition to splitting Kafka clusters by use case, Agoda also provisions dedicated physical nodes for Zookeeper, separate from the Kafka broker nodes. Zookeeper is a critical component in a Kafka cluster, responsible for managing the cluster's metadata, coordinating broker leader elections, and maintaining configuration information.
Monitoring and Auditing Kafka
From a monitoring point of view, Agoda uses JMXTrans to collect Kafka broker metrics.
JMXTrans is a tool that connects to JMX (Java Management Extensions) endpoints and collects metrics. These metrics are then sent to Graphite, a database that stores numeric time-series data.
The collected metrics include things like broker throughput, partition counts, consumer lag, and various other Kafka-specific performance indicators.
The metrics stored in Graphite are visualized using Grafana, a popular open-source platform for monitoring and observability. Grafana allows the creation of customizable dashboards that display real-time and historical data from Graphite.
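JMXTrans itself is configured with JSON, but under the hood it simply reads JMX MBeans from the brokers. For illustration, the snippet below reads a broker's messages-in rate directly through the standard JMX API (the host, port, and choice of metric are assumptions):

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class BrokerMetricProbe {
    public static void main(String[] args) throws Exception {
        // Kafka brokers expose JMX on the port configured via JMX_PORT.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://kafka-broker-1:9999/jmxrmi");

        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            ObjectName mbean = new ObjectName(
                    "kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec");
            // One-minute moving average of messages received per second.
            Object rate = conn.getAttribute(mbean, "OneMinuteRate");
            System.out.println("MessagesInPerSec (1m rate): " + rate);
        }
    }
}
```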
For auditing, Agoda implemented a custom Kafka auditing system. The primary goal of this auditing system is to ensure data completeness, reliability, accuracy, and timeliness across the entire Kafka pipeline.
Here’s how it works:
Audit counts are generated at various points throughout the pipeline.
A separate background thread runs in Agoda's client libraries as part of the 2-step logging architecture we discussed earlier. This thread asynchronously aggregates message counts across time buckets to generate audits.
The generated audit data is stored in a separate Kafka cluster dedicated to audit information. This ensures the audit data doesn’t interfere with the main data pipelines.
The audit information ultimately ends up in two places:
Whitefalcon: Agoda's internal near real-time analytics platform
Hadoop: longer-term storage and analysis
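Agoda hasn't open-sourced the auditing system. The sketch below shows one plausible shape of the client-side piece: a background task counts messages into one-minute buckets and flushes closed buckets to a dedicated audit topic. All names and the bucket size are assumptions:

```java
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

/** Counts produced messages per one-minute bucket and flushes closed
 *  buckets to a dedicated audit topic on a separate audit cluster. */
public class AuditCounter {
    private static final long BUCKET_MS = 60_000;

    private final Map<Long, LongAdder> buckets = new ConcurrentHashMap<>();
    private final KafkaProducer<String, String> auditProducer;
    private final ScheduledExecutorService flusher =
            Executors.newSingleThreadScheduledExecutor();

    public AuditCounter(String auditClusterBootstrap) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, auditClusterBootstrap);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        auditProducer = new KafkaProducer<>(props);
        flusher.scheduleAtFixedRate(this::flushClosedBuckets, 1, 1, TimeUnit.MINUTES);
    }

    /** Called on the hot path for every message the application sends. */
    public void record() {
        long bucket = System.currentTimeMillis() / BUCKET_MS;
        buckets.computeIfAbsent(bucket, b -> new LongAdder()).increment();
    }

    /** Emits counts for buckets that can no longer receive increments. */
    private void flushClosedBuckets() {
        long current = System.currentTimeMillis() / BUCKET_MS;
        buckets.forEach((bucket, count) -> {
            if (bucket < current) {
                buckets.remove(bucket);
                String value = "{\"bucketStartMs\":" + bucket * BUCKET_MS
                        + ",\"count\":" + count.sum() + "}";
                auditProducer.send(new ProducerRecord<>("audit-counts", value));
            }
        });
    }
}
```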
Authentication and ACLs
Initially, Agoda’s Kafka clusters were used primarily for application telemetry data, and authentication wasn’t deemed necessary.
As Kafka usage grew exponentially, concerns arose about the inability to identify and manage users who might be abusing or negatively impacting Kafka cluster performance. Agoda completed and released its Kafka Authentication and Authorization system in 2021.
The Authentication and Authorization system consists of the following components:
Core Kafka Authentication: It likely uses SASL (Simple Authentication and Security Layer) mechanisms supported by Kafka.
ACLs: Access Control Lists for fine-grained permission management.
Credential Generation: A custom component for creating and managing user credentials.
Credential Assignment: A system to associate credentials with specific users or teams.
Self-Service Portal: An interface allowing teams to request Kafka credentials and ACLs without direct intervention from the Kafka team.
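The portal's internals aren't public, but the ACL component maps onto Kafka's standard admin API. As a sketch, granting a hypothetical "supply-team" principal write access to a topic looks like this:

```java
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.acl.AccessControlEntry;
import org.apache.kafka.common.acl.AclBinding;
import org.apache.kafka.common.acl.AclOperation;
import org.apache.kafka.common.acl.AclPermissionType;
import org.apache.kafka.common.resource.PatternType;
import org.apache.kafka.common.resource.ResourcePattern;
import org.apache.kafka.common.resource.ResourceType;

public class GrantProducerAcl {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        // (SASL credentials for the admin connection omitted for brevity.)

        try (AdminClient admin = AdminClient.create(props)) {
            // Allow the "supply-team" principal to write to the topic
            // from any host.
            AclBinding binding = new AclBinding(
                    new ResourcePattern(ResourceType.TOPIC, "price-updates",
                                        PatternType.LITERAL),
                    new AccessControlEntry("User:supply-team", "*",
                                           AclOperation.WRITE,
                                           AclPermissionType.ALLOW));
            admin.createAcls(List.of(binding)).all().get();
        }
    }
}
```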
Kafka Load Balancing
Agoda, as an online travel booking platform, aims to offer its customers the most competitive and current prices for accommodations and services from a wide range of external suppliers, including hotels, restaurants, and transportation providers.
To achieve this, Agoda's supply system is designed to efficiently process and incorporate a vast number of real-time price updates received from these suppliers. A single supplier can send 1.5 million price and offer updates in a single minute. Any delay or failure in reflecting these updates can lead to incorrect pricing and booking failures.
Agoda uses Kafka to handle these incoming price updates. Kafka partitions help them achieve parallelism by distributing the workload across multiple partitions and consumers.
See the diagram below:
The Partitioner and Assignor Strategy
Apache Kafka's message distribution and consumption are heavily influenced by two key strategies: the partitioner and the assignor.
The partitioner strategy determines how incoming messages are allocated across partitions during production. Common approaches include round-robin distribution and sticky partitioning.
On the consumer side, the assignor strategy dictates how partitions are distributed among consumers within a consumer group. Examples include range assignments and round-robin assignments.
See the diagram below for reference:
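Both strategies are pluggable through standard client configuration. The snippet below wires in Kafka's built-in round-robin partitioner and round-robin assignor (the bootstrap address and group id are placeholders):

```java
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.RoundRobinAssignor;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.RoundRobinPartitioner;

public class StrategyConfig {
    public static void main(String[] args) {
        // Producer side: the partitioner decides which partition each
        // outgoing record lands on.
        Properties producerProps = new Properties();
        producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        producerProps.put(ProducerConfig.PARTITIONER_CLASS_CONFIG,
                RoundRobinPartitioner.class.getName());

        // Consumer side: the assignor decides how partitions are split
        // among the members of a consumer group.
        Properties consumerProps = new Properties();
        consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, "processor");
        consumerProps.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
                RoundRobinAssignor.class.getName());
    }
}
```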
Traditionally, these strategies were designed with the assumption that all consumers have similar processing capabilities and that all messages require roughly the same amount of processing time.
However, Agoda's real-world scenario deviated from these assumptions, leading to significant load-balancing challenges in their Kafka implementation.
There were two primary challenges:
Hardware Heterogeneity: Agoda's use of a private cloud infrastructure with Kubernetes resulted in pods being deployed across servers with varying hardware specifications. Benchmark tests revealed substantial performance disparities between different hardware generations.
Inconsistent Message Workloads: The processing requirements for different messages varied considerably. Some messages necessitated additional steps such as third-party API calls or database queries, leading to unpredictable processing times and latency fluctuations.
These challenges ultimately resulted in an over-provisioning problem, where resources were inefficiently allocated to compensate for the load imbalances caused by hardware differences and varying message processing demands.
Overprovisioning Problem at Agoda
Over-provisioning means allocating more resources than should be necessary to handle the expected peak workload.
To illustrate this, let's consider a scenario where Agoda's processor service employs Kafka consumers running on heterogeneous hardware:
They have two high-performance workers, each capable of processing 20 messages per second.
Additionally, they have one slower worker that can only handle 10 messages per second.
Theoretically, this setup should be able to process a total of 50 messages per second (20 + 20 + 10). However, when using a round-robin distribution strategy, each worker receives an equal share of the messages, regardless of their processing capabilities. If the incoming message rate consistently reaches 50 messages per second, the following issues arise:
The two faster workers can comfortably handle their allocated share of approximately 16.7 messages per second each.
The slower worker, on the other hand, struggles to keep up with its assigned 16.7 messages per second, resulting in a growing lag over time.
See the diagram below:
To maintain acceptable latency and meet processing SLAs, Agoda would need to allocate additional resources to this setup.
In this example, they would have to scale out to five machines to effectively process 50 messages per second: under round-robin, each of the n workers receives 50/n messages per second, and the slow worker only keeps up when 50/n is at most 10, which requires n to be at least 5. Agoda would thus be over-provisioning by two extra machines because the distribution logic fails to consider the varying processing capabilities of the hardware.
A similar scenario can occur when the processing workload for each message varies, even if the hardware is homogeneous.
In both cases, this leads to several negative consequences:
Higher hardware costs due to the need for additional resources.
Inefficient utilization of resources, with some consumers being underutilized while others are overburdened.
Increased maintenance overhead to manage the overprovisioned infrastructure.
The round-robin distribution strategy, while ensuring an equal distribution of messages across consumers, fails to account for the heterogeneity in hardware performance and message processing workload.
Agoda’s Dynamic Lag-Aware Solution
To solve this, Agoda adopted a dynamic, lag-aware approach to Kafka load balancing. They ruled out static solutions such as weighted load balancing because message workloads are non-uniform.
They implemented two main strategies:
Lag-aware Producer
Lag-aware Consumer
Lag-Aware Producer
A lag-aware producer is a dynamic approach to load balancing in Apache Kafka that adjusts message partitioning based on the current lag information of the target topic.
It works as follows:
The producer maintains a cached copy of partition lag data to minimize the frequency of requests to Kafka brokers for this information.
The producer uses the lag data to intelligently distribute messages across partitions using a custom algorithm. The algorithm is designed to send fewer messages to partitions with high lag and more messages to partitions with low lag. Agoda uses algorithms such as a same-queue-length algorithm and an outlier-detection algorithm.
When the lags across partitions are balanced and stable, the lag-aware producer ensures an even distribution of messages.
Let's consider an example scenario in Agoda's supply system, where an internal producer publishes task messages to a processor.
The target topic has 6 partitions with the following lag distribution:
Partition 1: 110 messages
Partition 2: 150 messages
Partition 3: 80 messages
Partition 4: 400 messages
Partition 5: 120 messages
Partition 6: 380 messages
In this situation, the lag-aware producer would identify that partitions 4 and 6 have significantly higher lag compared to the other partitions. As a result, it would adapt its partitioning strategy to send fewer messages to partitions 4 and 6 while directing more messages to the partitions with lower lag (partitions 1, 2, 3, and 5).
By dynamically adjusting the message distribution based on the current lag state, the lag-aware producer helps to rebalance the workload across partitions, preventing further lag accumulation on the already overloaded partitions.
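Agoda hasn't published its partitioner code. As a minimal sketch of the idea, the custom Partitioner below weights each partition by the inverse of its cached lag and samples proportionally; the cache refresh (for example, via AdminClient offset queries) is assumed to happen in a background thread that isn't shown:

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;

import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;
import org.apache.kafka.common.PartitionInfo;

/**
 * Sketch of a lag-aware partitioner: partitions with higher cached lag
 * receive proportionally fewer messages. The lag cache is assumed to be
 * refreshed periodically by a separate thread (not shown), e.g. by
 * comparing consumer group offsets against end offsets via AdminClient.
 */
public class LagAwarePartitioner implements Partitioner {

    // partition -> last observed lag; refreshed out-of-band
    private final Map<Integer, Long> lagCache = new ConcurrentHashMap<>();

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        List<PartitionInfo> partitions = cluster.partitionsForTopic(topic);

        // Weight each partition by 1 / (1 + lag): low lag => high weight.
        double[] weights = new double[partitions.size()];
        double total = 0;
        for (int i = 0; i < partitions.size(); i++) {
            long lag = lagCache.getOrDefault(partitions.get(i).partition(), 0L);
            weights[i] = 1.0 / (1.0 + lag);
            total += weights[i];
        }

        // Sample a partition in proportion to its weight.
        double r = ThreadLocalRandom.current().nextDouble() * total;
        for (int i = 0; i < weights.length; i++) {
            r -= weights[i];
            if (r <= 0) {
                return partitions.get(i).partition();
            }
        }
        return partitions.get(partitions.size() - 1).partition();
    }

    @Override
    public void configure(Map<String, ?> configs) { }

    @Override
    public void close() { }
}
```

With equal lags, the weights are uniform and the distribution degenerates to an even spread, matching the behavior described above. In the example distribution, partition 4 (lag 400) would receive roughly a fifth of the traffic of partition 3 (lag 80).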
Lag-Aware Consumer
Lag-aware consumers are the solution employed when multiple consumer groups are subscribed to the same Kafka topic. In that situation, lag-aware producers are less effective because each consumer group accumulates its own, different lag on the same partitions.
The process works as follows:
In a downstream service, such as Agoda's Processor, if a particular consumer instance detects that it has fallen significantly behind in processing messages (i.e., it has a high lag), it can voluntarily unsubscribe from the topic. This action triggers a rebalance operation.
During the rebalance, a custom partition Assigner, developed by Agoda, reassigns the partitions across all the remaining consumer instances. The redistribution takes into account each consumer's current lag and processing capacity, ensuring a more balanced workload.
To minimize the performance impact of rebalancing, Agoda leverages Kafka 2.4's incremental cooperative rebalance protocol. This protocol allows for more frequent partition reassignments without causing significant disruptions to the overall processing flow.
Let's illustrate this with an example.
Suppose Agoda's Processor service has three consumer instances (workers) that are consuming messages from six partitions of a topic:
Worker 1 is responsible for processing messages from Partitions 1 and 2
Worker 2 handles Partitions 3 and 4
Worker 3 processes messages from Partitions 5 and 6
If Worker 3 happens to be running on older, slower hardware compared to the other workers, it may struggle to keep up with the message influx in Partitions 5 and 6, resulting in higher lag. In this situation, Worker 3 can proactively unsubscribe from the topic, triggering a rebalance event.
During the rebalance, the custom Assigner evaluates the current lag and processing capacity of each worker and redistributes the partitions accordingly. For example, it may assign Partition 5 to Worker 1 and Partition 6 to Worker 2, effectively relieving Worker 3 of its workload until the lag is reduced to an acceptable level.
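Agoda's custom Assigner isn't public, but the triggering behavior can be sketched with the standard consumer API: the worker estimates its own lag from end offsets and voluntarily unsubscribes when it falls too far behind, which starts a cooperative rebalance. The topic name, threshold, and back-off are assumptions, and Kafka's built-in CooperativeStickyAssignor stands in for Agoda's custom assignor:

```java
import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.Set;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class LagAwareWorker {
    private static final long LAG_THRESHOLD = 10_000;

    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "processor");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Incremental cooperative rebalancing (Kafka 2.4+) moves only the
        // partitions that need to move instead of revoking everything.
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
                CooperativeStickyAssignor.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("price-updates"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                records.forEach(r -> { /* process the message */ });

                // Estimate this instance's lag: end offset minus current
                // position, summed over its assigned partitions.
                Set<TopicPartition> assigned = consumer.assignment();
                Map<TopicPartition, Long> ends = consumer.endOffsets(assigned);
                long lag = assigned.stream()
                        .mapToLong(tp -> ends.get(tp) - consumer.position(tp))
                        .sum();

                if (lag > LAG_THRESHOLD) {
                    // Voluntarily give up all partitions so the assignor can
                    // redistribute them across the remaining instances.
                    consumer.unsubscribe();
                    Thread.sleep(30_000); // back off while others catch up
                    consumer.subscribe(List.of("price-updates"));
                }
            }
        }
    }
}
```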
Conclusion
In conclusion, Agoda's journey with Apache Kafka has been one of continuous growth, learning, and adaptation.
By implementing strategies such as the 2-step logging architecture, splitting Kafka clusters based on use cases, developing robust monitoring and auditing systems, and building lag-aware Kafka load balancing, Agoda has successfully managed the challenges that come with handling 1.8 trillion events per day.
As Agoda continues to evolve and grow, its Kafka setup will undoubtedly play a crucial role in supporting the company's ever-expanding needs. The various solutions also provide great learning for other software developers in the wider community when it comes to adapting Kafka to their organizational needs.
References:
Thanks for the very informative post.
In the section "Lag-Aware Producer," it is mentioned that there is lag data for the partitions of a topic. Since processing, and therefore lag, is a consumer-side property (different consumer groups can have different lag on the same partitions), how is this lag mapped to the producer side?
Thanks for the post!
It seems the Lag-Aware Producer is used for events without a key. For events with keys, this strategy may not be suitable due to the risk of out-of-order processing.
The Lag-Aware Consumer, however, raises some concerns. According to the concept, "if a particular consumer instance detects that it has fallen significantly behind in processing messages (i.e., it has high lag), it can voluntarily unsubscribe from the topic."
Typically, when lag increases, it’s often not due to slower hardware but rather an insufficient number of consumers in the group to handle the workload in parallel. If a lagging consumer unsubscribes in such cases, it could make the problem worse by placing even more load on the remaining consumers.