Disclaimer: The details in this post have been derived from the Uber Engineering Blog. All credit for the technical details goes to the Uber engineering team. The links to the original articles are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.
MySQL serves as the backbone for Uber’s vast and complex operations. For many years, Uber relied upon MySQL version 5.7 to support business-critical features.
However, in 2023, Uber decided to upgrade its fleet from MySQL version 5.7 to version 8.0.
In this post, we’ll look at the need for this and the challenges Uber faced in such a large-scale upgrade. We will also investigate the solutions Uber used to achieve the upgrade without violating the Service-Level Objective (SLO).
The Need for the Upgrade
The decision to upgrade Uber's MySQL infrastructure from version 5.7 to 8.0 was driven by several critical factors.
First, MySQL 5.7 was reaching its end-of-life, meaning it would no longer receive security updates or bug fixes, leaving Uber's infrastructure vulnerable to potential security risks and operational instability. Upgrading to MySQL 8.0 mitigated these risks by ensuring ongoing support and security improvements.
Additionally, MySQL 8.0 offered significant performance and concurrency enhancements such as:
Improved indexing and resource utilization: This led to faster query execution and better concurrency handling, crucial for Uber’s high-traffic operations.
Enhanced performance: These optimizations reduced latency and improved the overall user experience by supporting smoother transaction processing.
Beyond performance, MySQL 8.0 introduced several new functionalities such as:
Window functions and enhanced JSON handling: These improvements allowed more efficient data querying and manipulation.
Improved spatial data capabilities: This enabled more advanced processing of geographic data which is important for location-based services.
"Dual passwords" for smoother password rotations: This feature allowed Uber to rotate passwords during security incidents without causing service disruptions, enhancing security protocols.
Instant ADD Column functionality: This feature allowed schema changes to be made with minimal downtime, streamlining Uber's database management and ensuring high service availability.
Overall, these performance, security, and operational benefits made the transition to MySQL 8.0 a critical move for Uber's data infrastructure.
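To make two of these features concrete, the sketch below builds the MySQL 8.0 statements behind a dual-password rotation and an instant column addition. The helper functions and the account, table, and column names are hypothetical and purely illustrative. They only return SQL text, do no escaping of their inputs, and do not reflect how Uber's tooling issues these statements.

```python
from typing import List


def dual_password_rotation(user: str, host: str, new_password: str) -> List[str]:
    """Return the two statements of a dual-password rotation (MySQL 8.0.14+)."""
    account = f"'{user}'@'{host}'"
    return [
        # Step 1: set the new password while the old one keeps working.
        f"ALTER USER {account} IDENTIFIED BY '{new_password}' RETAIN CURRENT PASSWORD;",
        # Step 2: once every client uses the new password, drop the old one.
        f"ALTER USER {account} DISCARD OLD PASSWORD;",
    ]


def instant_add_column(table: str, column: str, column_type: str) -> str:
    """Return an ALTER TABLE that explicitly requests the INSTANT algorithm."""
    return f"ALTER TABLE {table} ADD COLUMN {column} {column_type}, ALGORITHM=INSTANT;"


# Hypothetical names, for illustration only.
for statement in dual_password_rotation("app_user", "%", "new-secret"):
    print(statement)
print(instant_add_column("trips", "surge_multiplier", "DECIMAL(4,2)"))
```

Requesting ALGORITHM=INSTANT explicitly makes the statement fail instead of silently falling back to a slower table rebuild when the instant path is not available.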
The Scale of The Upgrade
Uber’s MySQL infrastructure is vast, operating at a scale that supports its global platform operations. Here are some stats about the overall scale that show the critical role of MySQL in Uber’s services:
The system is composed of over 2,100 MySQL clusters.
The clusters are spread across 19 production zones in different regions.
More than 16,000 nodes manage the massive volumes of data.
These clusters handle petabytes of data and serve around 3 million queries per second.
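A quick back-of-the-envelope calculation helps put these numbers in perspective. The sketch below treats the published figures as round lower bounds, so the averages are rough estimates derived here and not numbers reported by Uber.

```python
# Published figures, treated as approximate lower bounds.
CLUSTERS = 2_100
NODES = 16_000
QUERIES_PER_SECOND = 3_000_000


def per_unit_average(total: float, units: int) -> float:
    """Average share of a total across a number of units."""
    return total / units


nodes_per_cluster = per_unit_average(NODES, CLUSTERS)
qps_per_cluster = per_unit_average(QUERIES_PER_SECOND, CLUSTERS)

print(f"~{nodes_per_cluster:.1f} nodes per cluster on average")
print(f"~{qps_per_cluster:.0f} queries per second per cluster on average")
```

On average, that is between seven and eight nodes and on the order of 1,400 queries per second per cluster. Real clusters will vary widely around these averages.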
Also, to ensure high availability and data redundancy, Uber employs a primary-secondary replication architecture. It works as follows:
Primary node: Responsible for handling all write traffic in each cluster.
Secondary nodes: Replicate the data asynchronously from the primary node, ensuring redundancy and fault tolerance. These secondary nodes are distributed across multiple data centers to enhance data availability and support seamless failover in case of primary node failure.
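The asynchronous part of this design is worth illustrating. The toy model below is a minimal sketch, not MySQL's actual replication protocol: the Primary and Replica classes are hypothetical, and a plain Python list stands in for the binary log. It shows that a write is acknowledged by the primary before any secondary has applied it, which is why secondaries can lag.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Primary:
    data: dict = field(default_factory=dict)
    log: list = field(default_factory=list)  # stand-in for the binary log

    def write(self, key, value) -> None:
        # The write completes without waiting for any replica.
        self.data[key] = value
        self.log.append((key, value))


@dataclass
class Replica:
    name: str
    data: dict = field(default_factory=dict)
    applied: int = 0  # number of log entries applied so far

    def catch_up(self, log: list, max_entries: Optional[int] = None) -> None:
        """Apply pending log entries; max_entries mimics replication lag."""
        end = len(log) if max_entries is None else min(len(log), self.applied + max_entries)
        for key, value in log[self.applied:end]:
            self.data[key] = value
        self.applied = end

    def lag(self, log: list) -> int:
        """Entries written on the primary but not yet applied here."""
        return len(log) - self.applied


primary = Primary()
replica = Replica("secondary-1")
primary.write("trip:1", "requested")
primary.write("trip:1", "completed")
replica.catch_up(primary.log, max_entries=1)  # replica is one entry behind
print(replica.data, "lag:", replica.lag(primary.log))
replica.catch_up(primary.log)                 # replica is now fully caught up
print(replica.data, "lag:", replica.lag(primary.log))
```

A failover tool would typically want the lag to be zero, or close to it, before promoting a secondary.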
Challenges with the Upgrade
Several challenges had to be addressed during the upgrade of Uber’s MySQL fleet from version 5.7 to 8.0. Some of the major ones are as follows:
Manual upgrades were not possible due to the sheer scale of Uber’s MySQL infrastructure. It was important to have a detailed upgrade strategy that could be executed efficiently across diverse environments.
Uber’s platform operates globally, meaning that downtime could significantly impact services. Maintaining SLOs throughout the upgrade was crucial.
It was important to ensure compatibility with Uber’s existing applications and services. Since upgrading from MySQL 5.7 to 8.0 introduced new features and syntax changes that could potentially break existing queries, extensive testing was needed.
Uber conducted thorough regression checks and validation tests to ensure all existing systems and applications continued to work seamlessly with the upgraded database.
This process included testing in a staging environment before making production upgrades. By validating every aspect of the system, Uber was able to mitigate the risk of any unexpected issues after the upgrade.
Finally, Uber implemented automated rollback mechanisms to safeguard the upgrade process.
In the event of any failures or compatibility issues during the upgrade, these mechanisms could automatically revert the changes, ensuring the maintenance of service continuity and data integrity.
For instance, in the pre-maintenance stage, where the new MySQL 8.0 nodes operated as replicas, if performance issues or system degradation were detected, Uber could instantly roll back to MySQL 5.7 without any risk of data loss. The rollback capability was crucial for addressing any latency, resource consumption, or service degradation issues, allowing Uber to revert to a stable state until the issues were resolved.
However, once a MySQL 8.0 node was promoted to primary, rolling back to MySQL 5.7 became more complex because the remaining 5.7 nodes could no longer replicate from the new 8.0 primary. In other words, Uber had to ensure everything was functioning correctly before promoting the new nodes to avoid irreversible complications.
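The rollback constraint above comes down to a simple rule: a replica should run the same version as its source or a newer one. The sketch below encodes only that rule. The function names are hypothetical, and real compatibility checks involve more than comparing version numbers.

```python
def parse_version(version: str) -> tuple:
    """Turn a string such as '8.0' into a comparable tuple such as (8, 0)."""
    return tuple(int(part) for part in version.split("."))


def can_replicate(source_version: str, replica_version: str) -> bool:
    """A replica is expected to be at least as new as the node it replicates from."""
    return parse_version(replica_version) >= parse_version(source_version)


def rollback_is_simple(primary_version: str, old_version: str = "5.7") -> bool:
    """Rollback stays simple while old-version nodes can still follow the primary."""
    return can_replicate(primary_version, old_version)


# While 5.7 is still primary, 8.0 replicas can follow it and be removed at will.
print(can_replicate("5.7", "8.0"), rollback_is_simple("5.7"))
# After an 8.0 node is promoted, 5.7 nodes can no longer follow the primary.
print(can_replicate("8.0", "5.7"), rollback_is_simple("8.0"))
```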
Upgrade Strategy
When upgrading its massive MySQL infrastructure from version 5.7 to 8.0, Uber had two possible strategies to choose from: side-by-side upgrade and in-place upgrade.
In-Place Upgrade
An in-place upgrade involves directly upgrading the existing MySQL installation to the new version (MySQL 8.0) on the same nodes.
The process typically requires stopping the MySQL service, upgrading the software, and restarting it. While this method can be simpler in terms of setup, it also comes with significant drawbacks:
Extended downtime: Since the MySQL service must be stopped during the upgrade, this approach leads to a noticeable period of downtime. For a global platform like Uber, even a brief service interruption can have a major impact.
Limited rollback: If issues arise after the upgrade, rolling back to the previous version can be difficult. In-place upgrades provide less flexibility in case of failure, making it harder to revert to a stable state.
Risk of data loss or degradation: Any problems encountered during the in-place upgrade might lead to data loss or degradation of system performance, with fewer opportunities to recover without downtime.
Due to these limitations, Uber decided against the in-place upgrade method.
Side-by-Side Upgrade
Uber chose a side-by-side upgrade approach, which allowed for a smoother and less risky transition.
See the diagram below:
In this method, the new MySQL 8.0 nodes were set up and operated alongside the existing MySQL 5.7 nodes.
This approach was more suitable for Uber’s infrastructure due to the following reasons:
Minimal downtime: With the side-by-side method, the old MySQL 5.7 nodes remained operational while the new MySQL 8.0 nodes were being deployed. This allowed Uber to gradually transfer traffic from the old nodes to the new ones, avoiding significant service disruptions.
Easier rollback: If any issues occurred with the new MySQL 8.0 nodes, Uber could easily revert to the old MySQL 5.7 nodes. Since the old nodes were still running, the rollback process was simple and low-risk, reducing the chance of data loss or service degradation.
Thorough testing: Running the two versions side-by-side allowed Uber to fully test the new MySQL 8.0 nodes with real production traffic before completing the migration. This ensured that problems were detected and addressed before fully switching to the new version.
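The gradual traffic transfer mentioned above can be pictured as a schedule of traffic shares. The sketch below is an assumption made for illustration: the source does not describe how Uber weighted traffic between versions, and the traffic_split function and its step model are hypothetical.

```python
def traffic_split(step: int, total_steps: int) -> dict:
    """Share of traffic per version at a given rollout step (0 = all on 5.7)."""
    if total_steps <= 0 or not 0 <= step <= total_steps:
        raise ValueError("step must be between 0 and total_steps")
    new_share = step / total_steps
    return {"mysql_5_7": 1.0 - new_share, "mysql_8_0": new_share}


# Walk through a hypothetical four-step rollout.
for step in range(5):
    print(step, traffic_split(step, 4))
```

In this model a rollback is just a return to step 0, which works only because the 5.7 nodes are still running.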
Scaling the Upgrade Process with Automation
To manage the complexity of upgrading such a large infrastructure, Uber implemented an automated workflow.
With more than 2,100 clusters and over 16,000 nodes, upgrading each node manually was an impossible task. Automation ensured that the process was scalable, efficient, and far less prone to human error.
Two main aspects of this automation are:
Monitoring and alerts: The system was designed to automatically monitor each stage of the upgrade, notifying the engineering team if any problems occurred. This allowed Uber to handle the upgrade across thousands of nodes without risking service stability.
Risk mitigation: The automated workflows minimized the risk of human error and allowed for quick intervention if any issues were detected during the upgrade process.
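One common way to combine these two aspects is to upgrade in small batches and halt as soon as a health check fails. The sketch below illustrates that general pattern and is not Uber's actual workflow. The upgrade, is_healthy, and alert callbacks are hypothetical placeholders for real tooling.

```python
from typing import Callable, List


def upgrade_in_batches(
    clusters: List[str],
    batch_size: int,
    upgrade: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    alert: Callable[[str], None],
) -> List[str]:
    """Upgrade clusters batch by batch; stop and alert on the first unhealthy batch.

    Returns the clusters whose batch completed and passed the health check.
    """
    completed: List[str] = []
    for start in range(0, len(clusters), batch_size):
        batch = clusters[start:start + batch_size]
        for cluster in batch:
            upgrade(cluster)
        unhealthy = [cluster for cluster in batch if not is_healthy(cluster)]
        if unhealthy:
            alert(f"halting rollout, unhealthy clusters: {', '.join(unhealthy)}")
            return completed
        completed.extend(batch)
    return completed


# Hypothetical run: the cluster named "c3" fails its health check.
alerts: List[str] = []
done = upgrade_in_batches(
    ["c1", "c2", "c3", "c4"],
    batch_size=2,
    upgrade=lambda cluster: None,
    is_healthy=lambda cluster: cluster != "c3",
    alert=alerts.append,
)
print(done, alerts)
```

Halting at the batch boundary limits the blast radius: only the failing batch needs attention, and later batches are never touched.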
Four-Stage Upgrade Process for MySQL
Uber’s MySQL upgrade from version 5.7 to 8.0 was carefully planned and executed in a four-stage process.
This approach ensured minimal service disruption and allowed Uber to transition its massive data infrastructure safely. Let’s break down the four stages in simple terms:
1. Pre-Maintenance Stage
In the pre-maintenance stage, new MySQL 8.0 nodes were added as replicas to the existing MySQL 5.7 clusters. A "node" here is a server running a MySQL instance.
Because these MySQL 8.0 nodes were added as replicas, they could work alongside the old 5.7 nodes without disrupting any operations.
This setup ensured that the old system (MySQL 5.7) continued functioning normally while the new system (MySQL 8.0) was being integrated, allowing Uber to keep everything running smoothly.
2. System Monitoring (Soak Period)
After setting up the MySQL 8.0 nodes, Uber entered the system monitoring stage, also known as the "soak period." This stage lasted for about a week and was crucial for testing the new system under real-world conditions.
During this time, Uber monitored the MySQL 8.0 nodes as they handled real production traffic (read operations), checking for issues such as slow performance, errors, or increased resource usage.
This period was essential to detect potential problems before making the final switch to MySQL 8.0.
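A soak check of this kind can be reduced to comparing the new replicas' metrics against a 5.7 baseline. The sketch below is a minimal illustration: the metric names, the 10% tolerance, and the sample values are all assumptions, and every metric is assumed to be of the "lower is better" kind.

```python
def soak_regressions(baseline: dict, candidate: dict, tolerance: float = 0.10) -> dict:
    """Return metrics where the candidate is worse than the baseline by more than tolerance.

    Values are relative increases, for example 0.3 means 30% higher than baseline.
    """
    regressions = {}
    for metric, base_value in baseline.items():
        if base_value <= 0:
            continue  # skip metrics with no usable baseline
        increase = (candidate[metric] - base_value) / base_value
        if increase > tolerance:
            regressions[metric] = round(increase, 3)
    return regressions


# Made-up sample values for a 5.7 node and an 8.0 replica.
baseline_57 = {"p99_latency_ms": 20.0, "cpu_pct": 40.0, "errors_per_min": 2.0}
replica_80 = {"p99_latency_ms": 21.0, "cpu_pct": 52.0, "errors_per_min": 2.0}
print(soak_regressions(baseline_57, replica_80))
```

An empty result would let the cluster proceed to promotion, while any entry would hold it in the soak period for investigation.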
3. Maintenance Stage
Once the soak period confirmed that everything was working smoothly, Uber moved to the maintenance stage.
In this phase, the MySQL 8.0 node was promoted to primary status, meaning it now handled all write operations and became the main database for that cluster.
This promotion marked the point where MySQL 8.0 officially became the main database, while the MySQL 5.7 nodes were demoted or turned off for write traffic.
4. Post-Maintenance Stage
Finally, in the post-maintenance stage, Uber removed all the old MySQL 5.7 nodes that were no longer needed.
At this point, the new MySQL 8.0 nodes were fully operational, and all traffic (both read and write) was being handled by the new system.
By completing this step, Uber successfully transitioned to the new version, ensuring that the system was upgraded without any data loss or significant service disruptions.
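The four stages can be read as a small state machine per cluster. The sketch below is one way to model it under stated assumptions: the Stage enum and the advance function are hypothetical, and the rollback behavior simply mirrors the rule that rollback is easy only before an 8.0 node becomes primary.

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    PRE_MAINTENANCE = 1   # 8.0 nodes join as replicas
    SOAK = 2              # 8.0 replicas serve reads and are monitored
    MAINTENANCE = 3       # an 8.0 node is promoted to primary
    POST_MAINTENANCE = 4  # 5.7 nodes are removed


ORDER = [Stage.PRE_MAINTENANCE, Stage.SOAK, Stage.MAINTENANCE, Stage.POST_MAINTENANCE]


def advance(stage: Stage, healthy: bool) -> Optional[Stage]:
    """Return the next stage, or None to signal a rollback to 5.7."""
    if not healthy:
        if stage in (Stage.PRE_MAINTENANCE, Stage.SOAK):
            return None  # 5.7 is still primary, so the 8.0 replicas can be removed
        raise RuntimeError("8.0 is already primary; a simple rollback no longer applies")
    index = ORDER.index(stage)
    return ORDER[min(index + 1, len(ORDER) - 1)]


stage = Stage.PRE_MAINTENANCE
while stage is not Stage.POST_MAINTENANCE:
    stage = advance(stage, healthy=True)
    print("now in", stage.name)
```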
Issues During Upgrade
During the upgrade of Uber’s MySQL infrastructure to version 8.0, several issues were encountered that required careful handling and technical solutions to ensure the system continued to run smoothly.
Here’s a breakdown of the key problems and how they were addressed:
Query Execution Plan Changes
One of the major issues that Uber faced was related to changes in the query execution plans in MySQL 8.0.
A query execution plan is the path the database system uses to retrieve data. In some clusters, MySQL 8.0 chose different paths compared to version 5.7, leading to increased latencies (delays) and higher resource consumption.
These changes could slow down certain operations, affecting the performance of dashboards and other tools that relied on quick access to data. For instance, clusters powering key dashboards at Uber experienced noticeable slowdowns.
Uber worked with Percona, a database consulting company, to develop a patch that optimized the execution plans for the affected clusters. By applying this patch, Uber was able to restore performance and reduce resource consumption, bringing the system back to optimal operation.
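Plan regressions like these can be spotted by running EXPLAIN for the same query on both versions and comparing the results. The sketch below assumes the EXPLAIN rows have already been fetched as lists of dictionaries keyed by column name. The plan_changes helper, the table name, and the index name are hypothetical, and this is not the Percona patch, whose details the source does not describe.

```python
def plan_changes(plan_57: list, plan_80: list, columns=("type", "key")) -> list:
    """Compare two EXPLAIN results table by table.

    Returns (table, column, old_value, new_value) tuples for every difference.
    """
    new_by_table = {row["table"]: row for row in plan_80}
    changes = []
    for old in plan_57:
        new = new_by_table.get(old["table"])
        if new is None:
            continue  # table missing from the 8.0 plan; out of scope for this sketch
        for column in columns:
            if old.get(column) != new.get(column):
                changes.append((old["table"], column, old.get(column), new.get(column)))
    return changes


# Made-up EXPLAIN rows: 5.7 uses an index, 8.0 falls back to a full table scan.
explain_57 = [{"table": "trips", "type": "ref", "key": "idx_city", "rows": 120}]
explain_80 = [{"table": "trips", "type": "ALL", "key": None, "rows": 950000}]
for change in plan_changes(explain_57, explain_80):
    print(change)
```

A change of access type from ref to ALL, as in this made-up example, is the kind of difference that would explain higher latency and resource usage.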
Unsupported Queries and Configurations
MySQL 8.0 introduced new syntax rules and stricter configurations, which caused some queries that worked in MySQL 5.7 to fail after the upgrade.
Specifically, some clusters didn’t have the STRICT_TRANS_TABLES SQL mode enabled, which is a default setting in MySQL 8.0. This mode enforces stricter rules on handling invalid or missing data.
Uber had to carefully adjust configurations and rewrite certain queries to align with MySQL 8.0’s new syntax and rules. For example, they enabled the STRICT_TRANS_TABLES and ONLY_FULL_GROUP_BY modes, which made the system more robust but required changes to some of the legacy queries and applications.
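Before enabling stricter modes, it helps to know which clusters are missing them. The sketch below parses the comma-separated string that MySQL returns for SELECT @@GLOBAL.sql_mode and reports which of the two modes named above are absent. The missing_sql_modes helper is hypothetical, and fetching the value from the server is left out.

```python
REQUIRED_MODES = frozenset({"STRICT_TRANS_TABLES", "ONLY_FULL_GROUP_BY"})


def missing_sql_modes(sql_mode: str, required=REQUIRED_MODES) -> set:
    """Return the required modes that are not present in a sql_mode string."""
    enabled = {mode.strip().upper() for mode in sql_mode.split(",") if mode.strip()}
    return set(required) - enabled


# A permissive legacy setting versus one that already includes both modes.
print(sorted(missing_sql_modes("NO_ENGINE_SUBSTITUTION")))
print(sorted(missing_sql_modes("ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ZERO_DATE")))
```

Clusters that report missing modes are the ones whose legacy queries are most likely to need rewriting, for example a SELECT that lists non-aggregated columns absent from its GROUP BY clause.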
Collation and Character Set Changes
MySQL 8.0 also brought changes to the default character set and collation. The character set controls how text is stored, and the collation determines how text is compared.
In MySQL 5.7, Uber had been using the utf8mb4_unicode_520_ci collation, but MySQL 8.0 uses the new utf8mb4_0900_ai_ci collation as its default.
This change in the default character set and collation caused issues with sorting and comparing text data across different clusters, particularly when dealing with different languages or special characters. The system needed consistency in collation settings to function correctly, but this shift created mismatches.
Uber had to align the collation settings across its systems to ensure all nodes used the same character set and collation. This required detailed configuration changes and testing to ensure compatibility and proper sorting behavior across all clusters.
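Aligning collations starts with an inventory of where they differ. The sketch below assumes the inventory has already been collected as (cluster, table, collation) tuples, for example from the TABLE_COLLATION column of information_schema.TABLES. The collation_mismatches helper and the cluster and table names are hypothetical, and which collation Uber standardized on is not stated in the source.

```python
def collation_mismatches(tables: list, expected: str) -> list:
    """Return the (cluster, table, collation) entries that differ from the expected collation."""
    return [
        (cluster, table, collation)
        for cluster, table, collation in tables
        if collation != expected
    ]


# Made-up inventory mixing the two collations discussed above.
inventory = [
    ("cluster-a", "riders", "utf8mb4_unicode_520_ci"),
    ("cluster-a", "trips", "utf8mb4_0900_ai_ci"),
    ("cluster-b", "drivers", "utf8mb4_unicode_520_ci"),
]
for entry in collation_mismatches(inventory, expected="utf8mb4_unicode_520_ci"):
    print(entry)
```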
Client Library Incompatibility
Many client libraries that Uber used to connect to the MySQL database were not initially compatible with MySQL 8.0. Client libraries are essential for applications to communicate with the database, and outdated versions of these libraries did not support some of the new features and functions introduced in MySQL 8.0.
Without updating these libraries, Uber’s applications couldn’t fully utilize the benefits of MySQL 8.0, and some applications experienced failures or errors when trying to connect to the upgraded database.
Uber upgraded these client libraries across its systems. This process involved rigorous testing in a staging environment to ensure that all client libraries worked properly with MySQL 8.0 before the full upgrade. Once the testing was complete, the libraries were deployed in production, ensuring a smooth transition.
Improvements After The Upgrade
The upgrade to MySQL 8.0 brought significant performance improvements to Uber’s infrastructure, both on the server side and client side.
Let’s look at both.
Server-Side Performance:
29% improvement in p99 latency for inserts: At high concurrency levels (i.e., when many operations were happening simultaneously), the latency for insert operations improved by 29%, allowing Uber to handle more data input efficiently.
33% improvement in read latency: Queries that required reading data from the database saw a 33% reduction in latency, meaning data retrieval became much faster.
47% improvement in update latency: Similarly, latency for update operations dropped by 47%, enhancing the overall responsiveness of the system under heavy loads.
Client-Side Performance:
94% reduction in database lock time: The upgrade dramatically reduced the time the system spent waiting for locks on database resources, leading to more efficient transaction processing.
78% reduction in query time for certain queries: Some queries saw a significant 78% reduction in execution time, allowing Uber’s applications to run more smoothly and respond quicker to user requests.
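Percentage reductions are easier to interpret when applied to a concrete number. The sketch below uses a hypothetical 100 ms baseline, which is not a figure from Uber, to show what the reported percentages mean in practice.

```python
def after_reduction(before: float, reduction_pct: float) -> float:
    """Value that remains after reducing `before` by reduction_pct percent."""
    return before * (100 - reduction_pct) / 100


def times_smaller(reduction_pct: float) -> float:
    """How many times smaller the new value is compared with the old one."""
    return 100 / (100 - reduction_pct)


BASELINE_MS = 100.0  # hypothetical baseline, for illustration only
for label, pct in [("insert p99", 29), ("read", 33), ("update", 47), ("lock time", 94)]:
    remaining = after_reduction(BASELINE_MS, pct)
    print(f"{label}: {BASELINE_MS:.0f} ms -> {remaining:.0f} ms ({times_smaller(pct):.1f}x smaller)")
```

With this baseline, a 29% improvement leaves 71 ms, while a 94% reduction in lock time means lock waits are roughly 16 to 17 times shorter.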
Conclusion
Through careful planning, automation, and a phased rollout strategy, Uber successfully transitioned its vast data systems with minimal downtime and disruption.
The new version brought significant benefits in terms of performance, security, and functionality, helping Uber improve its operational efficiency and user experience.
Some key learnings are as follows:
Automation is Critical: Given the scale of Uber’s MySQL infrastructure, automating the upgrade process was essential to reduce human error and ensure efficiency.
Thorough Testing: Extensive testing, including regression checks and system validation, was necessary to identify and resolve issues before the full production rollout, ensuring that existing applications remained compatible.
Rollback Mechanisms: Building automated rollback mechanisms proved vital to maintain service continuity and prevent data loss in case of unexpected issues during the upgrade.
Collaboration: Working with partners like Percona helped Uber quickly resolve specific issues, such as query execution plan changes and performance bottlenecks.
References: