Stop releasing bugs with fully automated end-to-end test coverage (Sponsored)
Bugs sneak out when less than 80% of user flows are tested before shipping. But how do you get that kind of coverage? You either spend years scaling in-house QA — or you get there in just 4 months with QA Wolf.
How's QA Wolf different?
They don't charge hourly.
They guarantee results.
They provide all of the tooling and (parallel run) infrastructure needed to run a 15-minute QA cycle.
Have you ever seen a database that fails and comes up again in the blink of an eye?
PayPal’s JunoDB is a database capable of doing so. As per PayPal’s claim, JunoDB can run at 6 nines of availability (99.9999%). This comes to just 86.40 milliseconds of downtime per day.
For reference, our average eye blink takes around 100-150 milliseconds.
While the statistics are certainly amazing, it also means that there are many interesting things to pick up from JunoDB’s architecture and design.
In this post, we will cover the following topics:
JunoDB’s Architecture Breakdown
How JunoDB achieves scalability, availability, performance, and security
Use cases of JunoDB
Key Facts about JunoDB
Before we go further, here are some key facts about JunoDB that can help us develop a better understanding of it.
JunoDB is a distributed key-value store. Think of a key-value store as a dictionary where you look up a word (the “key”) to find its definition (the “value”).
JunoDB leverages a highly concurrent architecture implemented in Go to efficiently handle hundreds of thousands of connections.
At PayPal, JunoDB serves almost 350 billion daily requests and is used in every core backend service, including critical functionalities like login, risk management, and transaction processing.
PayPal primarily uses JunoDB for caching to reduce the load on the main source-of-truth database. However, there are also other use cases that we will discuss in a later section.
The diagram shows how JunoDB fits into the overall scheme of things at PayPal.
Why the Need for JunoDB?
One common question surrounding the creation of something like JunoDB is this:
“Why couldn’t PayPal just use something off-the-shelf like Redis?”
The reason is PayPal wanted multi-core support for the database and Redis is not designed to benefit from multiple CPU cores. It is single-threaded in nature and utilizes only one core. Typically, you need to launch several Redis instances to scale out on several cores if needed.
Incidentally, JunoDB started as a single-threaded C++ program and the initial goal was to use it as an in-memory short TTL data store.
For reference, TTL stands for Time to Live. It specifies the maximum duration a piece of data should be retained or the maximum time it is considered valid.
However, the goals for JunoDB evolved with time.
First, PayPal wanted JunoDB to work as a persistent data store supporting long TTLs.
Second, JunoDB was also expected to provide improved data security via on-disk encryption and TLS in transit by default.
These goals meant that JunoDB had to be CPU-bound rather than memory-bound.
For reference, “memory-bound” and “CPU-bound” refer to different performance aspects in computer programs. As the name suggests, the performance of memory-bound programs is limited by the amount of available memory. On the other hand, CPU-bound programs depend on the processing power of the CPU.
For example, Redis is memory-bound. It primarily stores the data in RAM and everything about it is optimized for quick in-memory access. The limiting factor for the performance of Redis is memory rather than CPU.
However, requirements like encryption are CPU-intensive because many cryptographic algorithms require raw processing power to carry out complex mathematical calculations.
As a result, PayPal decided to rewrite the earlier version of JunoDB in Go to make it multi-core friendly and support high concurrency.
The Architecture of JunoDB
The below diagram shows the high-level architecture of JunoDB.
Let’s look at the main components of the overall design.
1 - JunoDB Client Library
The client library is part of the client application and provides an API for storing and retrieving data via the JunoDB proxy.
It is implemented in several programming languages such as Java, C++, Python, and Golang to make it easy to use across different application stacks.
For developers, it’s just a matter of picking the library for their respective programming language and including it in the application to carry out the various operations.
2 - JunoDB Proxy with Load Balancer
JunoDB utilizes a proxy-based design where the proxy connects to all JunoDB storage server instances.
This design has a few important advantages:
The complexity of determining which storage server should handle a query is kept out of the client libraries. Since JunoDB is a distributed data store, the data is spread across multiple servers. The proxy handles the job of directing the requests to the correct server.
The proxy is also aware of the JunoDB cluster configuration (such as shard mappings) stored in the ETCD key-value store.
But can the JunoDB proxy turn into a single point of failure?
To prevent this possibility, the proxy runs on multiple instances downstream to a load balancer. The load balancer receives incoming requests from the client applications and routes the requests to the appropriate proxy instance.
3 - JunoDB Storage Servers
The last major component in the JunoDB architecture is the storage servers.
These are instances that accept the operation requests from the proxy and store data in the memory or persistent storage.
Each storage server is responsible for a set of partitions or shards for an efficient distribution of data.
Internally, JunoDB uses RocksDB as the storage engine. Using an off-the-shelf storage engine like RocksDB is common in the database world to avoid building everything from the ground up. For reference, RocksDB is an embedded key-value storage engine that is optimized for high read and write throughput.
Key Priorities of JunoDB
Now that we have looked at the overall design and architecture of JunoDB, it’s time to understand a few key priorities for JunoDB and how it achieves them.
Scalability
Several years ago, PayPal transitioned to a horizontally scalable microservice-based architecture to support the rapid growth in active customers and payment rates.
While microservices solve many problems for them, they also have some drawbacks.
One important drawback is the increased number of persistent connections to key-value stores due to scaling out the application tier. JunoDB handles this scaling requirement in two primary ways.
1 - Scaling for Client Connections
As discussed earlier, JunoDB uses a proxy-based architecture.
If client connections to the database reach a limit, additional proxies can be added to support more connections.
There is an acceptable trade-off with latency in this case.
2 - Scaling for Data Volume and Throughput
The second type of scaling requirement is related to the growth in data size.
To ensure efficient storage and data fetching, JunoDB supports partitioning based on the consistent hashing algorithm. Partitions (or shards) are distributed to physical storage nodes using a shard map.
Consistent hashing is very useful in this case because when the nodes in a cluster change due to additions or removals, only a minimal number of shards require reassignment to different storage nodes.
PayPal uses a fixed number of shards (1024 shards, to be precise), and the shard map is pre-generated and stored in ETCD storage.
Any change to the shard mapping triggers an automatic data redistribution process, making it easy to scale your JunoDB cluster depending on the need.
The below diagram shows the process in more detail.
Availability
High availability is critical for PayPal. You can’t have a global payment platform going down without a big loss of reputation.
However, outages can and will occur due to various reasons such as software bugs, hardware failures, power outages, and even human error. Failures can lead to data loss, slow response times, or complete unavailability.
To mitigate these challenges, JunoDB relies on replication and failover strategies.
1 - Within-Cluster Replication
In a cluster, JunoDB storage nodes are logically organized into a grid. Each column represents a zone, and each row signifies a storage group.
Data is partitioned into shards and assigned to storage groups. Within a storage group, each shard is synchronously replicated across various zones based on the quorum protocol.
The quorum-based protocol is the key to reaching a consensus on a value within a distributed database. You’ve two quorums:
The Read Quorum: When a client wants to read data, it needs to receive responses from a certain number of zones (known as the read quorum). This is to make sure that it gets the most up-to-date data.
The Write Quorum: When the client wants to write data, it must receive acknowledgment from a certain number of zones to make sure that the data is written to a majority of the zones.
There are two important rules when it comes to quorum.
The sum of the read quorum and write quorum must be greater than the number of zones. If that’s not the case, the client may end up reading outdated data. For example, if there are 5 zones with read quorum as 2 and write quorum as 3, a client can write data to 3 zones but another client may read from the 2 zones that have not yet received the updated data.
The write quorum must be more than half the number of zones to prevent two concurrent write operations on the same key. For example, if there is a JunoDB cluster with 5 zones and a write quorum of 2, client A may write value X to key K and is considered successful when 2 zones acknowledge the request. Similarly, client B may write value Y to the same key K and is also successful when two different zones acknowledge the request. Ultimately, the data for key K is in an inconsistent state.
In production, PayPal has a configuration with 5 zones, a read quorum of 3, and a write quorum of 3.
Lastly, the failover process in JunoDB is automatic and instantaneous without any need for leader re-election or data redistribution. Proxies can know about a node failure through a lost connection or a read request that has timed out.
2 - Cross-data center replication
Cross-data center replication is implemented by asynchronously replicating data between the proxies of each cluster across different data centers.
This is important to make sure that the system continues to operate even if there’s a catastrophic failure at one data center.
Performance
One of the critical goals of JunoDB is to deliver high performance at scale.
This translates to maintaining single-digit millisecond response times while providing a great user experience.
The below graphs shared by PayPal show the benchmark results demonstrating JunoDB’s performance in the case of persistent connections and high throughput.
Security
Being a trusted payment processor, security is paramount for PayPal.
Therefore, it’s no surprise that JunoDB has been designed to secure data both in transit and at rest.
For transmission security, TLS is enabled between the client and proxy as well as proxies in different data centers used for replication.
Payload encryption is performed at the client or proxy level to prevent multiple encryptions of the same data. The ideal approach is to encrypt the data on the client side but if it’s not done, the proxy figures it out through a metadata flag and carries out the encryption.
All data received by the storage server and stored in the engine are also encrypted to maintain security at rest.
A key management module is used to manage certificates for TLS, sessions, and the distribution of encryption keys to facilitate key rotation,
The below diagram shows JunoDB’s security setup in more detail.
Use Cases of JunoDB
With PayPal having made JunoDB open-source, it’s possible that you can also use it within your projects.
There are various use cases where JunoDB can help. Let’s look at a few important ones:
1 - Caching
You can use JunoDB as a temporary cache to store data that doesn’t change frequently.
Since JunoDB supports both short and long-lived TTLs, you can store data from a few seconds to a few days. For example, a use case is to store short-lived tokens in JunoDB instead of fetching them from the database.
Other items you can cache in JunoDB are user preferences, account details, and API responses.
2 - Idempotency
You can also use JunoDB to implement idempotency.
An operation is idempotent when it produces the same result even when applied multiple times. With idempotency, repeating the operation is safe and you don’t need to be worried about things like duplicate payments getting applied.
PayPal uses JunoDB to ensure they don’t process a particular payment multiple times due to retries. JunoDB’s high availability makes it an ideal data store to keep track of processing details without overloading the main database.
3 - Counters
Let’s say you’ve certain resources that aren’t available for some reason or they have an access limit to their usage. For example, these resources can be database connections, API rate limits, or user authentication attempts.
You can use JunoDB to store counters for these resources and track whether their usage exceeds the threshold.
4 - Latency Bridging
As we discussed earlier, JunoDB provides fast inter-cluster replication. This can help you deal with slow replication in a more traditional setup.
For example, in PayPal’s case, they run Oracle in Active-Active mode, but the replication usually isn’t as fast as they would like for their requirement.
It means there are chances of inconsistent reads if records written in one data center are not replicated in the second data center and the first data center goes down.
JunoDB can help bridge the latency where you can write to Data Center A (both Oracle and JunoDB) and even if it goes down, you can read the updates consistently from the JunoDB instance in Data Center B.
See the below diagram for a better understanding of this concept.
Conclusion
JunoDB is a distributed key-value store playing a crucial role in various PayPal applications. It provides efficient data storage for fast access to reduce the load on costly database solutions.
While doing so, it also fulfills critical requirements such as scalability, high availability with performance, consistency, and security.
Due to its advantages, PayPal has started using JunoDB in multiple use cases and patterns. For us, it provides a great opportunity to learn about an exciting new database system.
References:
Very interesting article and quite fascinating as well!
I wonder how it compares to AWS DynamoDB, being also a key-value datastore?
What kind of needs would make you choose one over the other?
Great post!
Thank you for including JunoDB in the bytebytego newsletter. Your support means a lot to us!