Disclaimer: The details in this post have been derived from the articles/presentations made by the Netflix engineering team. All credit for the architectural details goes to the Netflix engineering team. The links to the original articles are present in the references section at the end of the post. We’ve attempted to analyze the details and provide our input about them. If you find any inaccuracies or omissions, please leave a comment, and we will do our best to fix them.
As a subscription-based streaming service, Netflix's primary revenue source is its membership business. With a staggering 238 million members worldwide, managing memberships efficiently is crucial for the company's success and continued growth.
The membership platform at Netflix plays a vital role in handling the entire lifecycle of a user's subscription.
The membership lifecycle consists of various stages and scenarios:
Signup: A user may begin their Netflix journey by signing up for the service, either directly or through partner channels such as T-Mobile and others.
Plan Changes: Existing members can modify their subscription plans according to their preferences and needs.
Renewal: As a subscription service, Netflix automatically attempts to renew a user’s plan using the payment method associated with the account.
Payment Issues: In case of problems with the payment gateway or insufficient funds, a user’s account may be put on hold or granted a grace period to resolve the issue.
Membership Pause or Cancellation: Users have the option to temporarily pause their membership or permanently cancel their subscriptions.
In the following sections, we will explore the architectural decisions made by the Netflix engineering team to support the various capabilities and scalability of its membership platform.
The High-Level Architecture of Netflix Membership Platform
Before diving into the details of Netflix's membership platform, let's take a step back and examine how the company's original pricing architecture was designed.
In the early days, Netflix's pricing model was relatively straightforward, with only a handful of plans to manage and basic functionality to support.
To meet these initial requirements, Netflix employed a lightweight, in-memory library. This approach proved to be quite efficient, as the limited scope of the pricing system allowed for a simple and streamlined design.
The diagram below illustrates this basic architecture:
As Netflix expanded its global presence and diversified its offerings, the lightweight, in-memory library that initially served the pricing architecture became insufficient.
The growing complexity and scope of the pricing catalog, coupled with its increasing importance across multiple applications, led to operational challenges. The library's size and dependencies made it difficult to maintain and scale, necessitating a transition to a more robust and scalable architecture.
The diagram below shows the high-level architecture of Netflix’s modern membership platform:
The membership platform consists of a dozen microservices and is designed to support four nines (99.99%) availability.
This high availability requirement originates from the platform's critical role in various user-facing flows. If any of the services experience downtime, it can directly impact the user experience.
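To make the four-nines target concrete, the short calculation below converts an availability figure into a downtime budget. It is plain arithmetic, not something taken from Netflix's tooling, and the function name is only illustrative.

```python
# Convert an availability target into an allowed-downtime budget.
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(availability: float, days: int) -> float:
    """Minutes of downtime allowed over `days` at the given availability."""
    return (1.0 - availability) * days * MINUTES_PER_DAY


per_year = downtime_budget_minutes(0.9999, 365)
per_30_days = downtime_budget_minutes(0.9999, 30)
print(f"{per_year:.2f} min/year, {per_30_days:.2f} min per 30 days")
```

At 99.99%, the budget works out to roughly 52.56 minutes per year, or about 4.32 minutes in a 30-day window.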
The platform supports several key functionalities:
When a user hits the play button on Netflix, a direct call is made to the membership systems to determine the quality of service associated with their plan. Factors such as the allowed concurrent streams and supported devices for the user are considered. This flow handles the highest traffic volume, as Netflix serves billions of streaming requests every day.
The membership flow is triggered when a user accesses their account page. Actions like changing plans, managing extra members, and cancellations directly interact with the membership services.
The platform serves as the authoritative source for the total membership count at any given point in time. It emits events and writes to a persistent store, which is consumed by downstream analytics systems within and outside Netflix.
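The playback check in the first point can be pictured as a lookup of the member's plan followed by two tests: is the device supported, and is there a free concurrent stream? The sketch below is a simplified illustration. The plan names, limits, device labels, and the `can_start_playback` function are all assumptions made for the example, not Netflix's actual data or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    max_concurrent_streams: int
    supported_devices: frozenset


# Hypothetical plans: the source does not list real plan names or limits.
PLANS = {
    "basic": Plan("basic", 1, frozenset({"phone", "tablet"})),
    "premium": Plan("premium", 4, frozenset({"phone", "tablet", "tv", "browser"})),
}


def can_start_playback(plan_id: str, device: str, active_streams: int) -> bool:
    """Return True if the plan allows one more stream on this device."""
    plan = PLANS.get(plan_id)
    if plan is None:
        return False  # unknown plan: deny rather than guess
    device_ok = device in plan.supported_devices
    stream_available = active_streams < plan.max_concurrent_streams
    return device_ok and stream_available
```

Because this path carries the highest traffic, a real system would serve it from cached plan data rather than a database read on every request.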
The key points about the architecture diagram are as follows:
The membership platform manages the membership plan and pricing catalog globally, with variations across different regions. The Plan Pricing Catalog Service handles rule management based on location-specific offerings.
Two CockroachDB databases are utilized to store plan pricing and code redemption information. The Member Pricing Service supports member actions, such as changing plans or adding extra members.
A dedicated microservice handles partner interactions, including bundle activations, signups, and integration with platforms like Apple's App Store.
Membership data is stored in Cassandra databases, which support the Subscription Service and History Tracking Service.
The platform not only caters to the 238 million active memberships but also focuses on former members and rejoin experiences.
The data generated by the platform is shipped to downstream consumers for deriving insights on signups and revenue projections.
Netflix’s choice of using CockroachDB and Cassandra is interesting.
While CockroachDB provides strong consistency, making it suitable for critical data such as the plan pricing information, Cassandra is a highly scalable NoSQL database for handling large volumes of membership data.
The 99.99% availability target also indicates a strong focus on resilience and fault tolerance. Netflix is well known for its chaos engineering practices, which proactively test the system's resilience.
The Signup Process Flow
Once users embark on their Netflix journey, they encounter plan selection options.
Rendering the plan selection page accurately is of utmost importance due to the geographical variations in currency, pricing, and available plans. Netflix's membership platform ensures that users are presented with the appropriate options based on their location and device type.
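As a rough illustration of region and device rules, the sketch below filters a small in-memory catalog by country and device type. The plans, prices, currencies, and the `plans_for` helper are invented for the example. The real catalog lives in CockroachDB and its rule format is not described in the source.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class PlanOffer:
    plan: str
    country: str
    currency: str
    price: Decimal  # Decimal avoids float rounding on money
    devices: frozenset


# Hypothetical catalog entries; the values are placeholders, not real prices.
CATALOG = [
    PlanOffer("basic", "US", "USD", Decimal("9.99"), frozenset({"phone", "tv", "browser"})),
    PlanOffer("premium", "US", "USD", Decimal("19.99"), frozenset({"phone", "tv", "browser"})),
    PlanOffer("mobile", "IN", "INR", Decimal("149"), frozenset({"phone"})),
    PlanOffer("basic", "IN", "INR", Decimal("199"), frozenset({"phone", "tv", "browser"})),
]


def plans_for(country: str, device: str) -> list:
    """Return the offers available in `country` that support `device`."""
    return [o for o in CATALOG if o.country == country and device in o.devices]


for offer in plans_for("IN", "phone"):
    print(offer.plan, offer.price, offer.currency)
```

A user on a TV in the same country would see only the offers whose device set includes "tv", which is how one catalog can serve different plan pages.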
The diagram below shows the detailed steps involved in the Netflix signup process and the services that are triggered during the flow:
Here’s a look at each step in more detail.
The journey begins with users selecting a plan through Netflix's growth engineering apps. The plan details are retrieved from the Membership Plan Catalog Service, which is backed by CockroachDB.
The Membership Plan Catalog Service loads and reads the plans based on predefined region and device type rules.
The retrieved plans are then presented to the user, allowing them to make an informed decision based on their preferences and budget.
Once the user chooses a plan, the flow progresses to the payment confirmation screen. Here, users provide their payment details and confirm their subscription.
Upon confirmation, the user clicks the "Start Membership" button, triggering the Membership State Service. This service persists the relevant information, such as the selected plan, price tier, and country, into the Cassandra database.
The Membership State Service also notifies the Billing Engineering Apps about the payment.
The Billing Engineering Apps generate an invoice based on the signup data obtained from the Membership Pricing Service.
The membership data is simultaneously written to the Membership History Service, ensuring a comprehensive record of the user's subscription history.
Events are published to signal the activation of the membership. These events trigger messaging pipelines responsible for sending welcome emails to the user and informing downstream systems for analytics purposes.
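The steps that follow the "Start Membership" click can be sketched as a single function that fans out to several stores. Everything below is an in-memory stand-in: the class, method, and event names are assumptions for illustration, and the real services are separate processes that communicate over the network.

```python
from dataclasses import dataclass, field


@dataclass
class SignupFlow:
    state_store: dict = field(default_factory=dict)            # stand-in for Cassandra
    billing_notifications: list = field(default_factory=list)  # stand-in for billing apps
    history: list = field(default_factory=list)                # stand-in for history service
    events: list = field(default_factory=list)                 # stand-in for the event stream

    def start_membership(self, member_id: str, plan: str, price_tier: str, country: str) -> dict:
        record = {"plan": plan, "price_tier": price_tier, "country": country}
        # 1. Persist the membership state.
        self.state_store[member_id] = record
        # 2. Notify billing so an invoice can be generated.
        self.billing_notifications.append((member_id, dict(record)))
        # 3. Record the change in the membership history.
        self.history.append((member_id, "SIGNUP", dict(record)))
        # 4. Publish an activation event for messaging and analytics.
        self.events.append({"type": "MEMBERSHIP_ACTIVATED", "member_id": member_id})
        return record


flow = SignupFlow()
flow.start_membership("m1", "basic", "tier_1", "US")
print(flow.events)
```

The sketch ignores partial failures. In a distributed setting, each of the four writes can fail independently, which is one reason offline reconciliation jobs matter.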
How Is Member History Tracked?
In the early stages of Netflix's membership platform, member history and data were tracked through application-level events.
While this approach sufficed initially, it became evident that a more granular and persistent data tracking solution was necessary as Netflix expanded and the complexity of member data increased.
To address this need, Netflix developed a robust solution based on the Change Data Capture (CDC) pattern.
For reference, CDC is a design pattern that directly captures changes made to a database and propagates those changes to downstream systems for further processing or analysis.
The diagram below shows how the CDC process works:
Adopting a CDC-like approach ensures that all delta changes made to the membership data sources are recorded in an append-only log system, which is backed by a Cassandra database.
The diagram below shows the flow of historical data in Netflix’s membership platform:
Let’s walk through the steps in this process:
Suppose there is an update request to modify the billing partner for a member. The request is received by the Membership Subscription Service.
The Membership Subscription Service processes the request and updates the relevant information in the Cassandra database.
In addition to updating the primary database, the updated data for the user is appended to the Membership History Service. This service is responsible for maintaining a historical record of all changes made to membership data.
The Membership History Service takes the appended data and inserts it into the Member History Table. This table serves as a persistent store for the historical data.
Finally, an event is emitted to notify downstream systems about the membership update. This allows other services and processes to react to the change and perform any necessary actions.
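The update path above can be sketched with three in-memory structures: the current state, an append-only history, and an outgoing event list. The class and field names are hypothetical, and the sketch only mirrors the order of the steps, not the real services.

```python
import itertools


class MembershipStore:
    def __init__(self):
        self.current = {}   # stand-in for the primary membership table
        self.history = []   # append-only: entries are never updated or deleted
        self.events = []    # stand-in for events sent to downstream systems
        self._seq = itertools.count(1)

    def update(self, member_id: str, field_name: str, new_value: str) -> dict:
        record = self.current.setdefault(member_id, {})
        old_value = record.get(field_name)
        # Update the primary record.
        record[field_name] = new_value
        # Append the delta to the history log.
        entry = {
            "seq": next(self._seq),
            "member_id": member_id,
            "field": field_name,
            "old": old_value,
            "new": new_value,
        }
        self.history.append(entry)
        # Emit an event so downstream systems can react.
        self.events.append(
            {"type": "MEMBERSHIP_UPDATED", "member_id": member_id, "seq": entry["seq"]}
        )
        return entry


store = MembershipStore()
store.update("m1", "billing_partner", "partner_a")
store.update("m1", "billing_partner", "partner_b")
print(store.history[-1])
```

Keeping both the old and new value in each entry is a design choice made for this example. It makes the log easier to read during debugging at the cost of some extra storage.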
There are multiple benefits to this design:
Detailed Debugging: By maintaining a comprehensive history of membership data changes, the system enables detailed debugging and troubleshooting. Developers can trace the sequence of events and understand how the data evolved.
Event Replay and Reconciliation: The append-only nature of the log system allows for the ability to replay events. In case of data corruption or inconsistencies, the system can reconcile the data by replaying the events from a known good state.
Customer Service Analysis: The historical data captured by the Membership History Service shows what changed on a member's account and when, which simplifies customer service analysis.
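Event replay and reconciliation can be illustrated by rebuilding state from an append-only log and comparing it with the live state. This is a minimal sketch with made-up entries and helper names (`replay`, `find_mismatches`). It assumes each log entry carries a sequence number, a member ID, a field name, and the new value.

```python
def replay(history, snapshot=None):
    """Rebuild member state by applying log entries in sequence order."""
    state = {member: dict(fields) for member, fields in (snapshot or {}).items()}
    for entry in sorted(history, key=lambda e: e["seq"]):
        state.setdefault(entry["member_id"], {})[entry["field"]] = entry["new"]
    return state


def find_mismatches(expected, actual):
    """Return the member IDs whose records differ between two states."""
    members = set(expected) | set(actual)
    return sorted(m for m in members if expected.get(m) != actual.get(m))


# Hypothetical log entries.
HISTORY = [
    {"seq": 1, "member_id": "m1", "field": "plan", "new": "basic"},
    {"seq": 2, "member_id": "m1", "field": "billing_partner", "new": "partner_a"},
    {"seq": 3, "member_id": "m2", "field": "plan", "new": "premium"},
    {"seq": 4, "member_id": "m1", "field": "plan", "new": "premium"},
]

rebuilt = replay(HISTORY)
# The live store missed the last plan change for m1.
live = {"m1": {"plan": "basic", "billing_partner": "partner_a"}, "m2": {"plan": "premium"}}
print(find_mismatches(rebuilt, live))
```

Passing a `snapshot` lets the replay start from a known good state instead of the beginning of the log, which keeps the work bounded as the history grows.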
Technical Footprint of the Netflix Membership Platform
The technical landscape of the Netflix membership platform can be broadly categorized into two main areas: development and operations/monitoring.
Let's look at each area in detail.
Development Stack
The development stack of Netflix’s membership platform can be described by the following key points:
Netflix's architecture is optimized for handling high read requests per second (RPS) to support its massive user base.
The membership platform comprises over 12 microservices that communicate using gRPC, which runs over HTTP/2. On a typical day, the platform handles 3-4 million requests per second. To support this volume, Netflix employs techniques like client-side caching at the gRPC level and in-memory caching of entire records to prevent CockroachDB from becoming a single point of failure.
The primary programming language used in the membership platform is Java with Spring Boot. However, in certain rewrite scenarios, Netflix is gradually transitioning to Kotlin.
Kafka plays a key role in message passing and interfacing with other teams, such as messaging and downstream analytics. This ensures smooth communication and data flow across different systems.
Netflix utilizes Spark and Flink for offline reconciliation tasks on their big data. These reconciliation jobs are crucial for maintaining data consistency and alignment between various systems of record within the membership platform, such as subscriptions and member history databases. The accuracy of data also extends to external systems, ensuring a consistent state across the entire ecosystem.
To ensure data consistency in online systems, Netflix uses lightweight transactions in databases like Cassandra. This approach helps preserve the integrity and reliability of data across different services.
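A lightweight transaction is a conditional write: the change is applied only if the row is in the expected state. In Cassandra's CQL this is expressed with an `IF` clause on an `INSERT` or `UPDATE`. The sketch below imitates that compare-and-set behaviour with a lock around an in-memory dictionary. The class and method names are hypothetical, and it does not model Cassandra's consensus protocol.

```python
import threading


class MembershipTable:
    def __init__(self):
        self._rows = {}
        self._lock = threading.Lock()

    def insert_if_not_exists(self, member_id: str, plan: str) -> bool:
        """Create the row only if it is absent; report whether it was applied."""
        with self._lock:
            if member_id in self._rows:
                return False
            self._rows[member_id] = plan
            return True

    def update_if(self, member_id: str, expected_plan: str, new_plan: str) -> bool:
        """Change the plan only if the stored value matches `expected_plan`."""
        with self._lock:
            if member_id not in self._rows or self._rows[member_id] != expected_plan:
                return False
            self._rows[member_id] = new_plan
            return True

    def get(self, member_id: str):
        with self._lock:
            return self._rows.get(member_id)


table = MembershipTable()
table.insert_if_not_exists("m1", "basic")
first = table.update_if("m1", "basic", "premium")   # applied
second = table.update_if("m1", "basic", "standard")  # rejected: plan is no longer "basic"
print(first, second, table.get("m1"))
```

The rejected second write shows the point of the pattern: a caller working from stale data cannot silently overwrite a newer change.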
Operations and Monitoring
Netflix places a strong emphasis on observability and monitoring to ensure the smooth operation of its membership platform:
Extensive logging, dashboards, and distributed tracing mechanisms enable rapid error detection and resolution. In the complex microservice landscape of Netflix, these tools are essential for identifying and troubleshooting issues.
Production alerts are set up to track operational metrics and guarantee optimal service levels.
Operational data is used to train machine learning models that improve anomaly detection and enable automated issue resolution. The goal is to maintain an uninterrupted streaming experience for users.
Netflix utilizes tools like Kibana and Elasticsearch to create dashboards and analyze log data. In case of a spike in error rates, these dashboards allow the team to quickly identify the specific endpoint causing the issue and take corrective action.
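The kind of question such a dashboard answers ("which endpoint is behind the error spike?") can be sketched as a per-endpoint error-rate calculation over log entries. The entry format, the 5% threshold, and the function names are assumptions for illustration. In practice this aggregation would run inside the logging stack rather than in application code.

```python
from collections import Counter


def error_rates(log_entries):
    """Return the fraction of 5xx responses per endpoint."""
    totals = Counter()
    errors = Counter()
    for entry in log_entries:
        totals[entry["endpoint"]] += 1
        if entry["status"] >= 500:
            errors[entry["endpoint"]] += 1
    return {endpoint: errors[endpoint] / totals[endpoint] for endpoint in totals}


def endpoints_over_threshold(log_entries, threshold=0.05):
    """List endpoints whose error rate exceeds `threshold`, worst first."""
    rates = error_rates(log_entries)
    offenders = [endpoint for endpoint, rate in rates.items() if rate > threshold]
    return sorted(offenders, key=lambda endpoint: rates[endpoint], reverse=True)


# Hypothetical log sample.
sample = (
    [{"endpoint": "/plans", "status": 200}] * 98
    + [{"endpoint": "/plans", "status": 503}] * 2
    + [{"endpoint": "/signup", "status": 200}] * 40
    + [{"endpoint": "/signup", "status": 500}] * 10
)
print(endpoints_over_threshold(sample))
```

In this sample, "/plans" stays under the threshold at 2% while "/signup" is flagged at 20%.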
Conclusion
In conclusion, Netflix's membership platform is a critical component of the company's success, enabling it to manage the entire lifecycle of a user's subscription. The platform has evolved from a simple, lightweight library to a robust, scalable architecture that can handle millions of requests per second.
Some key takeaways to remember are as follows:
The membership platform is responsible for managing user signups, plan changes, renewals, and cancellations.
It utilizes a microservices architecture, with CockroachDB storing plan pricing and code redemption data and Cassandra storing membership data.
The platform captures and stores historical changes to membership data using CDC for debugging, event replay, and analytics.
References: