EP114: 7 Must-know Strategies to Scale Your Database

Jun 01, 2024

This week’s system design refresher:

7 must-know strategies to scale your database
How do we retry on failures?
Reddit’s Core Architecture
Everything You Need to Know About Cross-Site Scripting (XSS)
SPONSOR US

7 must-know strategies to scale your database

Indexing
Check the query patterns of your application and create the right indexes.
Materialized Views
Pre-compute complex query results and store them for faster access.
Denormalization
Store redundant copies of data so that reads need fewer complex joins, improving query performance at the cost of extra write and storage overhead.
Vertical Scaling
Boost your database server by adding more CPU, RAM, or storage.
Caching
Store frequently accessed data in a faster storage layer to reduce database load.
Replication
Create replicas of your primary database on different servers to scale reads.
Sharding
Split your database tables into smaller pieces and spread them across servers. This scales writes as well as reads.
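
To make the sharding idea concrete, here is a minimal sketch of hash-based shard routing. The shard names and the `shard_for` helper are hypothetical, and real systems often use consistent hashing or a lookup service instead so that adding a shard does not remap most keys.

```python
import hashlib

# Hypothetical shard names; in practice these would be connection strings.
SHARDS = ["db-shard-0", "db-shard-1", "db-shard-2", "db-shard-3"]


def shard_for(key: str, shards=SHARDS) -> str:
    """Map a row key to one shard using a stable hash.

    hashlib is used instead of the built-in hash(), which is randomized
    per process for strings and would route inconsistently across servers.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


# The same key always lands on the same shard, so reads and writes
# for one user go to a single server.
print(shard_for("user:42") == shard_for("user:42"))  # True
```

Because every read and write for a key goes to one shard, both kinds of traffic are spread across servers, which is why sharding helps with writes where replication alone does not.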

Over to you: What other strategies do you use for scaling your databases?

How do we retry on failures?

In distributed systems and networked applications, retry strategies are crucial for handling transient errors and network instability effectively. The diagram shows 4 common retry strategies.

Linear Backoff
Linear backoff involves waiting for an interval that increases by a fixed amount between retry attempts (for example, 1 second, then 2 seconds, then 3 seconds).

Advantages: Simple to implement and understand.

Disadvantages: May not be ideal under high load or in high-concurrency environments as it could lead to resource contention or "retry storms".
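
As a minimal sketch of this schedule, assuming a hypothetical base delay of 1 second and a fixed step of 1 second (neither value comes from the diagram):

```python
def linear_backoff_delays(max_retries: int, base: float = 1.0, step: float = 1.0) -> list[float]:
    """Return the wait in seconds before each retry attempt.

    The delay grows by the same fixed `step` every time:
    base, base + step, base + 2 * step, ...
    `base` and `step` are illustrative defaults, not recommended values.
    """
    return [base + step * attempt for attempt in range(max_retries)]


print(linear_backoff_delays(4))  # [1.0, 2.0, 3.0, 4.0]
```

Every client that fails at the same moment computes the same delays, which is how the "retry storms" mentioned above can arise.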

Linear Jitter Backoff
Linear jitter backoff modifies the linear backoff strategy by introducing randomness to the retry intervals. This strategy still increases the delay linearly but adds a random "jitter" to each interval.

Advantages: The randomness helps spread out the retry attempts over time, reducing the chance of synchronized retries across instances.

Disadvantages: Although better than simple linear backoff, retries can still cluster under heavy load because the base interval increases only linearly.
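
A small sketch of linear backoff with additive jitter follows. The parameter names and defaults (`base`, `step`, `max_jitter`) are assumptions chosen for illustration.

```python
import random


def linear_jitter_delay(attempt: int, base: float = 1.0, step: float = 1.0,
                        max_jitter: float = 0.5) -> float:
    """Return the wait in seconds before retry number `attempt` (0-based).

    The linear part grows by `step` per attempt; a random amount in
    [0, max_jitter] is added so that clients do not retry in lockstep.
    """
    linear_part = base + step * attempt
    return linear_part + random.uniform(0.0, max_jitter)


# Two clients failing at the same time now get slightly different delays.
for attempt in range(3):
    print(attempt, round(linear_jitter_delay(attempt), 3))
```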

Exponential Backoff
Exponential backoff involves increasing the delay between retries exponentially. The interval might start at 1 second, then increase to 2 seconds, 4 seconds, 8 seconds, and so on, typically up to a maximum delay. This approach is more aggressive in spacing out retries than linear backoff.

Advantages: Significantly reduces the load on the system and the likelihood of collision or overlap in retry attempts, making it suitable for high-load environments.

Disadvantages: In situations where a quick retry might resolve the issue, this approach can unnecessarily delay the resolution.
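
The 1, 2, 4, 8 second progression with a maximum delay can be sketched as below. The 30-second cap and the function name are assumptions for illustration.

```python
def exponential_backoff_delay(attempt: int, base: float = 1.0,
                              factor: float = 2.0, cap: float = 30.0) -> float:
    """Return the wait in seconds before retry number `attempt` (0-based).

    The delay is base * factor ** attempt, limited to `cap` so that
    late retries do not wait indefinitely long.
    """
    return min(cap, base * factor ** attempt)


print([exponential_backoff_delay(a) for a in range(6)])
# [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```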

Exponential Jitter Backoff
Exponential jitter backoff combines exponential backoff with randomness. After each retry, the backoff interval is exponentially increased, and then a random jitter is applied. The jitter can be either additive (adding a random amount to the exponential delay) or multiplicative (multiplying the exponential delay by a random factor).

Advantages: Offers all the benefits of exponential backoff, with the added advantage of reducing retry collisions even further due to the introduction of jitter.

Disadvantages: The randomness can sometimes result in longer than necessary delays, especially if the jitter is significant.
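
The sketch below shows both jitter variants described above, plus a small retry loop that uses one of them. `TransientError`, the jitter ranges, the cap, and the `retry` helper are all hypothetical names and values chosen for illustration, not a reference implementation.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for errors that are worth retrying."""


def additive_jitter_delay(attempt, base=1.0, cap=30.0, max_jitter=1.0):
    """Capped exponential delay plus a random amount in [0, max_jitter]."""
    return min(cap, base * 2 ** attempt) + random.uniform(0.0, max_jitter)


def multiplicative_jitter_delay(attempt, base=1.0, cap=30.0):
    """Capped exponential delay scaled by a random factor in [0.5, 1.5]."""
    return min(cap, base * 2 ** attempt) * random.uniform(0.5, 1.5)


def retry(operation, max_retries=5, delay_fn=multiplicative_jitter_delay,
          sleep=time.sleep):
    """Call `operation` until it succeeds or `max_retries` retries are used.

    Only TransientError is retried; any other exception propagates at once.
    `sleep` is injectable so the loop can be exercised without waiting.
    """
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            sleep(delay_fn(attempt))
```

A caller would pass the flaky call as a function, for example `retry(lambda: fetch_profile(user_id))`, where `fetch_profile` stands for any hypothetical operation that raises `TransientError` on a transient failure.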

Reddit’s Core Architecture

A quick look at Reddit’s Core Architecture that helps it serve over 1 billion users every month.

This information is based on research from many Reddit engineering blogs. But since architecture is ever-evolving, things might have changed in some aspects.

The main points of Reddit’s architecture are as follows:

Reddit uses a Content Delivery Network (CDN) from Fastly as a front for the application.
Reddit started using jQuery in early 2009. Later on, they started using TypeScript and have now moved to modern Node.js frameworks. Over the years, Reddit has also built mobile apps for Android and iOS.
Within the application stack, the load balancer sits in front and routes incoming requests to the appropriate services.
Reddit started as a Python-based monolithic application but has since started moving to microservices built using Go.
Reddit heavily uses GraphQL for its API layer. In early 2021, they started moving to GraphQL Federation, which is a way to combine multiple smaller GraphQL APIs known as Domain Graph Services (DGS). In 2022, the GraphQL team at Reddit added several new Go subgraphs for core Reddit entities thereby splitting the GraphQL monolith.
From a data storage point of view, Reddit relies on Postgres for its core data model. To reduce the load on the database, they use memcached in front of Postgres. Also, they use Cassandra quite heavily for new features mainly because of its resiliency and availability properties.
To support data replication and maintain cache consistency, Reddit uses Debezium to run a Change Data Capture process.
Expensive operations such as a user voting or submitting a link are deferred to an async job queue via RabbitMQ and processed by job workers. For content safety checks and moderation, they use Kafka to transfer data in real-time to run rules over them.
Reddit uses AWS and Kubernetes as the hosting platform for its various apps and internal services.
For deployment and infrastructure, they use Spinnaker, Drone CI, and Terraform.
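
To illustrate the idea of placing a cache in front of the database, here is a minimal cache-aside sketch. It is not Reddit's implementation: plain dictionaries stand in for memcached and Postgres, the class and method names are hypothetical, and invalidating on write is a simplification of the Change Data Capture approach mentioned above.

```python
class CacheAsideStore:
    """Read through a cache first and fall back to the database on a miss."""

    def __init__(self, database: dict):
        self.database = database  # stand-in for Postgres
        self.cache = {}           # stand-in for memcached
        self.db_reads = 0         # counts how often the database is hit

    def get(self, key):
        if key in self.cache:
            return self.cache[key]       # cache hit: no database load
        self.db_reads += 1
        value = self.database.get(key)   # cache miss: read the database
        if value is not None:
            self.cache[key] = value      # populate the cache for next time
        return value

    def put(self, key, value):
        self.database[key] = value
        self.cache.pop(key, None)        # invalidate so readers see the new value


store = CacheAsideStore({"post:1": "hello"})
store.get("post:1")
store.get("post:1")
print(store.db_reads)  # 1
```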

Over to you: what other aspects do you know about Reddit’s architecture?

Everything You Need to Know About Cross-Site Scripting (XSS)

XSS, a prevalent vulnerability, occurs when malicious scripts are injected into web pages, often through input fields. Check out the diagram below for a deeper dive into how this vulnerability emerges when user input is improperly handled and subsequently returned to the client, leaving systems vulnerable to exploitation.

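Understanding the distinction between Reflected and Stored XSS is crucial. In Reflected XSS (also called Reflective XSS), the injected script travels in the request and executes as soon as the server echoes it back in the response. In Stored XSS, the script is saved on the server and served to every user who later views the affected page, posing a long-term threat. Dive into the diagrams for a comprehensive comparison of these attack vectors.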

Imagine this scenario: A cunning hacker exploits XSS to clandestinely harvest user credentials, such as cookies, from their browser, potentially leading to unauthorized access and data breaches. It's a chilling reality.

But fret not! Our flyer also delves into effective mitigation strategies, empowering you to fortify your systems against XSS attacks. From input validation and output encoding to implementing strict Content Security Policies (CSP), we've got you covered.
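
As a small illustration of output encoding, the sketch below escapes user input with Python's standard `html.escape` before placing it in a page. The `render_comment` helper and the example policy string are hypothetical; a real application would normally rely on its template engine's auto-escaping and tune the policy to its own needs.

```python
import html


def render_comment(user_input: str) -> str:
    """Escape user input so the browser treats it as text, not markup."""
    return f"<p>{html.escape(user_input)}</p>"


# An example (assumed) Content Security Policy header that only allows
# scripts loaded from the site's own origin.
CSP_HEADER = {"Content-Security-Policy": "default-src 'self'; script-src 'self'"}

print(render_comment('<script>alert("xss")</script>'))
# <p>&lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;</p>
```

Escaping neutralizes the injected tag at render time, while the CSP header acts as a second layer if an unescaped value slips through.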

Over to you: How can we amplify user awareness to proactively prevent falling victim to XSS attacks? Share your insights and strategies below! Let's collaboratively bolster our web defenses and foster a safer digital environment.

SPONSOR US

Get your product in front of more than 500,000 tech professionals.

Our newsletter puts your products and services directly in front of an audience that matters - hundreds of thousands of engineering leaders and senior engineers - who have influence over significant tech decisions and big purchases.

Space Fills Up Fast - Reserve Today

Ad spots typically sell out about 4 weeks in advance. To ensure your ad reaches this influential audience, reserve your space now by emailing hi@bytebytego.com

Donald Parish

Jun 2, 2024

Re “7 must-know strategies to scale your database” the Denormalization is really a reorganization of 3NF DB to dimensional model with normalized fact tables. 3NF to get data in, dimensional model (star schema) to get the data out.

Alex Pliutau

Jun 10, 2024

In some cases having a database proxy might help (if you're ready to invest in it), we recently wrote about that - https://packagemain.tech/p/the-developers-guide-to-database
