This week’s system design refresher:
Stack Overflow architecture
iQIYI database selection trees
Latency Numbers Every Programmer Should Know for the 2020s
Row-based DB vs. Column-based DB
How would you design the Stack Overflow website?
If your answer is on-premises servers and a monolith, you would likely fail the interview, but that's how it is built in reality!
What people think it should look like
The interviewer is probably expecting something on the left side.
Microservices are used to decompose the system into small components.
Each service has its own database, and caches are used heavily.
The service is sharded.
The services talk to each other asynchronously through message queues.
The service is implemented using Event Sourcing with CQRS.
The candidate shows off knowledge of distributed systems, such as eventual consistency, the CAP theorem, etc.
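To make the Event Sourcing with CQRS item above a bit more concrete, here is a minimal sketch in Python. Every name in it (QuestionPosted, VoteCast, EventStore, ScoreView) is hypothetical and chosen only for illustration. It shows the general pattern an interviewer might expect, not how Stack Overflow is built.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class QuestionPosted:  # hypothetical event type
    question_id: int
    title: str


@dataclass(frozen=True)
class VoteCast:  # hypothetical event type
    question_id: int
    delta: int  # +1 for an upvote, -1 for a downvote


@dataclass
class EventStore:
    """Write side: an append-only log of events, the source of truth."""
    events: list = field(default_factory=list)

    def append(self, event):
        self.events.append(event)


@dataclass
class ScoreView:
    """Read side: a projection derived purely from the event log."""
    scores: dict = field(default_factory=dict)

    def apply(self, event):
        if isinstance(event, QuestionPosted):
            self.scores[event.question_id] = 0
        elif isinstance(event, VoteCast):
            self.scores[event.question_id] += event.delta


def rebuild_view(store):
    """Replay every event in order to rebuild the read model from scratch."""
    view = ScoreView()
    for event in store.events:
        view.apply(event)
    return view


store = EventStore()
store.append(QuestionPosted(1, "How do I exit Vim?"))
store.append(VoteCast(1, +1))
store.append(VoteCast(1, +1))
store.append(VoteCast(1, -1))

view = rebuild_view(store)
print(view.scores[1])  # 1
```

Commands only ever append to the log, and queries only ever read the projection. In a real deployment the projection would usually be updated asynchronously, which is where eventual consistency comes in.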
What it actually is
Stack Overflow serves all of its traffic with only 9 on-premises web servers, and it runs as a monolith! It owns its servers and does not run on the cloud.
This runs contrary to popular belief these days.
iQIYI database selection trees
One picture is worth a thousand words.
iQIYI is one of the largest online video sites in the world, with over 500 million monthly active users. Let's look at how they choose relational and NoSQL databases.
The following databases are used at iQIYI:
MySQL
Redis
TiDB: a hybrid transactional/analytical processing (HTAP) distributed database
Couchbase: distributed multi-model NoSQL document-oriented database
TokuDB: open-source storage engine for MySQL and MariaDB
Big data analytical systems, like Hive and Impala
Other databases, like MongoDB, HiGraph, and TiKV
The database selection trees below explain how they choose a database.
Latency Numbers Every Programmer Should Know for the 2020s
This concept was originally presented by Jeff Dean. We updated some of these numbers to more closely reflect reality in the 2020s. Absolute accuracy is not the goal. Developing an intuition of the relative differences is.
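As a rough illustration of thinking in relative differences, the sketch below expresses a few latencies as multiples of a main memory reference. The values are the commonly cited figures from the original list (circa 2012), not the updated 2020s numbers from the diagram, and the helper name relative_to is our own.

```python
# Commonly cited figures from the original list (circa 2012), in nanoseconds.
# Assumption: these are NOT the updated 2020s values shown in the diagram.
LATENCY_NS = {
    "L1 cache reference": 0.5,
    "Main memory reference": 100,
    "Round trip within same datacenter": 500_000,
    "Disk seek": 10_000_000,
    "Packet round trip CA -> Netherlands -> CA": 150_000_000,
}


def relative_to(baseline, table=LATENCY_NS):
    """Express every latency as a multiple of the chosen baseline entry."""
    base = table[baseline]
    return {name: ns / base for name, ns in table.items()}


ratios = relative_to("Main memory reference")
for name, ratio in ratios.items():
    print(f"{name:<45} {ratio:>16,.3f}x")
```

With these figures, a disk seek costs about 100,000 main memory references. That ratio is the kind of intuition the list is meant to build.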
Why do we use a column-based DB? Does a column-based DB provide better performance?
The diagram below shows how data is stored in a column-based DB.
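For readers without the diagram, here is a small Python sketch of the two layouts. The table and its column names are made up for illustration, and real engines store these layouts on disk in far more elaborate formats.

```python
# Hypothetical three-row table, used only for illustration.
rows = [
    {"id": 1, "country": "US", "amount": 30},
    {"id": 2, "country": "US", "amount": 45},
    {"id": 3, "country": "DE", "amount": 20},
]

# Row-based layout: all fields of one record sit next to each other.
row_store = [tuple(r.values()) for r in rows]

# Column-based layout: all values of one column sit next to each other.
column_store = {key: [r[key] for r in rows] for key in rows[0]}

# An aggregate over one column only needs to read that column's values.
total_amount = sum(column_store["amount"])

print(row_store)     # [(1, 'US', 30), (2, 'US', 45), (3, 'DE', 20)]
print(column_store)  # {'id': [1, 2, 3], 'country': ['US', 'US', 'DE'], 'amount': [30, 45, 20]}
print(total_amount)  # 95
```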
When to use
The table is a wide table with many columns.
The queries and calculations are on a small number of columns.
Many of the columns contain only a few distinct values.
Benefits of column-based DB
Higher data compression rates.
Higher performance on OLAP functions.
No need for additional indexes.
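One reason a column with few distinct values compresses well is that identical values end up next to each other. The sketch below uses run-length encoding as one example technique. It is our choice for illustration, not a claim about what any particular database does, and the function names are hypothetical.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse each run of equal adjacent values into a (value, count) pair."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def rle_decode(runs):
    """Expand (value, count) pairs back into the original list."""
    return [value for value, count in runs for _ in range(count)]


# Hypothetical low-cardinality column: 9 values, 3 distinct.
country = ["US"] * 4 + ["DE"] * 3 + ["FR"] * 2

encoded = rle_encode(country)
print(encoded)  # [('US', 4), ('DE', 3), ('FR', 2)]

# A count per value can be computed on the compressed form directly.
counts = {}
for value, count in encoded:
    counts[value] = counts.get(value, 0) + count
print(counts)  # {'US': 4, 'DE': 3, 'FR': 2}
```

In a row-based layout the same values would be interleaved with the other fields of each record, so runs like these would not form.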