You might be interested in C10M. It's the evolution of the '90s C10K problem, which is about handling large numbers of concurrent connections on a server. C10M was solved on a $5k PC in 2013. We've had a lot of hardware advances since 2013, and servers are more powerful than a $5k PC, so 15M shouldn't be the limit. On a machine where all but 2 cores have no interface with a standard OS, and with no black-box library calls, you could do a lot more. Especially if you can utilize SIMD. But great job getting this far.
Yes, but it was solved on one machine, I presume, where all communication is near-instant, using inter-process communication?
The challenge in a real-world distributed system is achieving fault tolerance: having multiple separate machines store copies of the data. Communication between DIFFERENT machines takes much more time, hence a much bigger problem for availability.
In the test you mentioned, what would happen if the single C10M computer suddenly crashed or there was a network issue?
Why would you presume that it's 10M loopback connections? Start with the fundamentals of what a NIC is capable of, and what is required to achieve 10M concurrent connections within that limitation. Remember that the CPU is much faster than the NIC, as long as you minimize wasted instructions and L1 cache misses, utilize its SIMD and IPC capabilities, and parallelize for performance, especially on servers (which usually have many cores). Also remember there's a huge difference between latency and throughput.
What test did I mention?
Your last question seems to imply that there'll be no redundancy and/or lots of unhandled failure modes or black-box functionality you can't reason about, which isn't the case. That's the primary point of removing external dependencies: you know exactly what the system is doing and can then profile it enough to know how it runs on the target hardware.
Regarding the passive fanout, how are passive and active connections detected? Does the session process handle this? Does the client send a message to the guild/session process when it moves from foreground to background?
Excellent case study! I love the "Maxjourney" internal group name 😁
"C10M was solved on a $5k PC in 2013" is what you mentioned. I'm just asking whether this was done on a single machine, handling concurrency via IPC?
What happens if this machine dies or there is a network partition? How will we continue serving the client connections?