Great read. This does a fantastic job explaining the hardware side of the AI revolution, especially why LLMs are fundamentally a hardware problem (data movement and linear algebra), not just a software one.
The CPU vs GPU contrast, the memory wall discussion, and the breakdown of Tensor Cores and HBM make it clear why this wave of progress was only possible now, and why this moment stands out as a genuine technology revolution. Same for TPUs and systolic arrays: extreme specialization, massive efficiency gains.
Software is where we, as humans, experience AI, but silicon is where the revolution is actually happening!
Great read, thanks for the writeup! I have a question about this passage: "While a standard core completes one floating-point operation per cycle, a Tensor Core executes a 4×4 matrix multiplication involving 64 individual operations (16 multiplies and 16 additions in the multiply step, plus 16 accumulations) instantly." Naively I'd expect 5 individual operations (4 multiplies and 1 addition) for each element of the result matrix, which sums to 5 * 16 = 80. Are there any optimization steps that I'm missing?
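For reference, here is a rough sketch of how I tried to count the operations myself, assuming the Tensor Core's basic operation is the matrix multiply-accumulate D = A·B + C (that assumption is mine, not a claim about the article's exact accounting). With the accumulator initialized to C, each output element takes 4 multiplies and 4 additions, so the naive count is 64 multiplies and 64 additions, which the hardware treats as 64 fused multiply-adds:

```python
# Rough illustration only: count scalar operations in a naive 4x4
# matrix multiply-accumulate D = A @ B + C.
# The accumulator for each output element starts at C[i][j], so every
# product is followed by one addition into the running sum.
N = 4
multiplies = 0
additions = 0
for i in range(N):          # rows of A
    for j in range(N):      # columns of B
        for k in range(N):  # dot-product length
            multiplies += 1  # A[i][k] * B[k][j]
            additions += 1   # accumulate the product into D[i][j]

print("multiplies:", multiplies)   # 64
print("additions :", additions)    # 64
print("fused multiply-adds:", multiplies)  # 64, if each pair is fused
```

So depending on whether you count fused multiply-adds (64) or separate multiplies and additions (128), you get different totals, which may be where my 80 and the article's 64 diverge.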
Great article