6 Comments
Prasanna Narayanan

I had no interest in this topic, but once I started reading I couldn't stop. What a fantastic way to explain a complex concept! Truly one of the best newsletters I subscribe to.

Neural Foundry

The systolic array explanation here is really clear. The weight-stationary design solves the memory bottleneck that kills most accelerator performance. I remember working with an older GPU setup that was constantly starved for bandwidth; switching to a systolic architecture cut inference latency by more than half in production. That "data is read once but gets used thousands of times" part matters way more than raw FLOPS.
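
Rough sketch of what that means, with illustrative tile and batch sizes (not from the article): once a weight tile is resident in the array, every additional input row reuses it for free, so arithmetic intensity grows with the batch size rather than with memory bandwidth.

```python
# Illustrative only: MACs performed per byte of weight traffic when the
# weights are fetched from memory once and kept stationary in the array
# (8-bit weights assumed).
def arithmetic_intensity(rows: int, cols: int, batch: int, bytes_per_elem: int = 1) -> float:
    macs = batch * rows * cols                    # total multiply-accumulates
    weight_bytes = rows * cols * bytes_per_elem   # weights read from memory once
    return macs / weight_bytes

# A 256x256 weight tile fed 1,024 input rows: each weight byte feeds ~1,024
# MACs -- the "read once, used thousands of times" effect.
print(arithmetic_intensity(rows=256, cols=256, batch=1024))  # 1024.0
```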

Neural Foundry

The systolic array bucket-brigade analogy is perfect for explaining why data locality matters more than raw clock speed. I've debugged enough CUDA kernels to know that memory bandwidth kills performance way before compute does. What's clever about the weight-stationary design is eliminating the round-trip to memory entirely for the inner loop. Once you're doing 65k MACs per cycle at 700 MHz with minimal data movement, suddenly training models at scale stops being economically impossible.
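
A quick back-of-the-envelope check on those figures, counting a multiply plus an add as two ops (a sketch from the numbers in the comment, nothing more):

```python
# Throughput of a 256x256 weight-stationary array clocked at 700 MHz
# (the 65k-MACs-per-cycle figure above).
ARRAY_DIM = 256
CLOCK_HZ = 700e6

macs_per_cycle = ARRAY_DIM * ARRAY_DIM      # 65,536 MACs in flight per cycle
macs_per_sec = macs_per_cycle * CLOCK_HZ    # ~4.6e13 MACs/s
ops_per_sec = 2 * macs_per_sec              # multiply + add = 2 ops per MAC

print(f"{macs_per_cycle:,} MACs/cycle -> {ops_per_sec / 1e12:.0f} TOPS")
# 65,536 MACs/cycle -> 92 TOPS
```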

Aaron G

If the evolution of custom AI hardware like Google’s TPU accelerates both the scale and ubiquity of large language models, how might that reshape human behavior? I think Americans can already be very demanding, yet this might normalize expectations of instantaneous, machine-mediated reasoning and decision-making in everyday life. Will people eventually choose among several machine-generated conclusions as a matter of routine?

Laura Brindusa Squire

Ty for writing this

Rakia Ben Sassi

A hot topic. Thanks for tackling it!
