4 Comments
User's avatar
Anny He's avatar

Hmm you can have both, eg. use open models (eg. llama, kimi) via API in a hosted remote environment like AWS Bedrock. This way you delay building out the infrastructure while using API, and avoid paying frontier model fees.

The Synthesis's avatar

Renting open weights through Bedrock still means someone else solved the prefill-versus-decode batching for you, so you're buying inference engineering, just unbundled. The shakier part is assuming the open supply stays open: Meta spent three years as the largest open-source champion, then shipped its first proprietary model once returns mattered. Open weights were a catch-up strategy, and catch-up ends.

Mitchell Kosowski's avatar

Best explanation of the prefill/decode split I've read: framing decode as memory-bandwidth-bound makes speculative decoding and quantization click instantly. Saving this as a reference. Thanks for writing it.

Menoko OG-Original Geek's avatar

Great article about inference and putting it to the engineering lens. This is information that can be easily comprehended and used today. Thank you for making us better.