Hmm, you can have both, e.g. use open models (e.g. Llama, Kimi) via API in a hosted remote environment like AWS Bedrock. This way you delay building out the infrastructure while you're on the API, and avoid paying frontier model fees.
Renting open weights through Bedrock still means someone else solved the prefill-versus-decode batching for you, so you're buying inference engineering, just unbundled. The shakier part is assuming the open supply stays open: Meta spent three years as the largest open-source champion, then shipped its first proprietary model once returns mattered. Open weights were a catch-up strategy, and catch-up ends.
Best explanation of the prefill/decode split I've read: framing decode as memory-bandwidth-bound makes speculative decoding and quantization click instantly. Saving this as a reference. Thanks for writing it.
Great article about inference, viewed through an engineering lens. This is information that can be easily comprehended and used today. Thank you for making us better.