3 Comments
User's avatar
Prashanth Manohar's avatar

We’re dealing with GPU workloads, so the problem is quite different from CPU workers. Instead of relying on warm instances, we snapshot the full CPU and GPU execution state, so restore behaves like a resume rather than re-initialization.

https://github.com/inferx-net/inferx

Jorrit's avatar

Very Interesting take. I have written this article about Cloudlfare and would be happy if you can check it out :)

https://marketsminds.substack.com/p/deep-dive-cloudflare-the-internets?utm_campaign=post-expanded-share&utm_medium=post%20viewer

Pawel Jozefiak's avatar

Strong point, and it matches my own tests. I tested Sonnet 4.6 on real coding and long context workflows, and the pattern was consistent: results improved when I enforced agent handoff boundaries and explicit success checks. When context got noisy, reliability dropped even if raw speed looked good. I documented both experiments, including a personal context test, here: https://thoughts.jock.pl/p/sonnet-46-two-experiments-one-got-personal

Have you found a reliable way to catch silent failures early? I would love to compare notes on your exact setup.