Discussion about this post

Scenarica

The billion-predictions-per-second figure is impressive, but it's actually the easier constraint to solve. That's parallelism. The harder constraint is the 100ms budget for any single prediction, because latency is an architecture problem and you can't parallelise your way out of it.

Everything downstream, the retrieval/ranking split and the compute graph separation between GPU and CPU, is shaped by that single number. The system's architecture is really a latency budget expressed as infrastructure.
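
To make the distinction concrete, here is a rough sketch in Python. The stage names and per-stage timings are hypothetical, chosen only for illustration; the 100ms budget and the billion-per-second target are the figures discussed above. It treats the stages as strictly serial and uses Little's law (concurrency = throughput × latency) to estimate how many requests must be in flight at once.

```python
import math

# Hypothetical per-stage latencies in milliseconds (illustrative, not from the post).
STAGE_MS = {"retrieval": 20, "feature_fetch": 15, "ranking": 40, "network": 15}

BUDGET_MS = 100                    # per-prediction latency budget
TARGET_PER_SECOND = 1_000_000_000  # throughput target


def request_latency_ms(stages):
    """Serial stages add up; this sum does not depend on how many copies run."""
    return sum(stages.values())


def throughput_per_second(stages, in_flight):
    """Throughput grows linearly with the number of requests served in parallel."""
    return in_flight * 1000 / request_latency_ms(stages)


def in_flight_needed(stages, target_per_second):
    """Little's law: concurrency = throughput * latency."""
    return math.ceil(target_per_second * request_latency_ms(stages) / 1000)


latency = request_latency_ms(STAGE_MS)
print(f"latency per prediction: {latency} ms (budget {BUDGET_MS} ms)")
print(f"in-flight requests needed: {in_flight_needed(STAGE_MS, TARGET_PER_SECOND):,}")
```

Under these assumptions, hitting the throughput target is a matter of raising `in_flight`, while the only way to stay under `BUDGET_MS` is to change what the stages are or how long each one takes.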
