Discussion about this post

Yuzu Xu

The convergence on simulation-based LLM eval is happening independently in Chinese AI deployments too, for structurally different reasons. Meituan, Alibaba, and JD are running LLM-powered customer service at 100M+ daily interactions, where the reliability bar is much higher than chatbot tolerance. A failed food delivery agent reroute costs the merchant real money, not just seconds of conversation quality. What Chinese engineering teams describe is weighting failure-mode recovery heavily in simulation over success-path accuracy. The DoorDash approach using historical transcripts as the seed for scenario generation maps exactly to what Alibaba's teams have described publicly. The interesting divergence: Chinese deployments skew more toward adversarial scenario injection in simulation because the consequence of agent failure in consumer contexts is immediate and measurable.

Dan Kinsky

Oh wow, it's a spin on how Generative Adversarial Networks (GANs) work. They create better and better images (here, chats) by having one AI (a Generator) create them and another (a Discriminator) judge whether they're good enough to fool it. I wonder why they only use this for testing and don't go all the way and make it part of the training, like with GANs?
