Discussion about this post

Scenarica

"Shoes for pregnant women" returning slip-resistant shoes with zero keyword overlap. That single example is worth more than the entire technical explanation because it shows the gap in a way any product person can feel.

The 9% quality rate on co-purchase explanations is the number that should make every team building on LLMs uncomfortable. 91% of the model's reasoning about why people buy things together was either circular or obvious. That's not a data problem. That's the model confidently narrating a pattern it can describe but doesn't actually understand.

COSMO works because Amazon treated the LLM as a noisy ore mine, not an oracle. Generate millions of candidates, throw away 91% of them, validate the rest against human judgement, and only then build the knowledge graph. The engineering isn't in the generation. It's in the filtration.

Most teams skip the filtration step and ship the 91%.

Shank

The filtering step is where this gets interesting. LLMs generate millions of candidates but only 9-35% survive rule-based filters. That gap is basically an index of how much language models still hallucinate utility.
