3 Comments
Scenarica's avatar

"Shoes for pregnant women" returning slip-resistant shoes with zero keyword overlap. That single example is worth more than the entire technical explanation because it shows the gap in a way any product person can feel.

The 9% quality rate on co-purchase explanations is the number that should make every team building on LLMs uncomfortable. 91% of the model's reasoning about why people buy things together was either circular or obvious. That's not a data problem. That's the model confidently narrating a pattern it can describe but doesn't actually understand.

COSMO works because Amazon treated the LLM as a noisy ore mine, not an oracle. Generate millions of candidates, throw away 91% of them, validate the rest against human judgement, and only then build the knowledge graph. The engineering isn't in the generation. It's in the filtration.

Most teams skip the filtration step and ship the 91%.

Shank's avatar

The filtering step is where this gets interesting. LLMs generate millions of candidates but only 9-35% survive rule-based filters. That gap is basically an index of how much language models still hallucinate utility.

Mitchell Kosowski's avatar

The 30K annotations → 29M edges leverage ratio is the headline number, but the unsung hero here is the annotation design: decomposing plausibility/typicality into 5 yes/no questions cut disagreement enough that the classifier could actually generalize downstream.

Precision-over-recall feels like the right call too (and honestly, for most AI use cases). Unreliable commonsense in prod would erode trust faster than missing knowledge ever would.

Curious how the team is thinking about closing the daily-refresh gap for time-sensitive intent like flash sales... feels like the next interesting frontier.