Discussion about this post

Pawel Jozefiak

Interesting how this differs from what I saw at the Mistral hackathon last week - their architecture pitch leaned heavily on MoE efficiency gains, but in practice the models lagged behind the spec sheet.

The gap between architectural novelty and real model behavior is wider than benchmark scores suggest. DeepSeek reportedly training for ~$5.5M, while Kimi K2 activates 32B parameters per token out of 1T total - these numbers tell a compelling story about diverging bets.

I wrote about running Mistral head-to-head against other models at the EU hackathon (https://thoughts.jock.pl/p/mistral-ai-honest-review-eu-hackathon-2026) - the architectural gap shows up exactly where the paper predicts it would.

