Discussion about this post

Atharva Shah

For anybody who wants to learn more about the recommendation algorithm itself, I wrote a detailed breakdown here: https://atharvashah.substack.com/p/how-675-million-spotify-users-get

Neural Foundry

The beauty of this story is how cleanly it's laid out versus how messy the real thing typically feels. The architecture is fantastic: events as the primary object, Platform as Code, Backstage as the entry point, Scio/Beam for portability. But the harder part at this scale is organizational: keeping 50+ teams from treating schemas as suggestions and "adding just another field" as a habit.

The post refers to self-serve throughout, but self-serve only works if you have strongly opinionated defaults or someone willing to say "no." Otherwise you wake up to 1800 variations of "play_event_v2_final," and nobody trusts anything. The compliance/privacy aspect is also nearly glossed over. Building anonymization and key handling into the event path is a significant win, in both engineering and process terms, and that is exactly where most organizations quietly compromise until a regulator shows up.

I would enjoy seeing the next post spend more time on the "ugly" parts of the buildout (bad producers, schema drift, backfills that fail, etc.) and on the teams that refused to clean up their junk pipelines because "the dashboard depends on it." As a blueprint for what a serious data platform looks like once it has grown up, this is probably one of the clearest write-ups I've read.
