9 Comments

Why are both Kafka and Redis required here? Wouldn't one be enough?

I don't quite understand how the jobs are "dispatched" from Redis to the workers. Wouldn't workers need to poll the Redis cache in order to find the available jobs? Can someone please explain? Also, it is mentioned that it is debatable whether cron was the right choice for Slack. What other options could Slack have considered?

I had the same question in mind. I wonder if they went for pub/sub with Redis (https://redis.io/docs/latest/develop/interact/pubsub/), but yeah, a clarification would be nice.

It's for historical reasons: their job queue system evolved from Redis to Kafka+Redis.

You can check this: https://slack.engineering/scaling-slacks-job-queue/

They might have used something like RedisGears for this.

Yes, that might be possible. Thanks for the pointer!

Good to know

Jun 6·edited Jun 6

This mentions they don't have to keep pods in sync because there is a single leader, which makes sense from the execution perspective, but how do they ensure a shared, consistent schedule across pods? Do they pull the whole schedule from the DB after a pod is elected as leader? I wonder how that would work for millions of scheduled jobs.

> While this approach worked fine, there were cases where a script’s execution time exceeded its recurrence intervals, leading to the possibility of two copies running concurrently

> they used flock

How is that? flock does exactly that: it prevents other processes from running while the lock is held. If a second copy of the script starts, it would either wait for the lock or skip the run, depending on configuration.
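
To make the "skip" case concrete, here is a minimal Python sketch of a non-blocking exclusive lock, roughly what `flock -n` does on the command line. This is only an illustration under my own assumptions, not how Slack wired it up: the helper name `try_run_exclusive` and the lock path are hypothetical. Dropping `LOCK_NB` would give the "wait" behavior instead.

```python
import fcntl
import os
import tempfile


def try_run_exclusive(lock_path, job):
    """Run job() only if nobody else holds the lock; return True if it ran.

    Hypothetical helper for illustration only.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR)
    try:
        try:
            # LOCK_EX: exclusive lock. LOCK_NB: fail immediately instead of waiting.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            # Another run still holds the lock, so skip this run.
            return False
        job()
        return True
    finally:
        # Closing the descriptor releases the lock if we held it.
        os.close(fd)


if __name__ == "__main__":
    # Hypothetical lock file location.
    path = os.path.join(tempfile.gettempdir(), "example-cron-job.lock")
    ran = try_run_exclusive(path, lambda: print("job running"))
    print("ran" if ran else "skipped: previous run still in progress")
```

With this shape, a run that overlaps a still-running previous run returns `False` right away rather than starting a second copy.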
