
Using Celery for scheduled jobs doesn't seem to be a supported design pattern at all, and any job that gets anywhere close to the one-hour timeout becomes annoying to work with. Celery seems primarily designed for sending emails in response to web requests, which is not the use case most people are discussing here.


I don't see how you came to this conclusion. Jobs can be as long as you want, and you can have retries, persistent queues, priorities, and dependencies.

Of course, I would advise putting very long-running tasks on a dedicated queue and setting worker_prefetch_multiplier to 1, as the docs recommend for long-running tasks: https://docs.celeryproject.org/en/stable/userguide/optimizin...
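
Roughly, that setup looks like this (module, task, and queue names here are invented, just to sketch the shape):

    # celeryconfig.py -- hypothetical names, following the docs' advice
    # Route the known long-running task to its own queue so a dedicated
    # worker picks it up without starving the short tasks.
    task_routes = {
        'myapp.tasks.long_report': {'queue': 'long'},
    }

    # Fetch one message at a time instead of a batch; prefetching just
    # makes a busy worker hold other long tasks hostage.
    worker_prefetch_multiplier = 1

    # Ack only after the task finishes, so a crashed worker's task
    # gets redelivered.
    task_acks_late = True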

With Flower (https://flower.readthedocs.io/en/latest/), you can even monitor the whole thing or intervene manually.

I assume your comment is reporting on other comments, but not direct experience?


Direct experience very fresh in memory :)

The issue with long-running tasks is that you have to raise the timeout beyond the default of one hour (otherwise Celery assumes the job is lost and requeues it). But this is a global parameter across all queues, so we essentially lose the one good feature of Celery for small tasks, which is retrying lost tasks within some acceptable timeframe.
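
(If memory serves, the knob in question is the Redis transport's visibility_timeout, which defaults to 3600 seconds and is set on the broker connection as a whole, not per queue; the value below is just an example:)

    # Applies to every queue on this broker connection.
    broker_transport_options = {'visibility_timeout': 4 * 3600}  # raise past 1h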

Further, Flower seems flaky: half the panels don't work when connecting through our servers (our VPC settings are a bit bespoke, but not completely out there), so it's not fully useful. Also, Flower only tracks tasks queued after you start the dashboard, yet it accumulates a laundry list of dead workers across deployments if you keep it running continuously.

We were also excited to use its chaining and chord features, but ran into a series of bugs we couldn't dig ourselves out of when tasks crashed inside a chord (it went into permanent loops). I declared bankruptcy on those features and we implemented chaining ourselves.
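
Hand-rolled chaining can be as simple as each task enqueuing its successor when it finishes; a minimal sketch (task names invented, not our actual code):

    from celery import shared_task

    @shared_task
    def step_two(result):
        # second stage; consumes the output of step_one
        print('step two got', result)

    @shared_task
    def step_one(arg):
        result = arg * 2        # stand-in for the real work
        step_two.delay(result)  # explicit hand-off instead of chain()
        return result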

Point is, I'm sure we got some parameters wrong, but another engineer and I spent WEEKS wrangling with Celery just to get it running somewhat acceptably. That seems a bit too much. We're not L10 Google engineers for sure, but we aren't stupid either. The only stupid decision we made, from what I can see, was probably choosing Celery.

In the end we still keep Celery for the on-demand async tasks that run in a few minutes. For scheduled tasks that run weekly, we implemented our own scheduler (which runs in the background in our web servers, in the same Elastic Beanstalk deployment) that uses a regular RDBMS backend and does things the way we want. It turns out to be just a few hundred lines of simple Python.
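
The core of such a scheduler is roughly this shape (heavily simplified, all names invented, SQLite standing in for the real RDBMS):

    import sqlite3
    import time
    from datetime import datetime, timedelta

    # One row per scheduled job.
    db = sqlite3.connect('scheduler.db')
    db.execute("""CREATE TABLE IF NOT EXISTS jobs (
        name TEXT PRIMARY KEY,
        next_run TEXT,            -- ISO timestamp
        interval_days INTEGER)""")

    def run_job(name):
        print('running', name)    # stand-in for dispatching the real work

    def tick():
        now = datetime.utcnow()
        due = db.execute(
            'SELECT name, interval_days FROM jobs WHERE next_run <= ?',
            (now.isoformat(),)).fetchall()
        for name, interval_days in due:
            run_job(name)
            db.execute('UPDATE jobs SET next_run = ? WHERE name = ?',
                       ((now + timedelta(days=interval_days)).isoformat(), name))
        db.commit()

    while True:                   # runs in a background thread of the web app
        tick()
        time.sleep(60)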


Fair enough and very honest.

> But this is a global parameter across all queues, so we essentially lose the one good feature of Celery for small tasks, which is retrying lost tasks within some acceptable timeframe.

Oh, for this you just set up two Celery daemons, each with its own queues and config. I usually don't want my long-running tasks on the same instance as the short ones anyway.
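
Concretely, that can be two config modules and two worker processes, something like this (app and module names invented, pointing at different Redis databases just to keep them fully separate):

    # short_config.py -- quick tasks, defaults are fine
    broker_url = 'redis://localhost:6379/0'
    task_default_queue = 'short'

    # long_config.py -- long tasks, generous timeout, no prefetching
    broker_url = 'redis://localhost:6379/1'
    task_default_queue = 'long'
    broker_transport_options = {'visibility_timeout': 6 * 3600}
    worker_prefetch_multiplier = 1

    # Each app loads its own config and gets its own worker, e.g.:
    #   celery -A short_app worker -Q short
    #   celery -A long_app worker -Q long --concurrency=2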

> We were also excited to use its chaining and chord features, but ran into a series of bugs we couldn't dig ourselves out of when tasks crashed inside a chord (it went into permanent loops). I declared bankruptcy on those features and we implemented chaining ourselves.

Granted on that one, they're not the best part of Celery.

Just out of curiosity, which broker and result backend did you use for Celery?

I mostly use Redis, as I had plenty of problems with RabbitMQ, and I wonder if your issues came from using it.


Our use case is that any of our tasks can take anywhere from 1 minute to 45 minutes, depending on how complex a query the user makes. We demoed a new task that occasionally went over the one-hour mark. It's definitely annoying to have separate queues for these tasks, but that might be what we need to do!

We use Redis. FWIW, within the narrow limits of the task properties it's remarkably stable, so no complaints there!


Yes, I feel the same after spending several weeks dealing with Celery, Redis, Docker Compose config, Flower, setting up Celery workflows, rate limits, workers, etc., and testing for a really long time...

When a workflow involves chord and chain, it becomes unintuitive to track down issues. I was stuck on one and finally posted on SO to ask for help. Luckily I got an answer.

I thought it was my own fault for having no experience with scheduling. I hope there is something simpler out there.



