Hi there,
In another thread, Ryan, you mentioned:
“The only thing that has changed in version 6 is the additional requirement of having the database on line as well. However, replica sets can be used to back up the database and automatically fail over if the primary database goes down. We also no longer need Pulse as a proxy for larger farms, since the database can handle the load. So the slaves still work independently, just like they did in v5, and we really haven’t seen a need to change this.”
I didn’t want to hijack that thread, so I’m starting a new one here.
First off, Pulse is a requirement for larger farms; it’s not optional. If/when Pulse goes down for us, all dependency handling and cleanup essentially stop working. The current Python API also requires Pulse to be running. It’s an extremely critical part of even a medium-complexity Deadline setup, basically becoming the weakest link (fingers still crossed for a native, Pulse-independent Python API…).
But the main reason for me to start this thread is this: since we now have the requirement that three central services be constantly operational (MongoDB, the repository, and Pulse), a central dispatcher mechanism is not a stretch at this point.
Why would we need one, though? Deadline seems to operate “just fine” with the current setup, and that is exactly the issue we are seeing: we would like it to operate perfectly, not just fine.
We are struggling to get farm CPU utilization above 25-30% with the current setup, and it usually comes down to the inability to implement ‘clever’ task assignment behaviors.
A few examples:
- We have sim jobs that require sequential rendering of frames by a single machine. Ideally this is the same machine throughout, because every machine swap means restarting Max and reloading the scene, sometimes adding 5-10+ minutes per frame. Without a central dispatcher it’s pretty much impossible to keep a machine locked to its job, since the slave can at any time find another higher-priority task; it can’t make an ‘overview’ decision. It’s common for two sim jobs to keep swapping their machines… The problem is most visible with sim jobs, but it affects every single job: machines wander randomly from job to job, fully reinitializing each time. We have too many jobs per day in the queue for there to be static idle periods where things “just cook”. (See the first sketch after this list for the kind of sticky assignment we mean.)
- Minimizing idle machine counts by dynamically adjusting machine limits on jobs. We run into this often, especially late at night. It’s annoying to come in in the morning and find 90% of the farm idle while 50+ jobs sit at a machine limit of 8… We now run a night shift of wranglers purely because of this issue ($$). (The second sketch below shows the kind of rebalancing loop we have in mind.)
- Detecting the ideal machines for a job, and moving tasks between jobs based on performance.
- Spawning new slaves on machines that have a percentage of their CPUs unused, based on what the queued jobs might require.
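To make the first point concrete, here is a minimal sketch of the sticky assignment a central dispatcher could do. Everything in it (the Job/Machine classes, pick_task, dispatch) is hypothetical illustration, not Deadline’s API: the point is simply that a dispatcher sees all machines and all jobs at once, so it can keep a warm machine on its sim job instead of letting it chase a higher-priority task.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical types for illustration -- not Deadline's API.
@dataclass
class Job:
    name: str
    priority: int
    queued_tasks: int
    sticky: bool = False  # e.g. sim jobs that pay 5-10+ min per machine swap

@dataclass
class Machine:
    name: str
    last_job: Optional[str] = None  # job this machine is currently 'warm' on

def pick_task(machine: Machine, jobs: list) -> Optional[Job]:
    """Prefer the job the machine already has loaded; otherwise fall back
    to plain priority order (roughly what the slaves do on their own)."""
    candidates = [j for j in jobs if j.queued_tasks > 0]
    if not candidates:
        return None
    for job in candidates:
        # Sticky pass: staying put avoids a Max restart + scene reload,
        # even when a higher-priority job is queued.
        if job.sticky and job.name == machine.last_job:
            return job
    return max(candidates, key=lambda j: j.priority)

def dispatch(machines: list, jobs: list) -> None:
    """One central pass over the whole farm -- the 'overview' decision
    an independent slave can never make."""
    for machine in machines:
        job = pick_task(machine, jobs)
        if job is None:
            continue
        job.queued_tasks -= 1
        machine.last_job = job.name
```

In this toy model, two sim jobs on two machines never swap machines while they still have queued tasks; with each slave deciding independently, they routinely do.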
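And for the second point, a sketch of the overnight rebalancing we’d like: hand idle machines out round-robin to jobs that are pinned at their machine limit but still have queued tasks. Both arguments are placeholders for whatever the queue would expose, not existing Deadline calls.

```python
def rebalance_limits(idle_count, limited_jobs, set_machine_limit):
    """Give one extra machine at a time, round-robin, to every job that is
    stuck at its machine limit with tasks still queued, until no idle
    machines remain.

    limited_jobs: dict of job id -> current machine limit (hypothetical).
    set_machine_limit: placeholder for a queue-side setter (hypothetical).
    """
    while idle_count > 0 and limited_jobs:
        for job_id in list(limited_jobs):
            if idle_count == 0:
                break
            limited_jobs[job_id] += 1
            set_machine_limit(job_id, limited_jobs[job_id])
            idle_count -= 1
```

Run from a central dispatcher every few minutes, something even this simple would likely keep the farm busy overnight instead of paying wranglers to do it by hand.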
With the monolithic slave application approach, all of these require hackery, and some can’t be solved at all (with the current feature set).
Anyway, opinions?