
central dispatcher

Hi there,

In another thread, Ryan, you mentioned:

“The only thing that has changed in version 6 is the additional requirement of having the database on line as well. However, replica sets can be used to back up the database and automatically fail over if the primary database goes down. We also no longer need Pulse as a proxy for larger farms, since the database can handle the load. So the slaves still work independently, just like they did in v5, and we really haven’t seen a need to change this.”

I didn’t want to hijack that thread, so starting a new one here.

First off, Pulse is a requirement for larger farms; it's not optional. If/when Pulse goes down for us, all dependency handling and cleanup essentially stops working. If we were using the current Python API, that would also require Pulse to be running. It's an extremely critical part of even a medium-complexity Deadline setup, basically becoming the weakest link (fingers still crossed for a native, Pulse-independent Python API…).

But the main reason for me to start this thread is this: since we now have the requirement to keep three central services constantly operational (Mongo, the repository and Pulse), a central dispatcher mechanism is not a stretch at this point.

Why would we need one, though? Deadline seems to operate “just fine” with the current setup, and that is the main issue we are seeing. We would like it to operate perfectly, not just fine.
We are struggling to get farm CPU utilization above 25-30% with the current setup, and it usually comes down to the inability to implement ‘clever’ task assignment behaviors.

A couple of examples:

  1. We have sim jobs that require sequential rendering of frames by a single machine. Ideally, this is the same machine throughout, as every machine swap means restarting Max and reloading the scene, sometimes adding 5-10+ minutes per frame. Without a central dispatcher it's pretty much impossible to make sure a machine stays locked to its job, since it can find another higher-priority task at any time. It can't make an ‘overview’ decision. It's common for two sim jobs to keep swapping their machines… This problem is most visible with sim jobs, but it affects every single job. Machines randomly run from job to job, fully reinitializing each time. We have too many jobs per day in the queue for there to be static idle times where things “just cook”.
  2. Minimizing idle machine counts by dynamically adjusting machine limits on jobs. We run into this often, especially late at night. It's annoying to come in in the morning and find 90% of the farm idle while there are 50+ jobs sitting at a machine limit of 8… We now have a night shift of wranglers simply due to this issue ($$).
  3. Detecting ideal machines for jobs, and moving tasks between jobs based on performance.
  4. Spawning new slaves on machines that have a percentage of their CPUs unused, based on what the queued jobs might require.

With the monolithic slave application approach, all of these require hackery, and some can’t be solved at all (with the current feature set).

Anyway, opinions?

Since I’m the one who asked whether there was a plan to eventually move in that direction, you can probably guess what my opinion is on the matter. :wink:

As you alluded to, Deadline already requires a central dispatcher in a way… it’s just a passive data store whose logic has been distributed across all of the slaves. Unfortunately, this leaves the door open for less-than-optimal decision making on the part of the slaves, and limits the complexity and timeliness of the scheduling decisions that can be made (the simplest example being scanning for jobs instead of just being told to do work).

If the scheduling logic were moved into Pulse (making it the dispatcher), all job and task status updates were routed through it, and the Slave was stripped down to a pure task execution engine and event handler, I think the overall complexity of Deadline would drop dramatically, and its reliability and efficiency would increase. The serving of logs could even be broken out into its own process to help keep the complexity of the dispatcher down.

It may be blasphemy to speak of other queue managers here, but I think it's worth mentioning that Tractor's system of slots and “service keys” for slaves has, to me, set the current bar for task assignment (to say nothing of their graph-based task execution model), and the forthcoming update will continue improving on those features. I think it's safe to say that very little of what they're doing would be possible without a (highly efficient) central dispatcher.

I think a lot of what you’re asking for could be handled without a central scheduler. See my comments below.

I think a solution to this problem would be to improve how the slaves handle Housecleaning and Pending Job Scans when Pulse goes down. Currently, each slave has a 1/N chance of performing these operations between tasks, where N is the number of currently active slaves. A better approach might be to store the last time these operations were performed in the database, and then have a slave perform the operation once a certain interval has passed. We would still use the locking mechanism as well, to ensure that no more than one slave can perform these operations at a time. This should make the system more robust if Pulse goes down, and allow Deadline to handle these operations more reliably if a studio chooses not to run Pulse.
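For what it's worth, here is a minimal sketch of what that timestamp-plus-lock check could look like, assuming a MongoDB backend accessed via pymongo. The collection and field names are purely illustrative, not Deadline's actual schema:

```python
# Hypothetical interval-based housecleaning check. Assumes a seeded document:
# {"_id": "housecleaning", "lock_holder": None, "last_run": 0}
import time
from pymongo import MongoClient

HOUSECLEANING_INTERVAL = 300  # seconds between passes (made-up value)

db = MongoClient("mongodb://db-server:27017")["deadline"]

def run_housecleaning():
    """Placeholder for the actual work (pending job scans, stalled slave checks, etc.)."""
    pass

def try_perform_housecleaning(slave_name):
    now = time.time()
    # Atomically claim the lock, but only if the interval has elapsed and
    # no other slave currently holds it.
    claimed = db.maintenance.find_one_and_update(
        {
            "_id": "housecleaning",
            "lock_holder": None,
            "last_run": {"$lt": now - HOUSECLEANING_INTERVAL},
        },
        {"$set": {"lock_holder": slave_name}},
    )
    if claimed is None:
        return False  # ran recently, or another slave is already doing it

    try:
        run_housecleaning()
    finally:
        # Release the lock and record the completion time, even on failure.
        db.maintenance.update_one(
            {"_id": "housecleaning"},
            {"$set": {"lock_holder": None, "last_run": time.time()}},
        )
    return True
```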

We’ve also been considering separating the web service from Pulse and making it its own application, as well as building some sort of Pulse redundancy system.

What we could do here is tweak how the “Sequential” property for a job works. Currently, it ensures that a slave stays on the same job and renders its tasks in order, but only if there isn’t a higher-priority job based on Pool and Priority (not submit date/time). However, under a weighted or balanced scheduling algorithm, it doesn’t really help at all. The tweak could be to have a slave stick with a sequential job as long as it has tasks, regardless of any higher-priority jobs on the farm.
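For illustration only, that stickiness rule could amount to a short-circuit at the top of the dequeue logic. This is a hypothetical sketch with invented names, not Deadline's actual code:

```python
def choose_next_job(slave, candidate_jobs):
    """Hypothetical dequeue sketch: stick with a sequential job until it runs dry."""
    current = slave.current_job
    if current is not None and current.is_sequential and current.has_queued_tasks():
        # Ignore higher-priority jobs while the sequential job still has work.
        return current
    # Otherwise fall back to the normal pool/priority/weight ordering.
    return max(candidate_jobs, key=lambda job: job.scheduling_weight(), default=None)
```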

I thought one of the main reasons for adding the weighted scheduling system was that you wouldn’t have to use machine limits like this. Regardless, I don’t think a system like this requires a central scheduler.

Can you elaborate on this example some more? It sounds like you would want Deadline to kill a task on one machine and move it to another if the original machine is taking too long. I’m concerned that could result in a lot of wasted rendering time…

We are currently working on the early design phases of a slot/block system where a single slave could process different jobs depending on how many “blocks” it has free. It’s too early to get into any details at this point, but again, it’s something I’m sure we can accomplish without a central scheduler.

Cheers,
Ryan

Yes, this would definitely help. I do remember having seen issues with the locking mechanism, though, with several slaves getting stuck doing housecleaning. It's hard to make the lock mechanism robust when slaves can randomly be taken offline, requeued, rebooted, etc. These are critical services that require a stable server, not a random, volatile farm machine.

Definitely has my vote. When we last tried the web service, we could crash Pulse (and thus the farm) with two clicks from an Android phone.

It would definitely help with sim jobs, which we actually want the slaves to stick to. Previously, on Assburner, we simply assigned a large chunk size and let it cook that way. We can't do that with Deadline, since the chunks group the frames together (in Assburner the frames were kept separate), so you can't easily requeue frames individually, and if a frame fails in a chunk, the whole group gets requeued.
However, this would not help with the problem of ‘regular’ jobs, where the jumping-around behavior is just as prevalent, but less obvious. When you have 1000+ slaves, it's hard to keep track of the fact that machine X wasted 90% of its performance jumping between jobs instead of just staying where it was.
Without the full context of what's going on on the farm, a slave can't decide whether it's a smart idea to stick with what it's been doing (because another slave will simply render the potentially higher-priority task), or to just go to that higher-priority task like a sheep.

Yes, it definitely helps with that, but during the day and crunch periods (11-12pm and 5-7pm, when most jobs are submitted) we need limits based on job types. Certain jobs need to slow-cook while the farm is full.

What I meant was making dequeuing decisions based on the average render times of machines. If machine X rendered a couple of frames and took 4x longer than neighbouring machines, a central dispatcher could weight those slaves accordingly for assignment. Individual machines can't make such decisions without adding ‘random’ weights to whether or not they pick a task.
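To illustrate the kind of comparison I mean (purely a sketch, with made-up data structures):

```python
from statistics import median

def relative_speed(slave_name, frame_times_by_slave):
    """Compare a slave's average frame time on a job against the median frame
    time of the other slaves on the same job. > 1.0 means slower than its peers."""
    mine = frame_times_by_slave[slave_name]
    others = [t for name, times in frame_times_by_slave.items()
              if name != slave_name for t in times]
    if not mine or not others:
        return 1.0  # not enough data to compare
    return (sum(mine) / len(mine)) / median(others)

# A dispatcher with the full picture could stop assigning further tasks from
# this job to a slave whose relative_speed is, say, above 4.0.
```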

This is great to hear!!

While, as you said, most of these use cases can be solved to some degree without a central dispatcher (other than #1, which I think is a critical issue, but since there is no easy way to measure the wasted render time, it's hard to argue… it might be a nice idea to start measuring somehow: map out the ‘history tree’ of task assignments, then compare render times and the time similar slaves wasted jumping around), a central dispatcher would make all of these much, much easier to handle.

This can become an issue when using the balanced algorithm, as well as the weighted algorithm (if you apply weight to the number of rendering tasks). In order to maintain the “balance”, the slaves need to jump to the job that has the least number of rendering tasks. It sounds like what is needed is a way for a slave to give some preference to the job it’s already working on, but in a way that doesn’t circumvent the expected priority. I’m sure the same sort of logic would have to be applied by a central scheduler as well…

We’ll do some thinking here, but if you have any thoughts on how weight could be applied to the current job, we’re all ears.

Could you maybe use pools for this? Have a “slow-cooker” pool or something that is at top priority for a handful of machines but at a lower priority for others? That way, some of these jobs will still get rendered during the day, but then they can expand to the rest of the farm when things start clearing out.

We had a little discussion, and we think we have a solution:

  • For the balanced algorithm, we would add a task count buffer. Basically, it would subtract that buffer from the number of rendering tasks for the current job the slave is working on, which effectively pushes its place up in the dequeuing order. This would reduce the jumping around, because a slave would only drop its current job if there is another job in the queue that has fewer rendering tasks once the buffer is taken into account (a rough sketch follows after this list).

  • For the weighted algorithm, we would add a fifth weight option that would be given to the current job, and simply added to the overall job weight. Again, this will bump it up in the queue order, and if set appropriately relative to the Rendering Task weight, it should give you the same behavior as described above for the balanced algorithm.
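Here is a rough sketch of how that balanced-algorithm buffer could factor into the ordering (hypothetical names, not the actual implementation):

```python
def effective_rendering_tasks(job, current_job_id, task_buffer):
    """Ordering key for the balanced algorithm: jobs with fewer rendering tasks
    come first, but the slave's current job gets a head start via the buffer."""
    count = job.rendering_task_count
    if job.id == current_job_id:
        count = max(0, count - task_buffer)
    return count

def pick_job(candidate_jobs, current_job_id, task_buffer=3):
    # The slave only drops its current job if another job still has fewer
    # rendering tasks after the buffer is taken into account.
    return min(candidate_jobs, default=None,
               key=lambda job: effective_rendering_tasks(job, current_job_id, task_buffer))
```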

Thoughts?

Both of these ideas try to work around the problem, and might help somewhat while further increasing the complexity of dequeuing. A central dispatcher can look at the big picture and say: “OK, both machine A and B could move to either job X or Y. I'll keep A on X and B on Y, as those are the jobs they have been rendering already, and the preference is to avoid jumping around.”

I don’t think you can make such decisions easily without a central dispatcher, and it will be inevitable that machines keep moving around quite frequently.

Adding weight to the current job will displace the balance of the actual priorities set using the weights, but short of having such an overseer, I would take both of these suggestions.

The reason I started the thread, though, is that I think you would have a much easier time designing better, more production-friendly and more sophisticated dequeuing logic with a central dispatcher that has the ‘full picture’, compared to the current approach of trying to balance weights, priorities, pools, pool orders, secondary pools, groups, limit groups, machine limits, etc.

Thinking about it again for the Weighted algorithm, perhaps it makes sense to simply use the same Task Buffer that would be applied to the Balanced algorithm. It keeps things simpler because, rather than adding what seems like an arbitrary number to the overall weight, it only affects the result of applying the Rendering Task weight.

For example, let’s use the defaults:

Priority weight: 100
Submit Date weight: 3
Error weight: -10
Rendering Task weight: -100

We have two jobs A and B. For the sake of simplicity, both job A and B have a priority of 50, a submit date of 0, and an error count of 0. Job A is the current job and has 5 rendering tasks, while job B has 3 rendering tasks. This results in Job A having a weight of 5000+0-0-500=4500, and job B having a weight of 5000+0-0-300=4700. In this case, the slave will drop job A for job B.

If we had a task buffer of 3, then because job A is the current job, it will have a weight of 4800 (5000+0-0-200, since we’re subtracting the task buffer from the current rendering task count). So the slave will stick with job A. It would only move away from Job A if a job shows up that has 1 or 0 rendering tasks.
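To make that arithmetic concrete, here is the same example as a small, purely illustrative calculation (the function and its parameters are invented for the example):

```python
# Default weights from above.
PRIORITY_WEIGHT = 100
SUBMIT_DATE_WEIGHT = 3
ERROR_WEIGHT = -10
RENDERING_TASK_WEIGHT = -100

def job_weight(priority, submit_date, errors, rendering_tasks,
               is_current_job=False, task_buffer=0):
    # The buffer only benefits the job the slave is already working on, and the
    # effective rendering task count never drops below zero.
    if is_current_job:
        rendering_tasks = max(0, rendering_tasks - task_buffer)
    return (priority * PRIORITY_WEIGHT
            + submit_date * SUBMIT_DATE_WEIGHT
            + errors * ERROR_WEIGHT
            + rendering_tasks * RENDERING_TASK_WEIGHT)

# Without a task buffer: job A (current) loses to job B.
print(job_weight(50, 0, 0, 5, is_current_job=True))                  # 4500
print(job_weight(50, 0, 0, 3))                                       # 4700

# With a task buffer of 3: job A is bumped to 4800, so the slave stays put.
print(job_weight(50, 0, 0, 5, is_current_job=True, task_buffer=3))   # 4800
```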

Since we can just use the task buffer for both algorithms, I don’t think it adds much complexity to the current system. A central scheduler would still need something to determine if a slave should stay on its current job or move to another anyway. With this new system, if you don’t want a slave ever jumping around, then you would set the Task Buffer to a really high number like 100000.

A central scheduler would still have to take all those things into account. Offloading the scheduling algorithm from the slaves to a central scheduler wouldn’t change the algorithm itself. If studios want to write their own scheduling logic from scratch, then that’s where opening up the scheduling logic to scripting would come in handy (which is something we still want to support).

Cheers,
Ryan

The biggest reason I would see for a central dispatcher would be to also crunch farm simulations of schedules. E.g. I still really, really want it to run some hypothetical numbers on every job and give me an “intelligent” completion time/date for each job. This has to be done on something like Pulse, which makes educated hypotheses (much like Laszlo's dequeuing idea), maybe even doing some machine learning. So if a job was predicted to finish in 2 days (based on the folder of submission (aka the shot), the Max file size, previous similarly named jobs, etc.), it could check in with Shotgun and let the artist know it would be finished on Wednesday; if the artist needs it Tuesday for review with the director, they can take action to fix that. But it could also provide options, like “hey, you could bump the priority, but then this other job would be a day later.” And at a more granular level the system could “learn”: “hey, by and large, MachineX is 1/4 the speed of MachineY.” And it could say something like “I've run 100 different simulations, and reordering these assets would reduce the total render time of all jobs by 5 hours. Would you like to proceed?”
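Just as a toy illustration of the kind of estimate I mean (nothing to do with how Deadline or Shotgun actually work; every name here is invented), even a naive average over similar historical jobs would be a starting point:

```python
from datetime import datetime, timedelta
from statistics import mean

def estimate_completion(job, history):
    """Toy estimator: average the total render hours of previously completed
    jobs from the same shot, then project forward from now."""
    similar = [h for h in history
               if h["shot"] == job["shot"] and h["status"] == "completed"]
    if not similar:
        return None  # no basis for a prediction yet
    avg_hours = mean(h["total_render_hours"] for h in similar)
    return datetime.now() + timedelta(hours=avg_hours)

# The interesting part would be comparing this estimate against the artist's
# review deadline and flagging (or re-prioritizing) jobs that won't make it.
```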

Hey Gavin,

Running a simulation wouldn’t require a central dispatcher, since all the required information is already stored in the database. This could just be a separate tool, or something built into the Monitor. It could even be a Python script that just loads in all the current job data and evaluates it (although I’m sure performance would suffer).

Cheers,
Ryan

Well, I guess we’re getting into semantics then. If a central AI determines the best job rendering order for optimal rendering and stores each Task/Slave pairing (“Slave 05 -> Job AFBEDFGA390130 \ Task001, Slave 06 -> Job …”), then it’s not “dispatching”, but it is creating a task queue in the database, and the database is then acting as a central dispatcher, just with slaves passively checking in with it to see what they should be rendering next. I would call that a central dispatcher, just with PULL requests instead of PUSH.

2 more use cases for a central dispatcher:

  1. We have simulations that run in clusters. The simulation is broken up spatially into different quadrants, and each frame of the simulation is done for each quadrant on a different machine (a different job). Since a machine needs the results from all the quadrants before it can move on to the next frame, it's preferred to have machines of the ‘same’ performance simming all the quadrants. If one quadrant is simmed by a 24-core machine and another by an 8-core machine, the 24-core box will always be waiting on the results, essentially handicapping itself down to the slowest machine and wasting performance. A central dispatcher could ensure that all quadrants being simmed use the same core count, and optimize “upwards”: it COULD start with 8-core boxes, but if a 12/16/24-core box becomes available, it could reassign the next frame to that box. This kind of optimized scheduling is not currently possible (a toy sketch of this matching follows after this list).

  2. A “preferred” pool as a job property. Say you have a job that you want to render on machines with 24 cores. You could set up a special pool for those machines and set it as the job's ‘preferred pool’. If no 24-core machines are available, the job could pick up on slower machines, but only then. Currently, you can't really do this: it's the machines picking jobs, not the jobs picking machines.
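Here is a toy version of the matching a dispatcher could do for the first point (invented data structures, not a real API):

```python
def assign_quadrants(quadrant_jobs, idle_slaves):
    """Toy matcher: give all quadrants of one sim frame machines with the same
    core count, preferring the highest core count that can cover every quadrant."""
    by_cores = {}
    for slave in idle_slaves:
        by_cores.setdefault(slave["cores"], []).append(slave)

    needed = len(quadrant_jobs)
    candidates = [cores for cores, slaves in by_cores.items() if len(slaves) >= needed]
    if not candidates:
        return None  # wait for a matched set, or fall back to a mixed assignment
    best = max(candidates)
    return list(zip(quadrant_jobs, by_cores[best][:needed]))

# Example: four quadrant jobs land on four idle 24-core machines, so no
# quadrant is handicapped by a slower box.
```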

Another advantage is that this would not be an issue: http://forums.thinkboxsoftware.com/viewtopic.php?f=156&t=12066

We have 2-5 minute delays before tasks are picked up by a slave, while each slave manually checks the machine limits of jobs, limit groups, etc. This results in a 50-60% idle farm.

If we had a central dispatcher, they would have to do none of that…

It’s also possible that a central dispatcher could have the opposite effect.

It depends on how the aggregated average task completion rate (average task time divided by the number of tasks being processed in parallel) relates to the dispatching logic's cycle time. With a large state set, the amount of checking that has to happen is enormous. While a single dispatcher process has some advantages, it would still have to churn through all the state information for each task. If you have tens of thousands (or millions) of jobs, each with hundreds (or thousands) of tasks, and if those jobs and tasks have dozens of constraints like dependencies and limits, the cycle time of a central dispatcher could fall substantially behind demand, resulting in lost utilization while Slaves wait for task allocations. It's not just a matter of crunching an in-memory data structure to arrive at task allocations for slaves. All state decisions must be persisted to the database, which takes time (all while exogenous state changes are randomly pouring in).
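To put rough numbers on that relationship (all figures below are arbitrary, just to show the back-of-the-envelope math):

```python
# Back-of-the-envelope check of dispatcher cycle time vs. task completion rate.
average_task_seconds = 600      # an average task takes 10 minutes
tasks_in_parallel = 1000        # slaves rendering concurrently

# On average, one task finishes (and one slot needs a new allocation) every:
seconds_per_completion = average_task_seconds / tasks_in_parallel   # 0.6 s

dispatcher_cycle_seconds = 2.0  # time to evaluate the state set and persist decisions

if dispatcher_cycle_seconds > seconds_per_completion:
    # Slots free up faster than the dispatcher can fill them, so slaves sit
    # idle and the allocation backlog grows.
    print("dispatcher falls behind demand")
```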

The solution, of course, is to parallelize the dispatching logic. But how much?

At one end of the spectrum is the notion of a central dispatcher, and what we have now in Deadline is the other end of the spectrum, where dispatching is maximally “parallelized” (decentralized) out to each Slave. An optimal solution needs to balance the parallelization of dispatching logic against factors such as state connectedness and database locking overhead. Other factors, such as how nicely the database access patterns play with the database schema, and how sharding and replicas are deployed, can make or break performance. And the chaotic nature of resources (e.g. limits) greatly diminishes the efficacy of predictive approaches. And that's just the tip of the iceberg.

As I’ve mentioned, this is an active area of research for us. We’re looking at fruit at all heights of the tree: low fruit, in the form of recent patches to DL 6.2; mid fruit, in the improvements that have gone into Deadline 7; and high fruit, in the deeper changes being researched for Deadline 8.

…if it is designed and/or implemented poorly.

Using the idea of adapting Deadline’s current scheduling logic into a centralized model as a thought experiment is fine, but it probably wouldn’t have the best results. I think it’s important to keep in mind that that kind of shift would almost certainly require throwing out a lot of the current logic and backend, and possibly the whole scheduling model (pools, groups, secondary pools, etc.).

I continually bring Tractor up because they have been able to do this incredibly effectively. I obviously don’t have any numbers on what kind of scale Pixar’s farms are operating at, but I doubt we’re talking about small potatoes.

The hypothesis is that because a central dispatcher is all-seeing and all-knowing, it can make faster and better allocations. This implies a single process operating on the state. But a perfectly designed and implemented single-process centralized dispatcher may not be able to allocate as fast as demand. At least, not without parallelization.

But as soon as parallelization of the dispatching logic is introduced, it is technically no longer a centralized dispatcher, since now the parallel dispatchers have to manage contention on the state set.

My point was that framing the scheduling performance problem in terms of central vs. distributed is incorrect, since it’s a foregone conclusion that the dispatching logic will be distributed (if for no other reason than robustness). In terms of getting better task-to-slave allocations, it’s about providing better algorithms/heuristics/logic. And in terms of getting faster allocations, it’s about balancing parallel computational speed against the overhead factors of parallelization (state contention, network performance, etc.).

“…what kind of scale Pixar’s farms are operating at…”

It’s my understanding that Pixar itself does not use Tractor internally as their primary farm.

cb

I think there’s an important piece missing here: “…faster and better allocations without introducing excessive overhead on the system as a whole.”

Deadline’s network traffic has a lot of noise relative to signal, due to the need for every client to constantly poll, not just for state information (for Monitors, etc.), but also to see whether any work is available.

Artists here don’t ask me on a daily basis whether there are any new Nuke plugins available; I let everyone know when that happens. Deadline’s solution to this (to mitigate load on the database server) is to tell the slaves, “well, just come ask me every 5 days instead of every day,” but that solution is only as effective as the system’s ability to predict a best-fit interval between task availability.

There is also a much higher possibility for corruption of the central state (which we have already experienced, and I believe Laszlo has as well), due to the fact that all updates are being performed remotely over an interface that could (theoretically) fail without warning, and (to my knowledge) are not being sanity-checked. It sounds like there are some serious issues with database contention at scale as well, though the future move to document-level locking on Mongo’s part seems like it will help there.

(Emphasis mine)

That’s a leap that I can’t quite make with you (unless you’re including multithreaded code under the umbrella of “distributed logic.”)

I agree with you, and I want to reiterate that I don’t have any inherent bias against a distributed scheduling model. However, I think it’s impossible to look at some of the current problems with network overhead, database contention, state corruption, etc. and not see how a central dispatcher would be able to alleviate or eliminate them.

Interesting… That’s definitely something I wasn’t aware of.
