Optimising / limiting multiple slave instances based on cores requested. Where are these abilities?

I’ve read some older threads on this, some a few years old, but I’m wondering: has there been any progress on controlling the number of tasks run across multiple slave instances per machine? For example, on a 32-thread instance I might run 32 slave instances for the ability to run up to 32 single-threaded tasks, but when I have a sim that needs 16 cores, or any geo processing that uses more than one core, I need to make other instances unavailable or things aren’t going to go well!

From my reading and my new / limited understanding, multiple slave instances, concurrent tasks, and CPU affinity cannot achieve this type of behaviour on a per-job basis.

I’m fairly new to Deadline. I’m familiar with other schedulers that I have used as an FX TD at Sony, Weta, and ILM, each with their own strengths and weaknesses; at small studios, some schedulers have had that ability. I do like to get the most out of my resources, and I’m starting to bump into this waste issue that I really hope is going to be handled somehow, if it isn’t already. If I’m just not familiar with my options yet, I apologise.

So far I’ve managed to automate the rollout and configuration of Deadline onto open-source infrastructure using Ansible, Vagrant, and Terraform to provision infrastructure in AWS. Now that I’ve got some of this working in Houdini 17.5 with its amazing Procedural Dependency Graph (PDG), I’m bumping into this problem of being able to use my licenses and cores efficiently.

To use something like an AWS 96-vCPU instance and get the most out of any costly software license, this ability is really needed.

When an instance starts a job, can I query other instances’ CPU utilisation and disable them from a plugin? Are there any hooks or other implementations people have used to partly solve this?
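From my reading of the scripting docs, an event plugin looks like a plausible place to hang this. A rough, untested sketch follows, assuming the documented RepositoryUtils calls behave as I expect; the Worker naming scheme and the use of ExtraInfo0 to carry the core request are my own inventions, not Deadline conventions:

```python
# CoreGate.py - rough event plugin sketch (untested).
# Assumptions: sibling Workers on a host are named "host", "host-02",
# "host-03", ... (my own convention), each sibling represents one core,
# and the job carries its core request in JobExtraInfo0.
from Deadline.Events import DeadlineEventListener
from Deadline.Scripting import RepositoryUtils

def GetDeadlineEventListener():
    return CoreGateListener()

def CleanupDeadlineEventListener(listener):
    listener.Cleanup()

class CoreGateListener(DeadlineEventListener):
    def __init__(self):
        self.OnSlaveStartingJobCallback += self.OnSlaveStartingJob

    def Cleanup(self):
        del self.OnSlaveStartingJobCallback

    def OnSlaveStartingJob(self, slaveName, job):
        cores = int(job.JobExtraInfo0 or "1")  # hypothetical core-units field
        host = slaveName.split("-")[0]
        # Disable (cores - 1) sibling instances so the Monitor reflects
        # what this machine can actually still accept.
        for i in range(2, cores + 1):
            sibling = "%s-%02d" % (host, i)
            settings = RepositoryUtils.GetSlaveSettings(sibling, True)
            settings.SlaveEnabled = False
            RepositoryUtils.SaveSlaveSettings(settings)
```

Re-enabling the siblings when the task finishes (e.g. in an OnJobFinished handler) would be needed too; I’ve left that out for brevity.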

I think that a job should be able to specify how many cores it is using (or core units; sometimes you might have tasks that should only use 0.5 cores, perhaps), and the scheduler should ensure the number of instances free/enabled doesn’t exceed the number of available units.

If there’s no way to achieve it easily, then I need to build a plugin to do something like this:

-A task must be sent with an extra attribute (core units).
-When the scheduler sends the task out, it must query the units free on a node, and only send it to a slave instance on that node if the count requested is less than the total free core units for that node. Task submission must be serialised per node to avoid race conditions (a minimal sketch of this bookkeeping follows the list).
-When a task is sent, it will disable / mark unavailable any extra instances based on the core count requested, so as not to misrepresent slave availability in the Monitor.

-This situation probably needs consideration too: as higher-core-count tasks step up the queue waiting to find a slot to start, cores need to be allowed to free up ahead of them in order for the next task to run. First In First Out / priority ordering might have to do this crudely. E.g. if the next task in line wants 16 cores in a heavy sim / geo-processing group, then you may have multiple render nodes in that group sit with empty slots until that job can start. It’s only for the time it takes the high-core-count job to start, though; once started, the next jobs in line can still fill those slots where possible, and only that group of nodes (in, say, the FX sim group) would be negatively affected until that point.
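Setting Deadline’s API aside, the broker-side bookkeeping itself is small. A minimal sketch of the per-node ledger, with the lock providing the per-node serialisation mentioned above (all names hypothetical):

```python
import threading

class NodeCoreLedger(object):
    """Tracks free core units for one node; hypothetical broker-side state."""
    def __init__(self, total_units):
        self.total = float(total_units)
        self.free = float(total_units)
        self.lock = threading.Lock()  # serialises dispatch per node

    def try_acquire(self, units):
        # Atomic check-and-reserve, so two dispatches can't both see the
        # same free units (the race condition noted above).
        with self.lock:
            if units <= self.free:
                self.free -= units
                return True
            return False

    def release(self, units):
        with self.lock:
            self.free = min(self.total, self.free + units)

# e.g. a 32-thread node taking a 16-core sim plus a 0.5-core task:
node = NodeCoreLedger(32)
assert node.try_acquire(16)
assert node.try_acquire(0.5)
assert not node.try_acquire(16)  # only 15.5 units left, task must wait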

I don’t want to focus on the next problem, RAM, here, but for consideration the same principle should be applied to memory, although with memory the attribute is one that should be dynamically alterable. E.g. if we render every 10th frame, we should be able to automatically interpolate a max-memory estimate across the whole sequence to predictably correct any max-RAM estimate a user provided; or, in the case of non-asynchronous tasks (simulation), just track that memory usage live, always, to decide whether anything else can be received by the node at all. Tasks could also use identifier tags (usually something like a node ID from the submitter / software, but the user could provide just a description for this) so that at submission the scheduler can estimate from its history in the DB what memory is probably needed. That last part is trickier, I guess, but some memory awareness would be good to see. Inaccuracy doesn’t matter too much: when a failure occurs because memory maxed out and the estimate was wrong, the job is just resubmitted with that memory request increased for the next run, which minimises the damage it can do.
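For the every-10th-frame case, the interpolation itself is simple; a minimal sketch (pure Python, all names mine) that turns sparse peak-RAM samples into a per-frame estimate:

```python
def estimate_peak_ram(samples, frame):
    """Linearly interpolate peak RAM (GB) for `frame` from sparse samples.

    samples: sorted list of (frame, peak_gb) pairs, e.g. every 10th frame.
    Clamps to the nearest sample outside the measured range.
    """
    if frame <= samples[0][0]:
        return samples[0][1]
    if frame >= samples[-1][0]:
        return samples[-1][1]
    for (f0, m0), (f1, m1) in zip(samples, samples[1:]):
        if f0 <= frame <= f1:
            t = (frame - f0) / float(f1 - f0)
            return m0 + t * (m1 - m0)

# measured peaks at frames 1, 11, 21 -> estimate frame 16
print(estimate_peak_ram([(1, 4.0), (11, 8.0), (21, 16.0)], 16))  # 12.0
```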

These types of features, especially core tracking, should be supported out of the box. Anyone who needs a render farm scheduler has resources expensive enough to need these abilities. Even with a single workstation, which is worth more than the car that I drive, I can’t afford to be wasteful.

If these types of problems are already solved to a good degree, and I’m just too new to grasp what’s available, I apologise. I really hope that Deadline, with its ability to provide a rendering backend for some very heavy computing in AWS, has plans to really nail this problem. It’s not cool to be so wasteful of rendering costs, of something like a 96-vCPU instance, or of users’ on-site systems, their own money, and most importantly the environment.

I look forward to hearing what I might be able to learn about this. Thanks for reading!

It looks like Limits could potentially solve this problem if they were extended further:
https://docs.thinkboxsoftware.com/products/deadline/10.0/1_User%20Manual/manual/limits.html

The only problem with Limits, I think, is that we can’t use them per machine; they appear to be global… Still, I feel like this could be the right path. We would need CPU stubs and RAM stubs per machine: if the number requested by a job was below the available count for a machine, that would work.
Perhaps we need a new type of limit, a per-machine limit, that could be defined by key: value pairs for each machine (sketched below).
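To make the key: value idea concrete (nothing below exists in Deadline today; it is purely a sketch of the proposed feature):

```python
# Hypothetical per-machine limit: a stub pool per machine rather than one
# global pool. A job requesting N CPU stubs could only dequeue on a machine
# whose pool still holds N stubs, and likewise for RAM stubs.
per_machine_limits = {
    "cpu_stubs": {"node-01": 32, "node-02": 48, "node-03": 96},
    "ram_stubs": {"node-01": 64, "node-02": 128, "node-03": 384},
}

def can_dequeue(machine, cpu_request, ram_request):
    return (per_machine_limits["cpu_stubs"][machine] >= cpu_request and
            per_machine_limits["ram_stubs"][machine] >= ram_request)
```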

This feature request seems to be referred to elsewhere in the forums as “slots”, I think. Is there still an intention to implement it, or to have per-slave limits?

If not, it would be great to have a blog post describing how to use the current features in Deadline to utilise CPUs efficiently in cases like Andrew describes above (with heterogeneous job-type requirements on heterogeneous farm machines).

// edit: by heterogeneous I mean a mix of high-core and low-core, or high-RAM and low-RAM

Hey folks,

There’s no really simple way to get ‘slots’ (to use that SGE terminology) working with the tools you’ve got in Deadline as-is. It was brought up a lot a couple of years ago as a must-have feature, but a general lack of demand since then has let it slide down the priority list.

I hate to leave you with this shrug of a response, but keep being noisy about how much you’d like this (and any other features for that matter). We are seeing it!

You might be able to use Pools, Limits, and Groups to help manage your hardware, but that tends to be case-by-case. I always like to drop a link to this blog post whenever I mention Pools, Limits, and Groups.

Let us know if you’ve got more questions and we’ll get back to you!

We would very much like that functionality too, and were quite surprised that Deadline doesn’t have it.

In our situation, we have a mixed render farm, some machines with e.g. 32 cores and some with 48. For various reasons (e.g. licensing, and the cost of having many blades vs fewer blades with more memory and CPU) we want to be able to stack multiple jobs per blade.

The problem is (and please correct me if I’m wrong), if I set my jobs to ‘2 concurrent tasks’ at 16 threads each, then the 48-core blades will just be sitting there with a third of the machine idle.

On other farms I’ve used, all you need to specify is a reservation of a number of cores, and the farm scheduling takes care of making sure all available cores on the machines get occupied. E.g. in the above case, I could have 3 x 16-thread tasks running on a 48-core blade, and 2 x 16-thread tasks running on the 32-core blade. Or maybe one of those 16-thread tasks could be replaced by several 2- or 4-thread tasks running simultaneously.

This is an extremely important feature, especially for those of us with small farms who need to get the maximum efficiency out of them.

When it comes to threading, Deadline has generally left the management to the OS. However, Concurrent Tasks are usually not the feature you are looking for here; Multiple Worker Instances are.

As you probably know, you can run any number of Workers on the same physical or virtual machine using a single Deadline license. Each of these Worker instances can dequeue a different job based on Pools, Groups and Limits, and the typical approach is to set these Workers to prefer dissimilar types of Jobs that have different bottlenecks.

For example, a Nuke or AE comp is mostly IO-bound and will rarely saturate all available cores, while a 3D rendering job will likely use all CPUs, with some IO at the very beginning but little after that. So running two Workers, where one is assigned to a Nuke group and the other to a V-Ray or Arnold (or whatever) group, would saturate both the CPUs and the IO of the machine, and would theoretically be more efficient than running one render node with just Nuke and another render node with just 3D rendering.

Oversubscribing the cores is of course handled by the OS, and is probably a better situation than limiting each Worker to a subset of all cores, where, as you mentioned, you could end up with CPUs sitting idle if your numbers don’t add up.

So in general we don’t recommend running two Workers that are doing exactly the same thing. While 3D renderers rarely scale linearly with core count, letting two copies of Arnold fight for 48 cores is a bad idea. But letting Arnold and Nuke fight for them is usually fair, due to their different usage patterns. Neither one will perform as fast as when running alone on the system, but they will squeeze out more performance by filling each other’s gaps, so in the end your resource utilization will be better.

Of course, you could also limit the CPU Affinity of the multiple Workers as needed.

Concurrent Tasks were meant for processes that are inherently single-threaded, which is why there is an option to limit the Concurrent Task count to the Task Limit, which defaults to the CPU count…

We also have customers who run multiple VMs on the same physical machine and oversubscribe the cores, while keeping the types of jobs these VMs process diversified as I suggested above. For example, on a 48-core machine you could have two 24-core VMs and one 8-core VM, ensuring there is no “slack” in the usage if any of them fails to saturate all assigned cores… The drawback of this approach is that one Deadline license is required per VM, so it ends up a bit more expensive.

I am not saying that Slots would be a bad idea, just that others are running multiple Jobs per machine and letting the OS and/or the virtualization handle the oversubscription to cores…

Thanks for your reply Bobo!

For our scenario, this is mainly about efficient use of licences on a small number of machines. As you explained, different tasks are able to saturate cores at different core counts, and in our scenario this changes a lot. Being able to put licences to use on AWS with >96 threads is attractive, but with Deadline it’s relatively more wasteful compared to SGE, which is able to stack up requests for cores dynamically.

Fluid simulation is always more efficient for us at lower core counts (4 vs 96), but we use higher core counts when the wait time for an artist is high and we’re willing to pay for the diminishing returns.

Being dynamic about slots per task with a single instance, to leverage a licence, is very important though, more than ever now with PDG and its abilities, since a single node may run sources, multiple sim wedges, post-processing, and flipbooks in a per-frame-dependent workflow for an entire shot, which is an order of magnitude more efficient in artist / real human time. As you can imagine, maintaining a per-frame-dependent workflow for that many tasks throughout a big tree means you may use anything from 2 to 64 threads spread across tasks, whatever is required to balance the load most effectively, and we are able to do this with a single licence.

So with SideFX PDG there could be any spread of tasks with different core requirements, and baking max core counts into multiple Worker instances on the same slave is not nice. Multiple Workers cannot change their CPU allocation easily, so although I experimented with the feature for a long time, writing custom services to change the core counts per slave on boot, it was a cause of pain and still severely limited. SGE doesn’t have this limitation, and I would love to see Deadline achieve similar abilities.

Human efficiency is more important in this scenario: we can kill sims a fraction of the way through when we can see the preview frames at the same time and decide to make a tweak. I presented some work on this topic at SIGGRAPH, on a project I started, Open Firehawk. We are using Deadline here: https://youtu.be/Ahw8pXu5RyY

Also, even in scenarios where efficiency for similar tasks may scale well with higher core counts, it’s still beneficial to run multiple tasks on fewer cores to saturate network IO and achieve a higher duty cycle, if we have very heavy network IO for a portion of the time.

E.g. if the network duty cycle is, say, 30% of the total sequential time for any given limited resource (bound by one licence, say), then concurrent tasks of the same type will reduce the end-to-end time until the network is nearly saturated. CPU saturation and throughput efficiency will also increase as the cores per task go down, until network saturation is reached (the toy model below puts rough numbers on this).
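A back-of-the-envelope version of that claim, assuming a fixed 30% network fraction per task and ignoring contention effects entirely (the function and numbers are illustrative only):

```python
def relative_throughput(k, network_fraction=0.3):
    """Toy steady-state model: k concurrent tasks of the same type, each
    spending `network_fraction` of its sequential runtime on network IO.
    Aggregate throughput scales with k until the shared link saturates,
    at 1 / network_fraction tasks' worth of traffic."""
    return min(float(k), 1.0 / network_fraction)

for k in (1, 2, 3, 4, 5):
    print("%d concurrent -> %.2fx" % (k, relative_throughput(k)))
# Saturates at ~3.33x: past 3-4 concurrent tasks the network is the bottleneck.
```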