I’ve read some older threads on this, some a few years old, but I’m wondering whether there has been any progress on controlling the number of multi-slave tasks per machine? For example, on a 32-thread instance I might run 32 slave instances so I can run up to 32 single-threaded tasks, but when I have a sim that needs 16 cores, or any geo processing that uses more than one core, I need to make the other instances unavailable or things aren’t going to go well!
From my reading and my new / limited understanding, multiple slave instances, concurrent tasks, and CPU affinity cannot achieve this type of behavior on a per-job basis.
I’m fairly new to Deadline. I’m familiar with other schedulers that I’ve used as an FX TD at Sony, Weta, and ILM, each with their own strengths and weaknesses, and at small studios some schedulers have had this ability. I do like to get the most out of my resources… and I’m starting to bump into this waste issue that I really hope is going to be handled somehow, if it isn’t already. I apologise if I’m just not familiar with my options yet.
So far I’ve managed to automate the rollout and configuration of Deadline into open source infrastructure using Ansible, Vagrant, and Terraform to provision resources in AWS. Now that I’ve got some of this working in Houdini 17.5 with its amazing Procedural Dependency Graph, I’m bumping into the problem of using my licenses and cores efficiently.
To be able to use something like a 96 vCPU AWS instance and get the most out of any costly software license, this is really needed.
When an instance starts a job, can I query the other instances’ CPU utilization and disable them from a plugin? Are there any hooks, or other implementations people have done, that partly solve this?
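To make the question concrete, here is a minimal sketch of the kind of event plugin I’m imagining: when a slave is about to start a job, it disables sibling instances on the same machine according to a “CoreUnits” extra info key. The key name, the hostname-NN naming convention, and some of the property names are my own assumptions, and I haven’t tested this against the scripting API, so please treat it more as pseudocode than a working plugin:

```python
# HypotheticalCoreUnits.py -- untested sketch of a Deadline event plugin.
# Assumes a "CoreUnits" extra info key on the job and slave instances
# named like "render01-01", "render01-02", ... (both are my assumptions).
from Deadline.Events import DeadlineEventListener
from Deadline.Scripting import RepositoryUtils

def GetDeadlineEventListener():
    return CoreUnitsListener()

def CleanupDeadlineEventListener(eventListener):
    eventListener.Cleanup()

class CoreUnitsListener(DeadlineEventListener):
    def __init__(self):
        # Fires when a slave is about to start working on a job.
        self.OnSlaveStartingJobCallback += self.OnSlaveStartingJob

    def Cleanup(self):
        del self.OnSlaveStartingJobCallback

    def OnSlaveStartingJob(self, slaveName, job):
        # How many core units the job claims it needs (default to 1).
        coreUnits = int(job.GetJobExtraInfoKeyValue("CoreUnits") or 1)
        if coreUnits <= 1:
            return

        # Hypothetical naming convention: "render01-03" -> machine "render01".
        machine = slaveName.rsplit("-", 1)[0]

        # Disable (coreUnits - 1) sibling instances on the same machine so the
        # monitor doesn't advertise capacity this job is actually consuming.
        disabled = 0
        for name in RepositoryUtils.GetSlaveNames(True):
            if name == slaveName or not name.startswith(machine + "-"):
                continue
            settings = RepositoryUtils.GetSlaveSettings(name, True)
            if settings.SlaveEnabled:
                settings.SlaveEnabled = False
                RepositoryUtils.SaveSlaveSettings(settings)
                disabled += 1
            if disabled >= coreUnits - 1:
                break
```

Something would also need to re-enable the siblings when the job finishes (an OnJobFinished handler or a house-cleaning script), and the whole thing needs serialising to avoid the race condition I mention further down, but hopefully it shows the idea.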
I think that a job should be able to specify how many cores (or core units; sometimes you might have tasks that should only use 0.5 cores, perhaps) it is going to use, and the scheduler should ensure that the number of instances free/enabled on a machine doesn’t exceed the number of available units.
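On the submission side, carrying that attribute along looks straightforward; a job info file handed to deadlinecommand can include an extra info key (the “CoreUnits” name here is just my own convention, not something Deadline interprets itself):

```
Plugin=Houdini
Name=flip_sim_v012
Frames=1-240
ChunkSize=1
ExtraInfoKeyValue0=CoreUnits=16
```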
If there’s no way to achieve this easily, then I need to build a plugin to do something like this:
-A task must be sent with an extra attribute (core units).
-When the scheduler sends the task out, it must query the free core units on a node, and only send it to a slave instance on that node if the count requested is no greater than the free core units for that node. Task dispatch must be serialised per node to avoid race conditions (rough bookkeeping sketch after this list).
-When a task is sent, it will disable / mark unavailable any extra instances based on the core count requested, so as not to misrepresent slave availability in the monitor.
-Probably also need to consider this situation: as higher core count tasks step up in the queue waiting for a slot to start, cores need to be allowed to free up ahead of them so the next task can run. First In First Out / priority ordering might have to handle this crudely. E.g. if the next task in line wants 16 cores in a heavy sim / geo processing group, then multiple render nodes in that group may sit with empty slots until that job can start. It’s only for the time it takes for the high core count job to start though; once it has started, the next jobs in line can still fill those slots where possible, and only that group of nodes (say, the FX sim group) would be negatively affected until that point.
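To illustrate the bookkeeping in the second and third points above, here’s a tiny, Deadline-free sketch of the per-node core-unit ledger I mean. The names are mine and the real dispatch decisions would of course have to happen inside Deadline itself:

```python
# Hypothetical per-node core-unit ledger -- plain Python, no Deadline API.
import threading

class NodeLedger:
    def __init__(self, total_units):
        self.total_units = float(total_units)   # e.g. 32.0 on a 32-thread box
        self.free_units = float(total_units)
        self._lock = threading.Lock()           # serialise per-node decisions

    def try_reserve(self, units):
        """Reserve 'units' core units if available; return True on success."""
        with self._lock:
            if units <= self.free_units:
                self.free_units -= units
                return True
            return False                        # task must wait for units to free up

    def release(self, units):
        """Give the units back when the task finishes."""
        with self._lock:
            self.free_units = min(self.total_units, self.free_units + units)

# Usage: a 32-thread node running a 16-core sim plus fractional tasks.
node = NodeLedger(32)
assert node.try_reserve(16)      # heavy sim claims half the box
assert node.try_reserve(0.5)     # fractional core units work the same way
node.release(16)                 # sim finishes, its units free up again
```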
I don’t want to focus on the next problem of RAM here, but for consideration, the same principle should be applied to memory, although with memory the attribute should be dynamically alterable. E.g. if we render every 10th frame, we should be able to automatically interpolate a max memory estimate across the whole sequence to correct whatever max RAM estimate a user provided, or in the case of non-asynchronous tasks (simulation), just track that memory usage live to decide whether the node can receive anything else at all. Tasks could also use identifier tags (usually something like a node ID from the submitter / software, but the user could just provide a description) so the scheduler can estimate from the job history in the database what memory is probably needed at submission. That last part is trickier, I guess… but some memory awareness would be good to see. Inaccuracy doesn’t matter too much: when a failure occurs because memory maxed out and the estimate was wrong, the job is just resubmitted with the memory request increased for the next run, which minimises the damage it can do.
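As a toy illustration of the interpolation idea (made-up numbers, plain Python, nothing Deadline-specific): peak RAM sampled on every 10th frame is turned into a padded per-frame estimate for the whole sequence.

```python
# Given peak RAM measured on every 10th frame, estimate a per-frame
# memory ceiling for the whole sequence. Names and numbers are illustrative.
def interpolate_memory(samples, first_frame, last_frame, headroom=1.2):
    """samples: {frame: peak_gb} from the sparse render; returns {frame: est_gb}."""
    frames = sorted(samples)
    estimates = {}
    for f in range(first_frame, last_frame + 1):
        if f <= frames[0]:
            est = samples[frames[0]]
        elif f >= frames[-1]:
            est = samples[frames[-1]]
        else:
            # Find the sampled frames bracketing f and interpolate linearly.
            lo = max(s for s in frames if s <= f)
            hi = min(s for s in frames if s >= f)
            t = 0.0 if hi == lo else (f - lo) / float(hi - lo)
            est = samples[lo] + t * (samples[hi] - samples[lo])
        estimates[f] = est * headroom   # pad so a miss just means one resubmit
    return estimates

# e.g. peak RAM (GB) sampled on frames 1, 11, 21 of a 1-30 sequence
print(interpolate_memory({1: 8.0, 11: 14.0, 21: 26.0}, 1, 30))
```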
These types of features, especially core tracking, should be supported out of the box. Anyone who needs a render farm scheduler has resources expensive enough to need these abilities. Even with a single workstation, which is worth more than the car that I drive, I can’t afford to be wasteful.
If these types of problems are in fact already solved to a good degree and I’m just too new to grasp what’s available, I apologise. I really hope that Deadline, with its ability to provide a rendering backend for some very heavy computing in AWS, has plans to really nail this problem. It’s not cool to be wasteful of rendering costs, of something like a 96 vCPU instance, of users’ onsite systems and their own money, and most importantly of the environment.
I look forward to hearing what I might be able to learn about this. Thanks for reading!