
Machine priority

In our farm we have a handful of faster workstations. We have a limit in place because of the number of licenses we own, and we want the faster machines to always be the ones rendering.

Reading this, I was hoping I could do it by giving the faster machines the primary pool while giving the rest only the secondary pool. But that doesn’t seem to be the case. If they come online at the same time, jobs seem to get distributed randomly regardless of primary/secondary pool. And once a slower box has picked something up, it keeps picking up new jobs, never letting the faster ones pick anything up.

Any suggestions, please?


Slaves control what they work on, and there isn’t any coordination between Slaves. A secondary pool just means that the Slave scans a second time for work if it didn’t find anything in its primary pool.
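
To make that concrete, here’s a rough Python sketch of the scan order just described. The function and attribute names are purely illustrative, not Deadline’s actual internals; the point is that every Slave schedules for itself and machine speed never enters into it.

```python
# Illustrative sketch only -- these names are NOT real Deadline internals.
# Each Slave runs its own scan; there is no central scheduler comparing machines.

def find_task(slave, repository):
    # First pass: jobs in the Slave's primary pool.
    task = repository.dequeue_task(pools=[slave.primary_pool])
    if task is not None:
        return task
    # Second pass: fall back to the secondary pool if nothing was found.
    # Note that a "fast" Slave gets no preference over a "slow" one here.
    return repository.dequeue_task(pools=[slave.secondary_pool])
```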

This is a Deadline question as old as anything though:

This may become the new thread I reference for this question though. Feel free to add your +1 to this (it helps me track who needs it, as opposed to just having the heart on the topic).

Thanks Eamsler. I hope this feature gets implemented.

Our silly workaround is to force all the slower ones to go offline for a minute or two. Once the faster ones pick up jobs, they’ll keep picking up more jobs :frowning:

I think the best plan I’ve heard of is to allow cascading of limit groups where the “slow” limit could not be used until the “fast” limit filled to X number of machines. I’m going to go try and find the dev issue for this and add my new results to it.
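
For illustration, here’s a hypothetical sketch of what such a cascading check might look like. Nothing here exists in Deadline today; the names and numbers are invented purely to show the idea that the slow limit only opens once the fast limit has filled.

```python
# Hypothetical cascading-limit check -- this feature does not exist in Deadline;
# all names and numbers are made up to illustrate the idea.

def may_acquire_slow_stub(fast_in_use: int, fast_threshold: int,
                          slow_in_use: int, slow_max: int) -> bool:
    """The 'slow' limit can't hand out stubs until the 'fast' limit has filled to X machines."""
    if fast_in_use < fast_threshold:
        return False   # the fast machines haven't filled their quota yet
    return slow_in_use < slow_max

# e.g. require 20 fast machines before allowing up to 30 slow ones:
print(may_acquire_slow_stub(fast_in_use=12, fast_threshold=20, slow_in_use=0, slow_max=30))  # False
print(may_acquire_slow_stub(fast_in_use=20, fast_threshold=20, slow_in_use=5, slow_max=30))  # True
```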


Sounds very interesting. How do I set up this cascading limit?

Oh, it doesn’t exist yet. I usually just challenge customers and my team to think of an interesting way to solve the hard problems and this has been the best one (in my opinion).

There are issues with that theoretical system, such as the manual setup it would require. Some didn’t agree that it truly solves the problem. Another problem: if the slow machines are the only ones that can pick up a certain job, the farm would stall because the fast limit would never fill. It’s just a hard problem. :slight_smile:

Sounds very interesting. It’s something that we’ve also been pondering from time to time.

Couldn’t you just use something like the Job Candidate Filter in the Monitor to check whether there are no candidates, and then fall back to the current behaviour?

How long would the Slave wait? It might be configurable.

It’s all fairly messy. Right now I’m collecting all of the requests in one spot so we can sit down one day and hash it out again. This is really the biggest unsolvable issue we’ve had, because of our lack of a central scheduler.

Configurable sounds the most reasonable way to handle the wait.

Logically I think it shouldn’t be long since the goal is just to prioritize faster machines.

I need to revive this thread because our department is running into lots of problems with this.

  • We can do nothing to get the faster machines to consistently pick jobs up. Even if we disable all the rest to force them to, once we re-enable the rest, the fast machines still lose their grip on jobs and go idle.

  • We cannot control the exact number of workstations assigned to each project. Once the limit has been reached, we’re pretty much relying on luck as to which workstations pick up the jobs. Sometimes the machines with pool A as their priority pick up most of the jobs, leaving pools B and C with very few machines. The pool distribution is decided by the managers, and they’re coming down on us hard about this issue.

I’ve been thinking about using my own script to control exactly how many Slaves come online, ordered by my own priority, but the plan turns really complicated really quickly because we are not rendering with just one plugin/application on the farm : /

Right now our managers are planning to acquire more dedicated render nodes, but it would be such a waste if we cannot force them to ALWAYS pick jobs up.

The “best” workaround I’ve found so far is to use Power Management to start Slaves as work arrives on the farm, with “Override startup order” set so the fastest machines come up first. But again, the scheduling model doesn’t really support this, because all Slaves are equal regardless of pools or groups.

You would want to have “Send Remote Command to Start Slave…” enabled, as that will ensure the Slave gets started even if the machine is already running. It would drive artists a bit nuts if it’s run as a standard app, though, as it will start the Slave on their machine every time the Power Management check runs.

Thanks Eamsler. The problem, however, is that it is not guaranteed : / When our workstations come online, around 20% of our 24/7 nodes will stop rendering, giving room to the workstations for some reason. So being the first to come online does not mean a machine will always keep rendering.

I’ve already developed an early prototype for our studio’s use, where we store our own limit and machine priority and only start as many Slaves as we need. However, the task is becoming more complicated now with extra features requested by our CG Sup. Without going into too much detail: instead of using start/stop Slave as the way to control the limit, we’ll be using this tool to assign pools instead.

This sounds like a licensing problem. Do the idle nodes turn a peach/orange colour? You may just need to buy 20% more licenses.

Pools can be a good solution. Their purpose is to group jobs by a secondary priority that can be arbitrary on a per-Slave basis, but they can work for other purposes too.

No, we have no orange nodes. We have 100+ Deadline licenses. Our Redshift plugin only has a limit of 50 or so (we duplicated it from MayaBatch to give it its own limit, separate from Vray). When all workstations come online at night or during the weekend, there are 20-30 idle stations after the limit’s been met.

And about secondary pools, that was the way I hoped it would work, but it’s not. Let me try to explain what we ran into during our experimentation.

We have around 20 top-tier nodes. On these, we set the pool to be used as the primary pool. Let’s say, main.

On the rest of the workstations we only give them the sub pool.

Still, when all the sub machines come online, some of the main machines WILL go idle.

Is this something that could be accommodated in the Balancer?
I.e., have on-prem balancing as well as cloud orchestration, so that Workers can be activated based on configurable algorithms?

My current plan is to disable workers on slower machines while the higher specification machines are available to accept jobs.

The Balancer is on a deprecation path at the moment, but you could in theory write a plugin that would activate machines in order.

Actually, thinking a bit further down the line, you could enable/disable machines.

What works out of the box is enabling Power Management, as it has the ability to wake machines in a specific order.
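
For anyone wanting to script the enable/disable route, here’s a minimal sketch of the ordering logic only. This is not an existing Deadline feature: `set_worker_enabled` is a placeholder you would wire up to whatever enable/disable mechanism your Deadline version exposes (Monitor scripting, deadlinecommand, etc.), and the fastest-first list is assumed to be maintained by you.

```python
# Minimal sketch: keep only the fastest N Workers enabled, disable the rest.
# 'set_worker_enabled' is a placeholder -- connect it to whatever enable/disable
# mechanism your Deadline version provides. The ordering below is assumed to be
# maintained by you (fastest machine first).

WORKERS_FASTEST_FIRST = ["ws-fast-01", "ws-fast-02", "ws-mid-01", "ws-slow-01"]

def set_worker_enabled(name: str, enabled: bool) -> None:
    # Placeholder only -- replace with a real call for your setup.
    print(f"{'enable' if enabled else 'disable'} {name}")

def apply_priority(active_slots: int) -> None:
    """Enable the first 'active_slots' Workers in the list and disable the rest."""
    for index, name in enumerate(WORKERS_FASTEST_FIRST):
        set_worker_enabled(name, enabled=(index < active_slots))

apply_priority(active_slots=2)
```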

This is a 15-year-old problem that customers have been asking to have fixed, and Thinkbox has refused to do anything about it. If this were a directive that came down from the top, it would be fixed within a week. It’s pretty clear to me that our needs don’t matter.

Eamsler, thanks for trying to help but band-aids are not the solution. If it comes down to buying a new render farm or giving up on Deadline, then I’ll be looking elsewhere.


An update: I left my previous studio a year ago, but before leaving I wrote my own solution.

I put all machines into our own database with a speed column. I then query the list sorted by speed and send the start-Slave command to just enough machines to fill our Redshift license count, plus 5 or so as backup. To the rest of the Slaves I send a force-stop command. The script was set to run on a 15-minute interval, with tools for artists to schedule, up to a month in advance, when they wanted their workstation included in or excluded from the farm.
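
For anyone wanting to try something similar, here’s a minimal sketch of that loop under a few assumptions: sqlite3 stands in for the studio database, the `available` flag stands in for the artist scheduling tool, and the `deadlinecommand RemoteControl` syntax and the "LaunchSlave"/"ForceStopSlave" command names should be verified against your Deadline version’s documentation before relying on them.

```python
# Minimal sketch of the approach described above, not the original studio tool.
# Assumptions: sqlite3 stands in for "our own database"; the 'available' column
# stands in for the artist scheduling tool; the deadlinecommand syntax and remote
# command names must be checked against your Deadline version.
import sqlite3
import subprocess

REDSHIFT_LIMIT = 50   # assumed size of the Redshift limit
BACKUP_SLAVES = 5     # "plus 5 or so as backup"

def machines_by_speed(db_path: str) -> list:
    """Return available machine names, fastest first, from a table (name, speed, available)."""
    with sqlite3.connect(db_path) as conn:
        rows = conn.execute(
            "SELECT name FROM machines WHERE available = 1 ORDER BY speed DESC"
        ).fetchall()
    return [name for (name,) in rows]

def remote(machine: str, command: str) -> None:
    # Assumption: 'deadlinecommand RemoteControl <machine> <command>' reaches the
    # Launcher on that machine -- check your version's docs for the exact syntax.
    subprocess.run(["deadlinecommand", "RemoteControl", machine, command], check=False)

def balance(db_path: str) -> None:
    wanted = REDSHIFT_LIMIT + BACKUP_SLAVES
    machines = machines_by_speed(db_path)
    for name in machines[:wanted]:
        remote(name, "LaunchSlave")      # start only the fastest machines
    for name in machines[wanted:]:
        remote(name, "ForceStopSlave")   # everything slower gets shut down

if __name__ == "__main__":
    # The original ran on a 15-minute interval (cron / scheduled task).
    balance("farm_machines.db")
```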
