I just… I’ve completely lost all understanding of Deadline’s queueing system. It doesn’t follow any rhyme or reason for us anymore.
I’m looking at two identical jobs:
Same user, same priority, same pool, same group, same limit groups, same blacklist, same plugin. Half of the slaves are rendering one job, half the other. Then they get "bored" or something and all go back to rendering one… and then a few break off and render the other. We have a lower priority job that is currently using 3/4 of the farm. Don't ask me why. Which slave renders which job is now a complete toss of the dice.
It's been this way for all of 7.0 and now also with 7.2. Is there a master log somewhere I can send you, to see if you can make any sense of it?
It’s possible there is a DB state issue or bug of some kind, but I would consider those only after eliminating other possibilities. Let’s start with the basics. What is the exact Deadline version/build and which Job Scheduling algorithm is selected?
Also, you mentioned Blacklists. Despite the seemingly incomprehensible ordering, are the Blacklists being respected?
I’ve noticed a very similar (although maybe not exact) issue with Deadline and our farm since moving from 6.2 to 7.1.
My question: are you running Pulse, and if you do a Pulse reset, does that fix the problem?
When our machines go bad and start acting erratically, restarting Pulse usually solves it.
The symptoms for us include:
Machines not picking up jobs based on our Repository's "Job Scheduling Order"
Large variance in render times on machines of similar builds, or even the same machine. For example, frame 1 takes 1 hr and frame 2 takes 3 hrs on the same machine or same class of machine
Latest 7.2. It was doing it with 7.1 as well, but you guys thought maybe 7.2 had fixed it. It didn't. Default scheduling algorithm: Pool, Priority, First In/Out.
Pulse was restarted fairly recently (less than a week ago), and this has been going on since 7.1, so it doesn't seem to be a specific problem with the Pulse server flaking out here and there.
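For what it's worth, a "Pool, Priority, First In/Out" ordering can be sketched as a simple sort key. This is a hypothetical illustration (the job fields and pool ranks are made up, not Deadline's actual schema or implementation), but it shows why two jobs that are identical in every scheduling field give the scheduler no basis to prefer one over the other:

```python
from datetime import datetime

# Hypothetical job records; field names are illustrative, not Deadline's schema.
jobs = [
    {"name": "jobA", "pool_rank": 0, "priority": 50, "submitted": datetime(2015, 9, 1, 9, 0)},
    {"name": "jobB", "pool_rank": 0, "priority": 50, "submitted": datetime(2015, 9, 1, 9, 0)},
    {"name": "jobC", "pool_rank": 1, "priority": 90, "submitted": datetime(2015, 9, 1, 8, 0)},
]

def scheduling_key(job):
    # Pool first (lower rank = earlier in the slave's pool list),
    # then higher priority, then earlier submission time (first in, first out).
    return (job["pool_rank"], -job["priority"], job["submitted"])

ordered = sorted(jobs, key=scheduling_key)
print([j["name"] for j in ordered])  # → ['jobA', 'jobB', 'jobC']
```

Note that jobA and jobB tie on every key, so their relative order is effectively arbitrary, which is consistent with slaves bouncing between two "identical" jobs; and jobC sorts last despite its higher priority, because pool outranks priority in this ordering.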