Sorry for the misleading title - ‘tasks continually restarting’ is actually a side effect of this, I believe.
When I investigated further, I noticed the machine was running multiple slave instances.
Thanks for the information! A couple of follow-up questions…
Just to confirm, did you check this on a machine that was running more than one slave instance? The “ps” test you ran only shows one slave instance running on that machine.
Also, on a machine that has multiple instances like this running, can you grab the logs from that day (and maybe the previous day), zip/tar them up, and post them? You can find the logs on the render node in /var/log/Thinkbox/Deadline7.
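If it helps, something along these lines should confirm whether more than one slave process is running and bundle the logs for upload (the path assumes the default Deadline 7 install on Linux; adjust as needed):

    # list any running Deadline slave/launcher processes
    ps aux | grep -i [d]eadline

    # bundle the log folder for upload
    tar czf deadline-logs.tar.gz /var/log/Thinkbox/Deadline7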
Thanks for the logs! Based on the logs, it looks like two deadlinelaunchers are trying to start when the system boots up. This might have something to do with the problem.
Can you go to the /etc/init.d folder on this machine and see if there is more than one file that looks like ‘deadlinelauncherservice’ or ‘deadline#launcherservice’ (where # represents the Deadline version)? If there is more than one, try deleting the others EXCEPT for ‘deadline7launcherservice’, then reboot the machine and let us know if the problem happens again. If it does, send us the logs again.
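Something like the following should show any duplicates (the exact file names can vary, so double-check before deleting anything; ‘deadline6launcherservice’ below is just a hypothetical example of an older leftover script):

    # look for duplicate launcher service scripts
    ls -l /etc/init.d/ | grep -i launcher

    # remove any extras, keeping only deadline7launcherservice, e.g.:
    # rm /etc/init.d/deadline6launcherservice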
We also have a problem with tasks constantly restarting. It’s affecting a large chunk of our jobs. The machine picks up the job and starts rendering, then a couple of minutes later I notice that another machine takes over the task. Later, the first machine has this in the log:
2015-01-28 10:01:20: 0: INFO: STARTED
2015-01-28 10:01:21: 0: INFO: Lightning: Render frame 990
2015-01-28 10:04:50: Scheduler Thread - Task "39_990-990" could not be found because task has been modified:
2015-01-28 10:04:50: current status = Rendering, new status = Rendering
2015-01-28 10:04:50: current slave = LAPRO0623, new slave = LAPRO0656
2015-01-28 10:04:50: current frames = 990-990, new frames = 990-990
2015-01-28 10:04:50: Scheduler Thread - Cancelling task...
2015-01-28 10:04:51: 0: In the process of canceling current task: ignoring exception thrown by PluginLoader
2015-01-28 10:04:51: 0: Unloading plugin: 3dsmax
2015-01-28 10:04:58: Scheduler Thread - In the process of canceling current tasks: ignoring exception thrown by render thread 0
No error log, no requeue log. In fact, max is usually still rendering when this happens.
This is happening on jobs that were rendering happily for a day; then, since yesterday around 6pm, they just constantly requeue…
This is happening because you likely still have some slaves (or pulse) running an older version of Deadline 7. You can use the version column in the Slave list in the Monitor to check the versions of the slaves. It is highly recommended to get all machines upgraded to 7.0.2.3 (the current public release).
Note that we had been running without any issues for 2 weeks. This started happening yesterday (and is still happening) on a large chunk of jobs, none of which have progressed a single frame (while other jobs are going just fine).
All signs point to this being unrelated to version discrepancies. We are mostly on 7.0.2.2; afaik .3 only had a minor MAXScript-related fix.
I’ll move all machines to 7.0.2.2 to make sure it’s consistent, and report back.
I suspect that perhaps an old slave (or pulse) was started up yesterday, and that’s what is causing this. Have you checked the pulse list in the Monitor to see if there is another pulse running?
Also, in your slave list, do you have any slave lines that are mostly blank (and the slave is a blue color)? If so, those would be old versions of the slave as well.
I did find a couple of machines that were blue that weren’t even supposed to be in the slave list (administrative images etc.); I disabled them ~10 minutes ago. Monitoring now.
It must have been one of those rogue machines with an old image. I am starting to see some tasks finish now…
I really wish we could globally turn off slaves performing pulse operations so that we could eliminate this risk in a controlled manner.
With 2k+ slaves, it’s practically impossible to control what’s going on whenever a breaking change is introduced in a Deadline build, or there is some disruption in slave-to-pulse communication. Currently, slaves acting as fallback pulse machines provides no redundancy, only risk. I think with the new “multiple pulse” feature, we should be able to create a robust pulse network without relying on the slaves.
Really glad to hear you were able to find the rogue slaves. This was a really nasty bug, and your idea of having a global option to disable housecleaning for the slaves is a good one. We’ll add that to the todo list for 7.1.
Thanks Ryan. Yeah, am I ever glad we could sort this out with relative ease. I had a near heart attack this morning, haha.
Thanks for adding that switch in 7.1; I think it will provide peace of mind (and it will also light a fire under our bottoms to set up the redundant pulses).