I’m not sure if this has been reported before, but we have a slave that’s been stuck for over 3 days on a task that no longer exists. The logs show that it was attempting the task, and then it just switched to echoing this over and over again:
2014-09-09 11:15:12: Scheduler Thread - Task "0_1-1" could not be found because job with ID 540f3f7c2b64553b72a627a5 does not exist
2014-09-09 11:15:12: Scheduler Thread - Cancelling task...
2014-09-09 11:15:36: Scheduler Thread - Task "0_1-1" could not be found because job with ID 540f3f7c2b64553b72a627a5 does not exist
2014-09-09 11:15:36: Scheduler Thread - Cancelling task...
2014-09-09 11:15:59: Scheduler Thread - Task "0_1-1" could not be found because job with ID 540f3f7c2b64553b72a627a5 does not exist
2014-09-09 11:15:59: Scheduler Thread - Cancelling task...
2014-09-09 11:16:20: Scheduler Thread - Task "0_1-1" could not be found because job with ID 540f3f7c2b64553b72a627a5 does not exist
2014-09-09 11:16:20: Scheduler Thread - Cancelling task...
2014-09-09 11:16:44: Scheduler Thread - Task "0_1-1" could not be found because job with ID 540f3f7c2b64553b72a627a5 does not exist
2014-09-09 11:16:44: Scheduler Thread - Cancelling task...
This might actually be related to the slave log issue. I had seen this happen while I was debugging that issue, and I thought it might be related to the slave becoming unstable when dealing with the large logs. I made a note to look into it after improving the log handling, but I was never able to reproduce it again. That could mean it is related and is fixed now, or it could be completely coincidental. Either way, we’ll keep an eye on this one.
Good to know. Is there anything in the log before it starts printing out those messages? That would at least let us know what it was doing before it got stuck.
The last thing in the log before those messages started (about 25 seconds before) is just a STDOUT message from the process that’s being executed by our plugin.
Is your plugin an Advanced or a Simple plugin? Simple plugins should automatically detect if the render has been cancelled and react accordingly, but if an Advanced plugin is looping somewhere, it needs to manually check whether the render has been cancelled and break out (i.e., by calling FailRender). Note that once the Python environment is sandboxed in v8, the manual checking for Advanced plugins shouldn’t be required anymore.
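For reference, here is a rough sketch of the kind of check we mean inside an Advanced plugin’s render loop. Apart from FailRender, the method names below are placeholders for illustration only (check the scripting docs for the exact API on your version):

def RenderTasks(self):
    # Hypothetical per-frame loop in an Advanced plugin; every name here
    # except FailRender is a placeholder used purely for illustration.
    for frame in self.GetFramesToRender():
        if self.IsCanceled():
            # Bail out instead of looping on after a cancel request.
            self.FailRender("Task was cancelled while rendering frame %d" % frame)
            return
        self.RenderFrame(frame)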
Do you know if that’s the last line that should have been printed out by that process? Just wondering whether this issue occurs when the process is about to exit, or whether it was still in the middle of something. Speaking of which, do you know if the process itself is still running?
Thanks for the additional info. If you do see this happen again, it would be great to know if the rendering process is still running. I’m guessing that the “cancel” operation is either getting stuck when it’s trying to kill the task, or it gets stuck afterwards during cleanup.
It seems like this has been happening in other places around the farm. Here’s the tail of another log (from the rd-082 slave):
2014-09-15 16:24:37: 0: STDOUT: Validating nodes... done
2014-09-15 16:24:37: 0: STDOUT: luma.nukeRender : INFO : Prepping script for render...
2014-09-15 16:24:40: 0: STDOUT: luma.nukeRender : INFO : Resolved output files:
2014-09-15 16:24:40: 0: STDOUT: luma.nukeRender : INFO : filepath.FileSequence(u'xxx', format='database')
2014-09-15 16:24:40: 0: STDOUT: luma.nukeRender : INFO : filepath.FileSequence(u'xxx', format='database')
2014-09-15 16:24:40: 0: STDOUT: luma.nukeRender : INFO : filepath.FileSequence(u'xxx', format='database')
2014-09-15 16:24:40: 0: STDOUT: luma.nukeRender : INFO : Rendering node: xxx
2014-09-15 16:25:13: 0: STDOUT: .6.1.2.5.8.9.2.3.4.5.6.7.8.9
2014-09-15 16:25:13: 0: STDOUT: Writing xxx.dpx took 29.26 seconds
2014-09-15 16:25:14: 0: STDOUT: .4
2014-09-15 16:25:14: 0: STDOUT: Writing xxx.jpg took 0.47 seconds
2014-09-15 16:25:16: 0: STDOUT: .8
2014-09-15 16:25:16: 0: STDOUT: Writing xxx.jpg took 0.30 seconds
2014-09-15 16:39:56: Scheduler Thread - Task "17_1049-1051" could not be found because task has been modified:
2014-09-15 16:39:56: current status = Rendering, new status = Rendering
2014-09-15 16:39:56: current slave = rd-082, new slave = rd-106
2014-09-15 16:39:56: current frames = 1049-1051, new frames = 1049-1051
2014-09-15 16:39:56: Scheduler Thread - Cancelling task...
2014-09-15 16:40:17: Scheduler Thread - Task "17_1049-1051" could not be found because task has been modified:
2014-09-15 16:40:17: current status = Rendering, new status = Rendering
2014-09-15 16:40:17: current slave = rd-082, new slave = rd-106
2014-09-15 16:40:17: current frames = 1049-1051, new frames = 1049-1051
2014-09-15 16:40:17: Scheduler Thread - Cancelling task...
That’s the end of the log. And then 20 minutes later, the slave was marked as stalled (per the settings in our repo). The slave machine was left in an unusable state, and the entire process tree (including the slave process) was still there, but marked defunct.
Here’s the tail end of the launcher log (before the machine was rebooted):
2014-09-15 17:01:46: Checking for stalled slaves on this machine
2014-09-15 17:01:46: - Restarting slave rd-082 because it has stalled
2014-09-15 17:01:46: Sending command to slave: StopSlave
2014-09-15 17:01:56: Got reply: rd-082.luma-pictures.com: Sent "StopSlave" command. Result: "timed out while waiting for reply"
2014-09-15 17:02:26: Slave 'rd-082' failed to exit after 30 seconds, killing it...
2014-09-15 17:07:26: Updating Repository options
2014-09-15 17:07:26: - Remote Administration: enabled
2014-09-15 17:07:26: - Automatic Updates: enabled
2014-09-15 17:07:26: Checking for stalled slaves on this machine
2014-09-15 17:07:26: - Restarting slave rd-082 because it has stalled
2014-09-15 17:07:26: Sending command to slave: StopSlave
2014-09-15 17:07:37: Got reply: rd-082.luma-pictures.com: Sent "StopSlave" command. Result: "timed out while waiting for reply"
2014-09-15 17:08:07: Slave 'rd-082' failed to exit after 30 seconds, killing it...
2014-09-15 17:13:07: Updating Repository options
2014-09-15 17:13:07: - Remote Administration: enabled
2014-09-15 17:13:07: - Automatic Updates: enabled
2014-09-15 17:13:07: Checking for stalled slaves on this machine
2014-09-15 17:13:07: - Restarting slave rd-082 because it has stalled
2014-09-15 17:13:07: Sending command to slave: StopSlave
2014-09-15 17:13:17: Got reply: rd-082.luma-pictures.com: Sent "StopSlave" command. Result: "timed out while waiting for reply"
2014-09-15 17:13:47: Slave 'rd-082' failed to exit after 30 seconds, killing it...
2014-09-15 17:18:47: Updating Repository options
2014-09-15 17:18:47: - Remote Administration: enabled
2014-09-15 17:18:47: - Automatic Updates: enabled
2014-09-15 17:18:47: Checking for stalled slaves on this machine
2014-09-15 17:18:47: - Restarting slave rd-082 because it has stalled
2014-09-15 17:18:47: Sending command to slave: StopSlave
2014-09-15 17:18:57: Got reply: rd-082.luma-pictures.com: Sent "StopSlave" command. Result: "timed out while waiting for reply"
2014-09-15 17:19:27: Slave 'rd-082' failed to exit after 30 seconds, killing it...
The same thing happened on another machine that hadn’t been restarted in about a week. This was after a manual task re-queue:
2014-09-15 17:17:12: 0: STDOUT: luma.imagepath : WARNING : Skipping invalid knob: 'streamOpt'
2014-09-15 17:17:12: 0: STDOUT: luma.imagepath : WARNING : Skipping invalid knob: 'rfcCompat'
2014-09-15 17:17:12: 0: STDOUT: Writing xxx.mov took 0.17 seconds
2014-09-15 17:17:12: 0: STDOUT: Writing xxx.mov took 0.15 seconds
2014-09-15 17:19:08: 0: STDOUT: 2014-09-15 17:19:07.006 Nuke[79438:4127] This process is attempting to exclude an item from Time Machine by path without administrator privileges. This is not supported.
2014-09-15 17:19:08: 0: STDOUT: 2014-09-15 17:19:07.006 Nuke[79438:383b] This process is attempting to exclude an item from Time Machine by path without administrator privileges. This is not supported.
2014-09-15 19:05:27: Scheduler Thread - Task "0_1-1" could not be found because task has been modified:
2014-09-15 19:05:27: current status = Rendering, new status = Rendering
2014-09-15 19:05:27: current slave = rd-205-07, new slave = rd-206-07
2014-09-15 19:05:27: current frames = 1-1, new frames = 1-1
2014-09-15 19:05:27: Scheduler Thread - Cancelling task...
2014-09-15 19:05:50: Scheduler Thread - Task "0_1-1" could not be found because task has been modified:
2014-09-15 19:05:50: current status = Rendering, new status = Rendering
2014-09-15 19:05:50: current slave = rd-205-07, new slave = rd-206-07
2014-09-15 19:05:50: current frames = 1-1, new frames = 1-1
2014-09-15 19:05:50: Scheduler Thread - Cancelling task...
2014-09-15 19:06:12: Scheduler Thread - Task "0_1-1" could not be found because task has been modified:
2014-09-15 19:06:12: current status = Rendering, new status = Completed
2014-09-15 19:06:12: current slave = rd-205-07, new slave = rd-206-07
2014-09-15 19:06:12: current frames = 1-1, new frames = 1-1
2014-09-15 19:06:12: Scheduler Thread - Cancelling task...
2014-09-15 19:06:33: Scheduler Thread - Task "0_1-1" could not be found because task has been modified:
2014-09-15 19:06:33: current status = Rendering, new status = Completed
2014-09-15 19:06:33: current slave = rd-205-07, new slave = rd-206-07
2014-09-15 19:06:33: current frames = 1-1, new frames = 1-1
2014-09-15 19:06:33: Scheduler Thread - Cancelling task...
The task seemed to be stuck (which you can see from the timestamp gap in the logs after 17:19:08), so I re-queued it. After that, the slave went into the “Cancelling task” loop, where it would likely have stayed indefinitely had I not shut it down remotely (in this case, the slave process did actually respond to the remote shutdown command, unlike the example from my previous post). Also, in this case, the rendering process seems to have been terminated by the slave before I stopped the slave process.
Also, I went back and looked at the reports for my second-to-last post (about the rd-082 slave), and that was also a case where I manually re-queued a stuck task from the Monitor, after which the slave went comatose.
I’m starting to worry that Deadline has more than a little to do with these tasks getting stuck, since this never happened with Rush.
Thanks for the additional logs and info. We have this logged as a bug, and hopefully we can reproduce it so that we can figure out where things are getting stuck.
If you happen to come across any additional info regarding this issue, let us know!
It would be great to add some robustness to the cancellation handling; we find stuck slaves on occasion as well. The larger the farm, the harder these issues are to notice.
Whatever loop it is that’s printing those messages should probably have a fail-safe exit condition.
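Something as simple as a bounded retry would do. Here is a minimal sketch of the idea; all of the names are made up, since this is just to illustrate what a fail-safe could look like, not how the scheduler is actually written:

import time

def cancel_with_failsafe(request_cancel, is_cancelled,
                         max_attempts=10, retry_interval=20.0):
    # Keep re-issuing the cancel request, but give up after max_attempts
    # instead of spinning forever like the loops in the logs above.
    # The ~20 second interval mirrors the gaps between the log messages.
    for _ in range(max_attempts):
        request_cancel()
        if is_cancelled():
            return True
        time.sleep(retry_interval)
    # At this point the scheduler could mark the slave bad, restart it,
    # or at least stop flooding the log.
    return False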
This has been happening more and more frequently under heavy load. I’m seeing some slightly different behavior now as well where I’ll re-queue a task (because the render is stuck), another slave will pick it up, and the original slave will get stuck trying to switch to something else with a loop like this:
Scheduler Thread - Cancelling task...
Scheduler Thread - Task "46_1053-1053" could not be found because task has been modified:
current status = Rendering, new status = Rendering
current slave = rd-027, new slave = ws-033
current frames = 1053-1053, new frames = 1053-1053
Scheduler Thread - Cancelling task...
Scheduler Thread - Task "46_1053-1053" could not be found because task has been modified:
current status = Rendering, new status = Rendering
current slave = rd-027, new slave = ws-033
current frames = 1053-1053, new frames = 1053-1053
Scheduler Thread - Cancelling task...
and on and on indefinitely.
I don’t understand why the slave can’t abort what it’s doing…
We still have not had any luck reproducing this here.
We’re assuming that the reason it can’t abort is that there is a deadlock on the same thread that would check if the task had been cancelled. This will definitely be addressed in Deadline 8, because we will be able to simply kill the sandboxed process if it doesn’t cancel within a reasonable amount of time.
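To illustrate the theory (this is not Deadline’s actual code, just a simplified picture of the suspected failure mode): if the thread driving the render is stuck blocking on the render process, the cancellation check that lives on that same thread never gets a chance to run, so the “Cancelling task...” requests have nothing to act on.

import subprocess

def render_task(cmd, is_cancelled):
    # Simplified, single-threaded picture of the suspected problem: the
    # blocking wait below never returns while the render process hangs,
    # so the cancellation check after it is never reached.
    proc = subprocess.Popen(cmd)
    proc.wait()        # <-- stuck here if the render process hangs
    if is_cancelled():
        # Clean-up / failing the task would go here, but this branch can
        # never run while the thread is blocked in wait() above.
        pass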
However, it would still be nice to try to figure out why this problem is occurring. One way we could do this is to enable debug logging for the plugin these jobs are getting stuck on. This can be done by adding the following line to that plugin’s dlinit file:
DebugLogging=True
Note that this will add a lot of additional text to the logs, but if a task gets stuck with debug logging on, the debug output should give us a sense of what’s happening.
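For example, assuming the stuck jobs are Nuke jobs, the line would typically be appended to the Nuke plugin’s dlinit file in the repository, e.g. (adjust the path below for your repository root and for whichever plugin the jobs actually use):

<repository>/plugins/Nuke/Nuke.dlinit:
DebugLogging=True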