AWS Thinkbox Discussion Forums

Slave stuck trying to cancel task

I’m not sure if this has been reported before, but we have a slave that’s been stuck on a task that doesn’t exist for over 3 days. The logs show that it was attempting the task, and then it just switched to echoing this over and over again:

2014-09-09 11:15:12: Scheduler Thread - Task "0_1-1" could not be found because job with ID 540f3f7c2b64553b72a627a5 does not exist
2014-09-09 11:15:12: Scheduler Thread - Cancelling task...
2014-09-09 11:15:36: Scheduler Thread - Task "0_1-1" could not be found because job with ID 540f3f7c2b64553b72a627a5 does not exist
2014-09-09 11:15:36: Scheduler Thread - Cancelling task...
2014-09-09 11:15:59: Scheduler Thread - Task "0_1-1" could not be found because job with ID 540f3f7c2b64553b72a627a5 does not exist
2014-09-09 11:15:59: Scheduler Thread - Cancelling task...
2014-09-09 11:16:20: Scheduler Thread - Task "0_1-1" could not be found because job with ID 540f3f7c2b64553b72a627a5 does not exist
2014-09-09 11:16:20: Scheduler Thread - Cancelling task...
2014-09-09 11:16:44: Scheduler Thread - Task "0_1-1" could not be found because job with ID 540f3f7c2b64553b72a627a5 does not exist
2014-09-09 11:16:44: Scheduler Thread - Cancelling task...

This might actually be related to the slave log issue. I had seen this happen while I was debugging that issue, and I thought it might be related to the slave becoming unstable when dealing with the large logs. I made a note to look into this after improving the log handling, but afterwards I was never able to reproduce it. That could mean it is related and is now fixed, or it could be completely coincidental. Either way, we’ll continue to keep our eyes on this one.

Cheers,
Ryan

Hmm, OK. For what it’s worth, this wasn’t one of those jobs with the heinous amounts of output.

Good to know. Is there anything in the log before it starts printing out those messages? That would at least let us know what it was doing before it got stuck.

The last thing in the log before those messages started (about 25 seconds before) is just a STDOUT message from the process that’s being executed by our plugin.

Is your plugin an Advanced or a Simple plugin? Simple plugins should automatically detect if the render has been cancelled and react accordingly, but if an Advanced plugin is looping somewhere, it needs to check manually whether the render has been cancelled and break out (i.e. by calling FailRender). Note that once the Python environment is sandboxed in v8, the manual checking for Advanced plugins shouldn’t be required anymore.
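To illustrate the Advanced case, the render loop needs something along these lines. This is only a rough sketch: FailRender is the call mentioned above, while IsCanceled() and RenderFrame() are hypothetical placeholders rather than confirmed Deadline API calls.

# Rough sketch only: a manual cancellation check inside an Advanced plugin's
# render loop. FailRender stands in for the real call mentioned above;
# IsCanceled() and RenderFrame() are hypothetical placeholders.
class ExampleAdvancedRenderLoop(object):
    def IsCanceled(self):
        # Placeholder: however the plugin finds out the task was cancelled.
        return False

    def RenderFrame(self, frame):
        # Placeholder: the actual per-frame work.
        pass

    def FailRender(self, message):
        # Stands in for the real FailRender call on the plugin.
        raise RuntimeError(message)

    def RenderTasks(self, frames):
        for frame in frames:
            # Check on every iteration so the loop can't spin forever after
            # the task has been re-queued or deleted out from under it.
            if self.IsCanceled():
                self.FailRender("Task was cancelled while rendering.")
                return
            self.RenderFrame(frame)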

It’s a Simple plugin.

Thanks for confirming!

Do you know if that’s the last line that should have been printed out by that process? Just wondering if this issue occurs when the process is about to exit, or if it was still in the middle of something. Speaking of which, do you know if the process itself is still running?

No, there’s more that should have been printed. I guess maybe the process was killed by the slave.

Unfortunately I didn’t think to check before restarting the slave… my mistake.

Thanks for the additional info. If you do see this happen again, it would be great to know whether the rendering process is still running. I’m guessing that the “cancel” operation is either getting stuck while trying to kill the task, or getting stuck afterwards during cleanup.

Cheers,
Ryan

It seems like this has been happening in other places around the farm. Here’s the tail of another log (from the rd-082 slave):

2014-09-15 16:24:37: 0: STDOUT: Validating nodes... done
2014-09-15 16:24:37: 0: STDOUT: luma.nukeRender : INFO : Prepping script for render...
2014-09-15 16:24:40: 0: STDOUT: luma.nukeRender : INFO : Resolved output files:
2014-09-15 16:24:40: 0: STDOUT: luma.nukeRender : INFO : filepath.FileSequence(u'xxx', format='database')
2014-09-15 16:24:40: 0: STDOUT: luma.nukeRender : INFO : filepath.FileSequence(u'xxx', format='database')
2014-09-15 16:24:40: 0: STDOUT: luma.nukeRender : INFO : filepath.FileSequence(u'xxx', format='database')
2014-09-15 16:24:40: 0: STDOUT: luma.nukeRender : INFO : Rendering node: xxx
2014-09-15 16:25:13: 0: STDOUT: .6.1.2.5.8.9.2.3.4.5.6.7.8.9
2014-09-15 16:25:13: 0: STDOUT: Writing xxx.dpx took 29.26 seconds
2014-09-15 16:25:14: 0: STDOUT: .4
2014-09-15 16:25:14: 0: STDOUT: Writing xxx.jpg took 0.47 seconds
2014-09-15 16:25:16: 0: STDOUT: .8
2014-09-15 16:25:16: 0: STDOUT: Writing xxx.jpg took 0.30 seconds
2014-09-15 16:39:56: Scheduler Thread - Task "17_1049-1051" could not be found because task has been modified:
2014-09-15 16:39:56: current status = Rendering, new status = Rendering
2014-09-15 16:39:56: current slave = rd-082, new slave = rd-106
2014-09-15 16:39:56: current frames = 1049-1051, new frames = 1049-1051
2014-09-15 16:39:56: Scheduler Thread - Cancelling task...
2014-09-15 16:40:17: Scheduler Thread - Task "17_1049-1051" could not be found because task has been modified:
2014-09-15 16:40:17: current status = Rendering, new status = Rendering
2014-09-15 16:40:17: current slave = rd-082, new slave = rd-106
2014-09-15 16:40:17: current frames = 1049-1051, new frames = 1049-1051
2014-09-15 16:40:17: Scheduler Thread - Cancelling task...

That’s the end of the log. And then 20 minutes later, the slave was marked as stalled (per the settings in our repo). The slave machine was left in an unusable state, and the entire process tree (including the slave process) was still there, but marked defunct.

Here’s the tail end of the launcher log (before the machine was rebooted):

2014-09-15 17:01:46: Checking for stalled slaves on this machine
2014-09-15 17:01:46: - Restarting slave rd-082 because it has stalled
2014-09-15 17:01:46: Sending command to slave: StopSlave
2014-09-15 17:01:56: Got reply: rd-082.luma-pictures.com: Sent "StopSlave" command. Result: "timed out while waiting for reply"
2014-09-15 17:02:26: Slave 'rd-082' failed to exit after 30 seconds, killing it...
2014-09-15 17:07:26: Updating Repository options
2014-09-15 17:07:26: - Remote Administration: enabled
2014-09-15 17:07:26: - Automatic Updates: enabled
2014-09-15 17:07:26: Checking for stalled slaves on this machine
2014-09-15 17:07:26: - Restarting slave rd-082 because it has stalled
2014-09-15 17:07:26: Sending command to slave: StopSlave
2014-09-15 17:07:37: Got reply: rd-082.luma-pictures.com: Sent "StopSlave" command. Result: "timed out while waiting for reply"
2014-09-15 17:08:07: Slave 'rd-082' failed to exit after 30 seconds, killing it...
2014-09-15 17:13:07: Updating Repository options
2014-09-15 17:13:07: - Remote Administration: enabled
2014-09-15 17:13:07: - Automatic Updates: enabled
2014-09-15 17:13:07: Checking for stalled slaves on this machine
2014-09-15 17:13:07: - Restarting slave rd-082 because it has stalled
2014-09-15 17:13:07: Sending command to slave: StopSlave
2014-09-15 17:13:17: Got reply: rd-082.luma-pictures.com: Sent "StopSlave" command. Result: "timed out while waiting for reply"
2014-09-15 17:13:47: Slave 'rd-082' failed to exit after 30 seconds, killing it...
2014-09-15 17:18:47: Updating Repository options
2014-09-15 17:18:47: - Remote Administration: enabled
2014-09-15 17:18:47: - Automatic Updates: enabled
2014-09-15 17:18:47: Checking for stalled slaves on this machine
2014-09-15 17:18:47: - Restarting slave rd-082 because it has stalled
2014-09-15 17:18:47: Sending command to slave: StopSlave
2014-09-15 17:18:57: Got reply: rd-082.luma-pictures.com: Sent "StopSlave" command. Result: "timed out while waiting for reply"
2014-09-15 17:19:27: Slave 'rd-082' failed to exit after 30 seconds, killing it...

The same thing happened on another machine that hadn’t been restarted in about a week. This was after a manual task re-queue:

2014-09-15 17:17:12: 0: STDOUT: luma.imagepath : WARNING : Skipping invalid knob: 'streamOpt'
2014-09-15 17:17:12: 0: STDOUT: luma.imagepath : WARNING : Skipping invalid knob: 'rfcCompat'
2014-09-15 17:17:12: 0: STDOUT: Writing xxx.mov took 0.17 seconds
2014-09-15 17:17:12: 0: STDOUT: Writing xxx.mov took 0.15 seconds
2014-09-15 17:19:08: 0: STDOUT: 2014-09-15 17:19:07.006 Nuke[79438:4127] This process is attempting to exclude an item from Time Machine by path without administrator privileges. This is not supported.
2014-09-15 17:19:08: 0: STDOUT: 2014-09-15 17:19:07.006 Nuke[79438:383b] This process is attempting to exclude an item from Time Machine by path without administrator privileges. This is not supported.
2014-09-15 19:05:27: Scheduler Thread - Task "0_1-1" could not be found because task has been modified:
2014-09-15 19:05:27: current status = Rendering, new status = Rendering
2014-09-15 19:05:27: current slave = rd-205-07, new slave = rd-206-07
2014-09-15 19:05:27: current frames = 1-1, new frames = 1-1
2014-09-15 19:05:27: Scheduler Thread - Cancelling task...
2014-09-15 19:05:50: Scheduler Thread - Task "0_1-1" could not be found because task has been modified:
2014-09-15 19:05:50: current status = Rendering, new status = Rendering
2014-09-15 19:05:50: current slave = rd-205-07, new slave = rd-206-07
2014-09-15 19:05:50: current frames = 1-1, new frames = 1-1
2014-09-15 19:05:50: Scheduler Thread - Cancelling task...
2014-09-15 19:06:12: Scheduler Thread - Task "0_1-1" could not be found because task has been modified:
2014-09-15 19:06:12: current status = Rendering, new status = Completed
2014-09-15 19:06:12: current slave = rd-205-07, new slave = rd-206-07
2014-09-15 19:06:12: current frames = 1-1, new frames = 1-1
2014-09-15 19:06:12: Scheduler Thread - Cancelling task...
2014-09-15 19:06:33: Scheduler Thread - Task "0_1-1" could not be found because task has been modified:
2014-09-15 19:06:33: current status = Rendering, new status = Completed
2014-09-15 19:06:33: current slave = rd-205-07, new slave = rd-206-07
2014-09-15 19:06:33: current frames = 1-1, new frames = 1-1
2014-09-15 19:06:33: Scheduler Thread - Cancelling task...

The task seemed to be stuck (which you can see from the timestamp gap in the log after 17:19:08), so I re-queued it. After that, the slave went into the “Cancelling task” loop, where it would likely have stayed indefinitely, except that I shut it down remotely (in this case, the slave process did actually respond to the remote shutdown command, unlike in the example from my previous post). Also, in this case, the rendering process seems to have been terminated by the slave before I stopped the slave process.

Also, I went back and looked at the reports for my second-to-last post (about the rd-082 slave), and that was also a case where I had manually re-queued a stuck task from the Monitor, after which the slave went comatose.

I’m a little worried that Deadline has more than a little to do with these tasks getting stuck in the first place, since this never used to happen with Rush.

Thanks for the additional logs and info. We have this logged as a bug, and hopefully we can reproduce this so that we can figure out where things are getting stuck.

If you happen to come across any additional info regarding this issue, let us know!

Cheers,
Ryan

Which version are you on?

It would be great to add some robustness to the cancellation handling; we find stuck slaves on occasion as well, and the larger the farm, the harder these issues are to notice.
Whatever loop it is that’s printing those messages should probably have a fail-safe exit condition, something like the rough sketch below.
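Purely as pseudocode on my end (this is not Deadline’s actual scheduler code), the shape of the fail-safe I have in mind is a bounded retry:

# Illustrative fail-safe only -- not Deadline's actual scheduler code, just
# the shape of the exit condition being suggested.
import time

MAX_CANCEL_ATTEMPTS = 10      # give up after a bounded number of tries
CANCEL_RETRY_SECONDS = 20     # roughly the retry interval seen in the logs above

def cancel_with_failsafe(try_cancel_task, recover_slave):
    """try_cancel_task() returns True once the task is really gone;
    recover_slave() is whatever recovery step makes sense (e.g. restart the slave)."""
    for _ in range(MAX_CANCEL_ATTEMPTS):
        if try_cancel_task():
            return True
        time.sleep(CANCEL_RETRY_SECONDS)
    # Fail-safe: stop spinning and force a recovery instead of looping forever.
    recover_slave()
    return False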

6.2.0.32

This has been happening more and more frequently under heavy load. I’m also seeing some slightly different behavior now, where I’ll re-queue a task (because the render is stuck), another slave will pick it up, and the original slave will get stuck trying to switch to something else with a loop like this:

Scheduler Thread - Cancelling task...
Scheduler Thread - Task "46_1053-1053" could not be found because task has been modified:
current status = Rendering, new status = Rendering
current slave = rd-027, new slave = ws-033
current frames = 1053-1053, new frames = 1053-1053
Scheduler Thread - Cancelling task...
Scheduler Thread - Task "46_1053-1053" could not be found because task has been modified:
current status = Rendering, new status = Rendering
current slave = rd-027, new slave = ws-033
current frames = 1053-1053, new frames = 1053-1053
Scheduler Thread - Cancelling task...
and on and on indefinitely.

I don’t understand why the slave can’t abort what it’s doing…

We still have not had any luck reproducing this here.

We’re assuming that the reason it can’t abort is that there is a deadlock on the same thread that would check whether the task has been cancelled. This will definitely be addressed in Deadline 8, because there we can simply kill the sandboxed process if it doesn’t cancel in a reasonable amount of time.
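To be clear about the approach in v8, it’s essentially the standard terminate-then-kill pattern for an external process. Roughly something like this generic Python sketch (not the actual Deadline 8 implementation, and the grace period here is just an example value):

# Generic sketch of the "kill the sandbox if it won't cancel in time" idea --
# not the actual Deadline 8 implementation.
import subprocess

def cancel_sandboxed_render(proc, grace_seconds=30):
    """proc is a subprocess.Popen handle to the sandboxed plugin process."""
    proc.terminate()                      # ask the sandbox to shut down cleanly
    try:
        proc.wait(timeout=grace_seconds)  # give it a grace period to exit
    except subprocess.TimeoutExpired:
        proc.kill()                       # it didn't cancel in time, so kill it
        proc.wait()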

However, it would still be nice to try and figure out why this problem is occurring. One way we could do this is to enable debug logging for the plugin that these jobs get stuck on. This can be done by adding the following line to that plugin’s dlinit file:

DebugLogging=True

Note that this will add a lot of additional text to the logs, but if a task gets stuck with debug logging on, the debug output should give us a sense of what’s happening.
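For example, if the stuck jobs are Nuke jobs, the dlinit under plugins/Nuke in your repository would just gain that one extra line; the other entries below are only stand-ins for whatever your file already contains:

About=Nuke plugin for Deadline
ConcurrentTasks=True
DebugLogging=True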

Cheers,
Ryan
