AWS Thinkbox Discussion Forums

stalled detection not catching slaves

Hi there,

On occasion we find a slave freezing up during an IO call in the 3dsmax.py pre-render steps:

Only a manual restart of the slave helped. Pulse never even tried to mark the slave as stalled, for some reason.
I would expect either the slave itself to detect a timeout, or Pulse to eventually mark the slave as stalled. Neither happened.

Last lines of the active log (note: this is customized functionality; these lines are followed by some Directory.Exists calls, which is likely where the freeze happened):

2016-11-19 15:28:31:  0: INFO: Uh oh, this is on the network! Lets copy it local to: "C:\Users\scanlinevfx\AppData\Local\Thinkbox\Deadline8\slave\lapro1874\jobsData\5830dc2903d961224c176600"
2016-11-19 15:28:31:  0: INFO: Getting value of SCL_LOCALIZATION_DISABLE_NFS
2016-11-19 15:28:31:  0: INFO: Value of SCL_LOCALIZATION_DISABLE_NFS = 
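
Since the hang appears to be inside a blocking filesystem check in our customized code, one guard we could add is to run that check in a worker thread with a deadline, so the task fails loudly instead of freezing the render thread forever. This is only a rough sketch; directory_exists_with_timeout and the example UNC path are hypothetical, not part of the Deadline API:

# Run a potentially blocking network directory check in a worker thread and
# give up after a deadline instead of hanging the render thread indefinitely.
import os
import threading

def directory_exists_with_timeout(path, timeout_seconds=30):
    """Return True/False if the check finishes in time, or None on timeout."""
    result = {}

    def worker():
        # os.path.isdir can block for a very long time on an unresponsive NFS/SMB mount.
        result["exists"] = os.path.isdir(path)

    t = threading.Thread(target=worker)
    t.daemon = True  # do not let a hung check keep the process alive on shutdown
    t.start()
    t.join(timeout_seconds)
    if t.is_alive():
        return None  # the filesystem call is still hung; treat it as a failure
    return result.get("exists", False)

# Example: fail the task with a clear error instead of freezing silently.
if directory_exists_with_timeout(r"\\server\share\jobsData", 30) is None:
    raise RuntimeError("Network path check timed out; failing the task instead of hanging")

The hung worker thread still leaks until the call eventually returns, but at least the render thread stays responsive and the task errors out where Deadline can handle it.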

Then later, when the freeze was noticed and the task was manually requeued:

2016-11-20 12:02:08:  BEGIN - LAPRO1874\scanlinevfx
2016-11-20 12:02:08:  Scheduler Thread - Task "44_1019-1019" could not be found because task has been modified:
2016-11-20 12:02:08:      current status = Rendering, new status = Rendering
2016-11-20 12:02:08:      current slave = LAPRO1874, new slave = LAPRO0709
2016-11-20 12:02:08:      current frames = 1019-1019, new frames = 1019-1019
2016-11-20 12:02:08:  Scheduler Thread - Cancelling task...
2016-11-20 12:03:40:  Scheduler Thread - Task "44_1019-1019" could not be found because task has been modified:
2016-11-20 12:03:40:      current status = Rendering, new status = Completed
2016-11-20 12:03:40:      current slave = LAPRO1874, new slave = LAPRO0709
2016-11-20 12:03:40:      current frames = 1019-1019, new frames = 1019-1019
2016-11-20 12:03:40:  Scheduler Thread - Cancelling task...
2016-11-20 12:05:11:  Scheduler Thread - Task "44_1019-1019" could not be found because task has been modified:
2016-11-20 12:05:11:      current status = Rendering, new status = Completed
2016-11-20 12:05:11:      current slave = LAPRO1874, new slave = LAPRO0709
2016-11-20 12:05:11:      current frames = 1019-1019, new frames = 1019-1019
2016-11-20 12:05:11:  Scheduler Thread - Cancelling task...
2016-11-20 12:06:41:  Scheduler Thread - Task "44_1019-1019" could not be found because task has been modified:
2016-11-20 12:06:41:      current status = Rendering, new status = Completed
2016-11-20 12:06:41:      current slave = LAPRO1874, new slave = LAPRO0709
2016-11-20 12:06:41:      current frames = 1019-1019, new frames = 1019-1019
2016-11-20 12:06:41:  Scheduler Thread - Cancelling task...
2016-11-20 12:08:13:  Scheduler Thread - Task "44_1019-1019" could not be found because task has been modified:
2016-11-20 12:08:13:      current status = Rendering, new status = Completed
2016-11-20 12:08:13:      current slave = LAPRO1874, new slave = LAPRO0709
2016-11-20 12:08:13:      current frames = 1019-1019, new frames = 1019-1019
2016-11-20 12:08:13:  Scheduler Thread - Cancelling task...
2016-11-20 12:09:47:  Scheduler Thread - Task "44_1019-1019" could not be found because task has been modified:
2016-11-20 12:09:47:      current status = Rendering, new status = Completed
2016-11-20 12:09:47:      current slave = LAPRO1874, new slave = LAPRO0709
2016-11-20 12:09:47:      current frames = 1019-1019, new frames = 1019-1019
2016-11-20 12:09:47:  Scheduler Thread - Cancelling task...
2016-11-20 12:11:21:  Scheduler Thread - Task "44_1019-1019" could not be found because task has been modified:
2016-11-20 12:11:21:      current status = Rendering, new status = Completed
2016-11-20 12:11:21:      current slave = LAPRO1874, new slave = LAPRO0709
2016-11-20 12:11:21:      current frames = 1019-1019, new frames = 1019-1019
2016-11-20 12:11:21:  Scheduler Thread - Cancelling task...
2016-11-20 12:12:54:  Scheduler Thread - Task "44_1019-1019" could not be found because task has been modified:
2016-11-20 12:12:54:      current status = Rendering, new status = Completed
2016-11-20 12:12:54:      current slave = LAPRO1874, new slave = LAPRO0709
2016-11-20 12:12:54:      current frames = 1019-1019, new frames = 1019-1019
2016-11-20 12:12:54:  Scheduler Thread - Cancelling task...
2016-11-20 12:14:27:  Scheduler Thread - Task "44_1019-1019" could not be found because task has been modified:
2016-11-20 12:14:27:      current status = Rendering, new status = Completed
2016-11-20 12:14:27:      current slave = LAPRO1874, new slave = LAPRO0709
2016-11-20 12:14:27:      current frames = 1019-1019, new frames = 1019-1019
2016-11-20 12:14:27:  Scheduler Thread - Cancelling task...
2016-11-20 12:16:00:  Scheduler Thread - Task "44_1019-1019" could not be found because task has been modified:
2016-11-20 12:16:00:      current status = Rendering, new status = Completed
2016-11-20 12:16:00:      current slave = LAPRO1874, new slave = LAPRO0709
2016-11-20 12:16:00:      current frames = 1019-1019, new frames = 1019-1019
2016-11-20 12:16:00:  Scheduler Thread - Cancelling task...

Eventually the slave process was manually killed and restarted:

2016-11-20 12:20:32:  BEGIN - LAPRO1874\scanlinevfx
2016-11-20 12:20:32:  Deadline Slave 8.0 [v8.0.10.4 Release (c19fd2cef)]
2016-11-20 12:20:36:  Auto Configuration: A ruleset has been received from Pulse
2016-11-20 12:20:37:  Plugin sandboxing will not be used because it is disabled in the Repository Options.
2016-11-20 12:20:37:  Info Thread - Created.
2016-11-20 12:20:37:  Slave 'LAPRO1874' has stalled because of an unexpected exit. Performing house cleaning...

Any advice on how we could avoid having tasks stuck like that until someone notices? Maybe some timeout settings are not properly configured?
The job’s max task render time limit was set to 40 minutes, and the global 3dsmax timeouts are:
Loading 3dsmax: 1000 seconds
Starting job: 3600 seconds
Progress Updates: 8000 seconds

This is with a mixed 8.0.5 (pulse) / 8.0.10 (slaves) environment.
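
In the meantime, a crude external watchdog on slave log activity would at least surface these hangs sooner than waiting for someone to notice by eye. A rough sketch only; the log directory and threshold below are assumptions for our setup, not Deadline defaults:

# Flag a slave as possibly stalled if its newest log file has not been written
# to for longer than a threshold, then alert or restart the slave service.
import os
import time

LOG_DIR = r"C:\ProgramData\Thinkbox\Deadline8\logs"  # assumed location; adjust per deployment
STALL_SECONDS = 45 * 60                              # a bit over the 40 minute task limit

def newest_mtime(directory):
    # Modification time of the most recently touched file in the directory.
    paths = [os.path.join(directory, name) for name in os.listdir(directory)]
    files = [p for p in paths if os.path.isfile(p)]
    return max(os.path.getmtime(p) for p in files) if files else 0

if time.time() - newest_mtime(LOG_DIR) > STALL_SECONDS:
    print("Slave looks stalled: no log activity for over %d minutes" % (STALL_SECONDS // 60))
    # e.g. restart the slave service here, or page the wrangling team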

Hey Laszlo,

If you lower the task render time limit, does the task ever time out? Or is it permanently stuck until it is manually requeued?

It seemed permanently stuck in this case.

Hmm, I wonder if this is because sandboxing is disabled. There may be some unhandled exception that is crashing the slave’s render thread, leaving the slave in a dead state where it picks up no jobs. Are you able to cancel the task the slave is working on from the slave itself? Or is the issue only corrected by shutting down/killing the slave?

I can’t try on this job anymore (it’s been requeued/restarted), but if I get another one, I’ll try to kill it from the slave itself. I don’t think it was able to kill its own thread, though; the middle log I posted tells me that the slave noticed it should not be working on the task anymore, yet it could not kill it.

I’m tempted to re-enable sandboxing once your new build this week is rolled out and is solid. Maybe the reasons we disabled it are no longer relevant in the latest builds.

I confirmed on another job: cancelling through the slave GUI doesn’t do anything.

Yeah, I suspect something is happening to that render thread then. Sandboxing would likely take care of these kinds of issues, but we will look into this on our end as well.
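
To illustrate why sandboxing helps with this failure mode: a thread stuck inside a blocking native call cannot be forcibly killed from within the same process, but a render running in a separate child process can simply be terminated and the slave can move on. A generic illustration only, not Deadline’s actual sandboxing code:

# A child process stuck in a blocking call can be terminated from outside;
# there is no equivalent for a hung thread inside the slave process itself.
import multiprocessing
import time

def hung_render():
    time.sleep(10000)  # stand-in for a filesystem call that never returns

if __name__ == "__main__":
    p = multiprocessing.Process(target=hung_render)
    p.start()
    p.join(timeout=5)   # give the "render" five seconds in this toy example
    if p.is_alive():
        p.terminate()   # possible for a process; not possible for a hung thread
        p.join()
        print("Hung render process killed; the slave could pick up the next task")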
