
machines hanging

Found about 60 slaves with the following error:

Scheduler Thread - Task “23_1046-1046” could not be found because task has been modified:
current status = Rendering, new status = Queued
current slave = LAPRO0408, new slave =
current frames = 1046-1046, new frames = 1046-1046
Scheduler Thread - Cancelling task…
Scheduler Thread - Task “23_1046-1046” could not be found because task has been modified:
current status = Rendering, new status = Queued
current slave = LAPRO0408, new slave =
current frames = 1046-1046, new frames = 1046-1046
Scheduler Thread - Cancelling task…
Scheduler Thread - Task “23_1046-1046” could not be found because task has been modified:
current status = Rendering, new status = Queued
current slave = LAPRO0408, new slave =
current frames = 1046-1046, new frames = 1046-1046
Scheduler Thread - Cancelling task…
---- 2013/12/03 17:26 ----
Scheduler Thread - Task “23_1046-1046” could not be found because task has been modified:
current status = Rendering, new status = Queued
current slave = LAPRO0408, new slave =
current frames = 1046-1046, new frames = 1046-1046
Scheduler Thread - Cancelling task…
Scheduler Thread - Task “23_1046-1046” could not be found because task has been modified:
current status = Rendering, new status = Queued
current slave = LAPRO0408, new slave =
current frames = 1046-1046, new frames = 1046-1046
Scheduler Thread - Cancelling task…
Connecting to slave log: LAPRO0408

When trying to restart the slave, they do this:

2013-12-03 17:36:12: Waiting for threads to quit. 2 seconds until forced shutdown.
2013-12-03 17:36:12: 0: Shutdown
2013-12-03 17:36:13: 0: Shutdown
2013-12-03 17:36:13: 0: Shutdown
2013-12-03 17:36:13: 0: Shutdown
2013-12-03 17:36:13: 0: Shutdown
2013-12-03 17:36:13: Waiting for threads to quit. 1 seconds until forced shutdown.
2013-12-03 17:36:14: 0: Shutdown
2013-12-03 17:36:14: 0: Shutdown
2013-12-03 17:36:14: 0: Shutdown
2013-12-03 17:36:14: 0: Shutdown
2013-12-03 17:36:14: Waiting for threads to quit. 0 seconds until forced shutdown.
2013-12-03 17:36:14: Info Thread: Stopped
2013-12-03 17:36:14: Scheduler Thread: ShuttingDown / Waiting
2013-12-03 17:36:14: Render Threads: StartJob
2013-12-03 17:36:14: Forcing shutdown.
2013-12-03 17:36:14: Exception Details
2013-12-03 17:36:14: Exception – One or more threads failed to quit successfully.
2013-12-03 17:36:14: Exception.Data: ( )
2013-12-03 17:36:14: Exception.StackTrace:
2013-12-03 17:36:14: (null)
2013-12-03 17:36:14: Slave - slave shutdown: forced
2013-12-03 17:36:14: Scheduler Thread - Task “25_1026-1026” could not be found because task has been modified:
2013-12-03 17:36:14: current status = Rendering, new status = Completed
2013-12-03 17:36:14: current slave = LAPRO0516, new slave = LAPRO0481
2013-12-03 17:36:14: current frames = 1026-1026, new frames = 1026-1026
2013-12-03 17:36:14: Scheduler Thread - Cancelling task…

And the process remains, never quitting. We have to manually kill the 3dsmax.exe process or reboot the machine.

This problem is still happening, by the way.

2014-08-19 07:09:00: 0: PYTHON: Success!
2014-08-19 07:09:00: 0: INFO: JobPreLoad.main version: 1.57
2014-08-19 07:09:00: 0: INFO: RUN SCENEJANITOR
2014-08-19 07:49:04: 0: Slave timed out – canceling current task…
2014-08-19 09:38:20: Scheduler Thread - Task “92_1072-1072” could not be found because task has been modified:
2014-08-19 09:38:20: current status = Rendering, new status = Suspended
2014-08-19 09:38:20: current slave = LAPRO0874, new slave =
2014-08-19 09:38:20: current frames = 1072-1072, new frames = 1072-1072
2014-08-19 09:38:20: Scheduler Thread - Cancelling task…
2014-08-19 09:38:41: Scheduler Thread - Task “92_1072-1072” could not be found because task has been modified:
2014-08-19 09:38:41: current status = Rendering, new status = Suspended
2014-08-19 09:38:41: current slave = LAPRO0874, new slave =
2014-08-19 09:38:41: current frames = 1072-1072, new frames = 1072-1072
2014-08-19 09:38:41: Scheduler Thread - Cancelling task…
2014-08-19 09:39:05: Scheduler Thread - Task “92_1072-1072” could not be found because task has been modified:
2014-08-19 09:39:05: current status = Rendering, new status = Suspended
2014-08-19 09:39:05: current slave = LAPRO0874, new slave =

A subprocess hung and Deadline is unable to handle that. The machine has been hanging for a day now.

In the job preload script we have these lines:

def __main__(deadlinePlugin):
    deadlinePlugin.LogInfo('JobPreLoad.__main__ version: %s' % __version__)

    deadlinePlugin.LogInfo('RUN SCENEJANITOR')
    # Blocks until the janitor batch file returns -- this is the call that never comes back.
    os.system(os.path.join(Paths.pythonScripts(), 'drive_janitor\\clean_render_node_drives.bat').replace('/', os.path.sep).replace('\\', os.path.sep))
    deadlinePlugin.LogInfo('Done, wrangling env variables')
    unsetEnvironmentVariables(deadlinePlugin)

As you can see, it appears to have hung on the os.system call, never returning from it, and Deadline can't kill it.

In theory, this is something we would be able to detect in version 8 when we sandbox the python environment, since the python code will now be running as a separate process. In fact, this will likely just be caught by the task timeout functionality.

For now though, rather than run os.system, could you use subprocess instead? Then you could give the process a certain amount of time to run before killing it and moving on.

Yeah, I'm refactoring the code now. It probably isn't sufficient to just have a wait; I should probably spawn a subprocess and do my own timeout handling, correct?

Adding a wait might just block the code at a later point?

Yeah, that’s what I was thinking.
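
For what it's worth, here's a minimal sketch of the pattern I'm converging on: launch the janitor via subprocess and kill it ourselves if it overruns. The helper name and the timeout values are placeholders, not anything from the actual pipeline:

import subprocess
import time

def run_with_timeout(cmd, timeout_seconds=600):
    # Hypothetical helper: run an external command and kill it if it exceeds
    # the timeout. Returns the exit code, or None if the process was killed.
    proc = subprocess.Popen(cmd, shell=True)
    give_up_at = time.time() + timeout_seconds
    while proc.poll() is None:
        if time.time() > give_up_at:
            proc.kill()   # force-terminate the hung child
            proc.wait()   # reap it so nothing is left dangling
            return None
        time.sleep(1.0)
    return proc.returncode

In the preload that would look something like:

janitor = os.path.join(Paths.pythonScripts(), 'drive_janitor', 'clean_render_node_drives.bat')
if run_with_timeout(janitor, timeout_seconds=600) is None:
    deadlinePlugin.LogInfo('SCENEJANITOR timed out and was killed, continuing anyway')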

This is happening very often, by the way, on various jobs (Nuke, Maya, Max, etc.). Any subprocess we start can basically hang up Deadline indefinitely, ultimately making Deadline unreliable as a render manager, since it can't really manage the jobs it's starting.

That encapsulated Python environment is more important, I think, than you guys realize :frowning:

Anyway, this one wasn't a subprocess; all of this happened within Nuke:

2014-08-20 16:11:11:  0: STDOUT: Wed Aug 20 16:11:10 2014 (+0ms)     :(localizeReads): Nuke localizer version: 0.52 : Localizing nodes...
2014-08-20 16:11:11:  0: STDOUT: removing localizePathFilter filter
2014-08-20 16:11:11:  0: STDOUT: Wed Aug 20 16:11:10 2014 (+0ms)     :(localizeReads): Frame Range: 1060-1063, 1
2014-08-20 16:11:11:  0: STDOUT: Wed Aug 20 16:11:10 2014 (+0ms)     :(localizeReads): Root Frame: 1067
2014-08-20 16:11:11:  0: STDOUT: Wed Aug 20 16:11:10 2014 (+0ms)     :(localizeReads): Nuke localizer version: calling localizeNodes()
2014-08-20 16:11:11:  0: STDOUT: WritePath found: //inferno2/projects/gold/scenes/SHR_shr_rsrc/2d/slapcomps/SHR_shr_rsrc_2d_slapcomps_vehbrigeaglea_v0005/linear_2112x1188x1/SHR_shr_rsrc_2d_slapcomps_vehbrigeaglea_v0005.1060.exr
2014-08-20 16:11:11:  0: STDOUT: WritePath found: //inferno2/projects/gold/scenes/SHR_shr_rsrc/2d/slapcomps/SHR_shr_rsrc_2d_slapcomps_vehbrigeaglea_v0005/linear_2112x1188x1/SHR_shr_rsrc_2d_slapcomps_vehbrigeaglea_v0005.1061.exr
2014-08-20 16:11:11:  0: STDOUT: WritePath found: //inferno2/projects/gold/scenes/SHR_shr_rsrc/2d/slapcomps/SHR_shr_rsrc_2d_slapcomps_vehbrigeaglea_v0005/linear_2112x1188x1/SHR_shr_rsrc_2d_slapcomps_vehbrigeaglea_v0005.1062.exr
2014-08-20 16:11:11:  0: STDOUT: WritePath found: //inferno2/projects/gold/scenes/SHR_shr_rsrc/2d/slapcomps/SHR_shr_rsrc_2d_slapcomps_vehbrigeaglea_v0005/linear_2112x1188x1/SHR_shr_rsrc_2d_slapcomps_vehbrigeaglea_v0005.1063.exr
2014-08-20 16:26:59:  0: Slave timed out -- canceling current task...
2014-08-20 16:47:05:  Scheduler Thread - Task "15_1060-1063" could not be found because task has been modified:
2014-08-20 16:47:05:      current status = Rendering, new status = Failed
2014-08-20 16:47:05:      current slave = LAPRO1344-secondary, new slave = 
2014-08-20 16:47:05:      current frames = 1060-1063, new frames = 1060-1063
2014-08-20 16:47:05:  Scheduler Thread - Cancelling task...
2014-08-20 16:47:29:  Scheduler Thread - Task "15_1060-1063" could not be found because task has been modified:
2014-08-20 16:47:29:      current status = Rendering, new status = Failed
2014-08-20 16:47:29:      current slave = LAPRO1344-secondary, new slave = 
2014-08-20 16:47:29:      current frames = 1060-1063, new frames = 1060-1063
2014-08-20 16:47:29:  Scheduler Thread - Cancelling task...
2014-08-20 16:47:53:  Scheduler Thread - Task "15_1060-1063" could not be found because task has been modified:
2014-08-20 16:47:53:      current status = Rendering, new status = Failed
2014-08-20 16:47:53:      current slave = LAPRO1344-secondary, new slave = 
2014-08-20 16:47:53:      current frames = 1060-1063, new frames = 1060-1063
2014-08-20 16:47:53:  Scheduler Thread - Cancelling task...
2014-08-20 16:48:15:  Scheduler Thread - Task "15_1060-1063" could not be found because task has been modified:
2014-08-20 16:48:15:      current status = Rendering, new status = Failed
2014-08-20 16:48:15:      current slave = LAPRO1344-secondary, new slave = 
2014-08-20 16:48:15:      current frames = 1060-1063, new frames = 1060-1063

The machine had been hanging for hours before anyone noticed.

It’s our top priority for version 8, and we’ve already started the design phase for it. Unfortunately, this is a MAJOR change, so it’s not something we can quickly squeeze into v7. It even has us considering a refactoring of the Deadline plugin architecture.

We completely understand that this is a major inconvenience, but we have to implement a change like this properly, otherwise we risk destabilizing things even more.

Cheers,
Ryan

Understood…

Sadly, we now have to maintain scripts like this (to reload modules constantly):

import scl.pipeline.scanPath as scanPath
reload(scanPath)
import scl.nuke.publishUtils as publishUtils
reload(publishUtils)
import scl.shotgun.createVersion as createVersion
reload(createVersion)
import scl.nuke.makeProxyQuicktime as makeProxyQuicktime
reload(makeProxyQuicktime)
from scl.pipeline import scLog
reload(scLog)
import scanline.Paths as Paths
reload(Paths)
import scl.nuke.getSequence as getSequence
reload(getSequence)
import scl.shotgun.getsettings as getsettings
reload(getsettings)
import shootBotLib2
reload(shootBotLib2)
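
If nothing else, that boilerplate could probably be collapsed into a loop. A rough sketch, assuming every module can be addressed by its dotted name (importlib is in the standard library on Python 2.7; the list name is just illustrative):

import importlib

# Rough equivalent of the block above: re-import each pipeline module by its
# dotted name so stale definitions from the previous job don't leak into this one.
PIPELINE_MODULES = [
    'scl.pipeline.scanPath',
    'scl.nuke.publishUtils',
    'scl.shotgun.createVersion',
    'scl.nuke.makeProxyQuicktime',
    'scl.pipeline.scLog',
    'scanline.Paths',
    'scl.nuke.getSequence',
    'scl.shotgun.getsettings',
    'shootBotLib2',
]

for name in PIPELINE_MODULES:
    reload(importlib.import_module(name))  # Python 2 builtin reload()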

And scripts like this (to clear leftover environment variables from previous jobs):

import os
import re

def unsetEnvironmentVariables(deadlinePlugin):
    deadlinePlugin.LogInfo('JobPreLoad.unsetEnvironmentVariables')
    # Strip any scl_config_* variables left over from the previous job on this slave.
    oSclConfigRegex = re.compile(r'^(scl_config_)(.*?)$', re.IGNORECASE)
    for key in list(os.environ.keys()):
        if oSclConfigRegex.match(key):
            deadlinePlugin.LogInfo('Delete: %s' % key)
            del os.environ[key]
    # A PYTHONPATH left behind by a previous job can poison the next one, so drop it too.
    if 'PYTHONPATH' in os.environ:
        del os.environ['PYTHONPATH']

Band-aids all over the place… :frowning:

Imagine you used 3ds Max but couldn't shut it down. You opened it on Monday and kept it running for 2-3 weeks straight, working on various shows, scenes, and versions in the same session. And you couldn't reset your scene; instead you would just delete all the objects and then merge in the next scene. It would leave the material editor entries, global variables, script configs, etc. behind. So much sadness.
