AWS Thinkbox Discussion Forums

deadline slave crashes

This is with the release version. Sometimes the slave just crashes (on about 1% of the slaves working on a job).


Microsoft Visual C++ Runtime Library

Runtime Error!

Program: C:\Program Files\Thinkbox\Deadline6\bin\deadlineslave.exe

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application’s support team for more information.


OK

From the log:

2013-07-16 10:39:56: Scheduler - Successfully dequeued 1 task(s). Returning.
2013-07-16 10:39:56: Scheduler Thread - Synchronizing job auxiliary files from \inferno2\deadline\repository6\jobs\51e585498b1b0f1c28fb360e
2013-07-16 10:39:56: Scheduler Thread - Synchronization time for job files: 15.625 ms
2013-07-16 10:39:56: Scheduler Thread - Synchronizing plugin files from \inferno2\deadline\repository6\plugins\3dsmax
2013-07-16 10:39:56: Scheduler Thread - Synchronization time for plugin files: 343.750 ms
2013-07-16 10:39:57: 0: Got task!
2013-07-16 10:39:57: 0: Plugin will be reloaded because a new job has been loaded, or one of the job files or plugin files has been modified
2013-07-16 10:39:57: Constructor: 3dsmax
2013-07-16 10:39:57: 0: Loaded plugin: 3dsmax
2013-07-16 10:39:57: 0: Task timeout is disabled.
2013-07-16 10:39:57: 0: Loaded job: deadlineConfigTest (51e585498b1b0f1c28fb360e)
2013-07-16 10:39:57: 0: INFO: Executing plugin script C:\Documents and Settings\ScanlineVFX\Local Settings\Application Data\Thinkbox\Deadline6\slave\LAPRO0306\plugins\3dsmax.py
2013-07-16 10:39:57: 0: INFO: About: 3dsmax Plugin for Deadline
2013-07-16 10:39:57: 0: INFO: The job’s environment will be merged with the current environment before rendering
2013-07-16 10:39:57: 0: INFO: Executing job preload script C:\Documents and Settings\ScanlineVFX\Local Settings\Application Data\Thinkbox\Deadline6\slave\LAPRO0306\plugins\JobPreLoad.py
2013-07-16 10:39:57: 0: INFO: JobPreLoad.main

That's the last active line.

On a task that did not crash, we get this:

0: INFO: About: 3dsmax Plugin for Deadline
0: INFO: The job’s environment will be merged with the current environment before rendering
0: INFO: Executing job preload script C:\Documents and Settings\ScanlineVFX\Local Settings\Application Data\Thinkbox\Deadline6\slave\LAPRO0309\plugins\JobPreLoad.py
0: INFO: JobPreLoad.main
0: INFO: JobPreLoad.unsetEnvironmentVariables
0: WARNING: No Scanline 3ds Max config defined for this job. It will be rendered without a Scanline 3ds Max config.
0: INFO: Start Job called - starting up 3dsmax plugin
0: INFO: Rendering with 3dsmax version: 2012
0: INFO: Build of 3dsmax to force: 64bit

The corresponding lines of the script (I removed 90% of the script, but these are the unchanged first lines, where the crash seems to happen):

[code]import os
import re
import subprocess

from System.IO import *

from scanline import Paths

def unsetEnvironmentVariables(deadlinePlugin):
    deadlinePlugin.LogInfo('JobPreLoad.unsetEnvironmentVariables')
    # … removed stuff from here …

def main(deadlinePlugin):
    deadlinePlugin.LogInfo('JobPreLoad.main')
    unsetEnvironmentVariables(deadlinePlugin)
    # … removed stuff from here …[/code]

Interesting. If you take out the LogInfo lines, does it still crash? I wonder if maybe there is a race condition or something with the logging…

It's not 100% reproducible; it happens on only around 1% of the machines. But I'll try removing them and see what happens.

I have a very similar case that I can reproduce 100% of the time.

It's a crash in the post task script that somehow crashes the slave as well. It's very odd, though: I have to remote into the crashed slave manually, but once I click OK on the Visual C++ Runtime Library error, the slave carries on functioning just fine (it does fail the task, though).

I'll try to figure out why it's not finding that Python library (it's in a site-packages folder that should be available), but in any case, it probably should not hard-crash the slave and require a manual fix :\

Error in RenderTasks: Post task script “\inferno2\projects\common\pipeline\submission_queue\2013_07_18\TST_000_0000_v0045_lse_test_images_render3d_elementTest_155612471\returnFrames.py”: Python Exception: ImportError : No module named MySQLdb (Python.Runtime.PythonException)
Type: <type ‘exceptions.ImportError’>
Value: No module named MySQLdb
Stack Trace:
[’ File “none”, line 197, in main\n’, ’ File “none”, line 54, in submitFiles\n’, ’ File “//s2/exchange/software/managed/pythonScripts/site-packages\scanline\QueueSubmitter.py”, line 19, in \n import scanline.QueueConnection\n’, ’ File “//s2/exchange/software/managed/pythonScripts/site-packages\scanline\QueueConnection.py”, line 9, in \n import MySQLdb\n’]
(System.Exception)
at Deadline.Plugins.ScriptPlugin.RenderTasks(String taskId, Int32 startFrame, Int32 endFrame, String& outMessage, AbortLevel& abortLevel)
at Deadline.Plugins.ScriptPlugin.RenderTasks(String taskId, Int32 startFrame, Int32 endFrame, String& outMessage, AbortLevel& abortLevel)
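One way to narrow down an ImportError like the one above is to wrap the import and report the search path the interpreter actually used. The following is only a minimal sketch: `try_import` is a hypothetical helper, not part of Deadline or of the scripts quoted in this thread, and it only shows where Python looked, not why the folder is missing.

```python
import sys
import traceback

def try_import(module_name):
    """Import module_name; on failure return (None, report) instead of raising."""
    try:
        __import__(module_name)
        return sys.modules[module_name], ""
    except ImportError:
        # Collect the traceback and the module search path for the log.
        report = "Could not import %s\n%s\nsys.path:\n  %s" % (
            module_name, traceback.format_exc(), "\n  ".join(sys.path))
        return None, report
```

Calling `try_import("MySQLdb")` at the top of a job script and logging the returned report would show whether the expected site-packages folder is on `sys.path` inside the slave process.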

Thanks for reporting this! Job scripts should never bring down the slave! We’ve logged it as a bug.

I think it might be a generic problem with the Python engine.

Looking at our Deadline farm right now (it's currently idle), 5% of the machines are hung with similar errors (although there is no Visual C++ exception window, just a fully white GUI for the slave):

slave#314: been hanging for ~3 days with this line:
0: INFO: Executing plugin script C:\Documents and Settings\ScanlineVFX\Local Settings\Application Data\Thinkbox\Deadline6\slave\LAPRO0314\plugins\3dsmax.py

slave#332: been hanging for ~15 hrs with this line:
0: INFO: Executing plugin script C:\Users\scanlinevfx\AppData\Local\Thinkbox\Deadline6\slave\LAPRO0332\plugins\Python.py

I've attached a debugger to one of the slaves that hang with a white screen and whose last log line comes from Python. Without debug info it's hard to tell what's happening, but it seems to be very active within python26.dll in one of the worker threads:

python26.dll!000000001e039e70()
[Frames below may be incorrect and/or missing, no symbols loaded for python26.dll]
python26.dll!000000001e03a734()
python26.dll!000000001e03b12f()
python26.dll!000000001e03b162()
python26.dll!000000001e0a8b51()
python26.dll!000000001e0fa9d9()
python26.dll!000000001e0fbea7()
python26.dll!000000001e0f5bb9()
python26.dll!000000001e0f7534()
python26.dll!000000001e0fa85b()
python26.dll!000000001e0fbea7()
python26.dll!000000001e0fbf79()
python26.dll!000000001e10a8e0()
clr.dll!000006447f1017c7()
00000644801fb0c3()
000000001c4d9750()
00000644801faffc()
0000000004462060()
00000644801bb518()
000078616d736433()
000000001fd89a80()
000000001dc1ec50()
00005482391e4ec3()

Definitely looks like there is a deadlock in the python engine. Would it be possible to run the slave on, say, 10 machines without the user interface, and check to see if those 10 slaves ever deadlock? You can run the slave on the machines like this:

"%DEADLINE_PATH%\deadlineslave.exe" -nogui

The user interface uses Python as well, so I’m curious to see if things stabilize by taking the UI out of the equation. Whether it does or not, that definitely helps narrow the scope of the problem.

Cheers,

- Ryan

I am digging deeper into this and am now monitoring the whole render process through a remote debugger. It seems that right before the crash, when Python imports:

c:\program files\thinkbox\deadline6\bin\dlls_CTYPES.PYD

it then loads a different msvcr90.dll, not the one used by python26.dll.

_ctypes.pyd imports: 9.0.21022.8 (an old DLL that another tool of ours still uses, and which can be found through the PATH environment variable)

while python26.dll imports: 9.0.30729.4148

I have a feeling it's somehow related to this.
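A quick way to see which copies of a DLL are reachable through PATH is to walk the PATH entries in order. This is a minimal sketch: `find_on_path` is a hypothetical helper, and it only lists PATH candidates. The Windows loader also consults the application directory, the system directories, and side-by-side assemblies, so the first hit here is not necessarily the copy that gets loaded.

```python
import os

def find_on_path(filename, path_value=None):
    """Return every PATH entry that contains `filename`, in search order."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    hits = []
    for directory in path_value.split(os.pathsep):
        directory = directory.strip('"')  # PATH entries are sometimes quoted
        if not directory:
            continue
        candidate = os.path.join(directory, filename)
        if os.path.isfile(candidate):
            hits.append(candidate)
    return hits
```

Running `find_on_path("msvcr90.dll")` inside the slave's environment should list each directory that exposes a copy, which makes a stray old runtime easier to spot.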

I'll try this!

Interesting point about msvcr90.dll being loaded. If that is related, that’s a scenario we can probably set up here to try and reproduce.

It's definitely the mismatched msvcr90.dll. Force-removing the folder from the environment variables in the post task scripts seems to fix the Visual C++ crashes!
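For anyone hitting the same thing, the workaround amounts to filtering one directory out of PATH before anything that needs the C runtime gets imported. This is a minimal sketch, not the actual script used here: `remove_from_path` and the directory in the usage comment are hypothetical.

```python
import os

def remove_from_path(path_value, unwanted_dir):
    """Return path_value with every entry equal to unwanted_dir removed."""
    def norm(p):
        # Ignore quoting, trailing separators and (on Windows) letter case.
        return os.path.normcase(os.path.normpath(p.strip('"')))
    target = norm(unwanted_dir)
    kept = [entry for entry in path_value.split(os.pathsep)
            if entry and norm(entry) != target]
    return os.pathsep.join(kept)

# Usage (the directory below is a placeholder, not the real tool location):
# os.environ["PATH"] = remove_from_path(os.environ.get("PATH", ""),
#                                       r"C:\path\to\old_tool\bin")
```

Changing `os.environ["PATH"]` only affects the current process and its children, so this has to run before the offending DLL is loaded.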

Note that the behavior in this case and in the other crashes (where the GUI hung) is slightly different. Sadly, I had already fixed all the slaves that were hanging, but if I find another, I'll check which modules it has loaded to see whether the offending msvcr90.dll is there as well.

Sounds like a plan! Keep us posted.

For now I have added two new configs to our Python sitecustomize setup (which is triggered by the CPython application name):

dpython.exe
deadlineslave.exe

While still defining our standard 2.6 64-bit Python libraries, they now also remove this particular directory from the PATH environment variable. It's very odd that it's loading that DLL, though, as Deadline's bin folder is also in the PATH variable. Oh DLL hell, what would life be without you.
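The per-application dispatch could look roughly like the sketch below. The actual sitecustomize setup is not shown in this thread, so the table, the config names, and the assumption that `sys.executable` reflects the host application are all hypothetical.

```python
import sys

# Hypothetical table: lowercase executable name -> name of the config to apply.
APP_CONFIGS = {
    "dpython.exe": "deadline",
    "deadlineslave.exe": "deadline",
}

def config_for(executable, table=APP_CONFIGS, default="standard"):
    """Pick a config name from the file name of the running executable."""
    # Split on both separators so Windows paths are handled on any platform.
    name = executable.replace("\\", "/").rsplit("/", 1)[-1].lower()
    return table.get(name, default)

if __name__ == "__main__":
    # In a real sitecustomize.py this lookup would run at interpreter startup.
    print(config_for(sys.executable or ""))
```

Keeping the lookup in a plain dictionary means adding another host application is a one-line change.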
