Deadline 5 - crash during render

One slave appears to have gone down in flames at some point today, leaving this on the terminal. Mono 2.10.2 is installed:

0: STDOUT: Rendering frame 29, segment 1/1, pass 1/1.
Stacktrace:

  at (wrapper managed-to-native) System.Diagnostics.Process.CreateProcess_internal (System.Diagnostics.ProcessStartInfo,intptr,intptr,intptr,System.Diagnostics.Process/ProcInfo&) <0xffffffff>
  at System.Diagnostics.Process.Start_noshell (System.Diagnostics.ProcessStartInfo,System.Diagnostics.Process) <0x008d3>
  at System.Diagnostics.Process.Start_common (System.Diagnostics.ProcessStartInfo,System.Diagnostics.Process) <0x00103>
  at System.Diagnostics.Process.Start () <0x00049>
  at (wrapper remoting-invoke-with-check) System.Diagnostics.Process.Start () <0xffffffff>
  at FranticX.Management.ProcessInfo.GetRAMUsage () <0x003f5>
  at Deadline.Plugins.ScriptPluginMonitor.ReadRamValue () <0x001cf>
  at Deadline.Plugins.ScriptPluginMonitor.m_ramMonitorTimer_Elapsed (object,System.Timers.ElapsedEventArgs) <0x00011>
  at System.Timers.Timer.Callback (object) <0x0037b>
  at System.Threading.Timer/Scheduler.TimerCB (object) <0x0012b>
  at (wrapper runtime-invoke) <Module>.runtime_invoke_void__this___object (object,intptr,intptr,intptr) <0xffffffff>

Native stacktrace:

	0   DeadlineSlave                       0x000b6ea9 0x0 + 749225
	1   DeadlineSlave                       0x00006e0e 0x0 + 28174
	2   libSystem.B.dylib                   0x97e5b05b _sigtramp + 43
	3   ???                                 0xffffffff 0x0 + 4294967295
	4   DeadlineSlave                       0x0021f478 0x0 + 2225272
	5   DeadlineSlave                       0x0021fcf5 0x0 + 2227445
	6   DeadlineSlave                       0x001c6376 0x0 + 1860470
	7   ???                                 0x03140818 0x0 + 51644440
	8   ???                                 0x0313f874 0x0 + 51640436
	9   ???                                 0x0313ee14 0x0 + 51637780
	10  ???                                 0x0313ecea 0x0 + 51637482
	11  ???                                 0x0313ec6e 0x0 + 51637358
	12  ???                                 0x03dfca6e 0x0 + 64997998
	13  ???                                 0x03dff200 0x0 + 65008128
	14  ???                                 0x03dfeffa 0x0 + 65007610
	15  ???                                 0x02fd0a84 0x0 + 50137732
	16  ???                                 0x02fd053c 0x0 + 50136380
	17  ???                                 0x0077ac03 0x0 + 7842819
	18  DeadlineSlave                       0x000112c4 0x0 + 70340
	19  DeadlineSlave                       0x001bd83c 0x0 + 1824828
	20  DeadlineSlave                       0x001bed91 0x0 + 1830289
	21  DeadlineSlave                       0x001f5589 0x0 + 2053513
	22  DeadlineSlave                       0x001f7a4e 0x0 + 2062926
	23  DeadlineSlave                       0x001f983a 0x0 + 2070586
	24  DeadlineSlave                       0x001f98da 0x0 + 2070746
	25  DeadlineSlave                       0x0023796b 0x0 + 2324843
	26  DeadlineSlave                       0x00268e87 0x0 + 2526855
	27  libSystem.B.dylib                   0x97e22259 _pthread_start + 345
	28  libSystem.B.dylib                   0x97e220de thread_start + 34

Debug info from gdb:

/usr/bin/gdb: fork: Resource temporarily unavailable
There was an error executing 'arch(1)'; assuming 'i386'.
/tmp/mono-gdb-commands.6QKwqn:1: Error in sourced command file:
unable to debug self

=================================================================
Got a SIGSEGV while executing native code. This usually indicates
a fatal error in the mono runtime or one of the native libraries 
used by your application.
=================================================================

The log on the node only contains the following just before the failure.

2011-06-20 12:53:29:  0: STDOUT: LightWave command: wait.
2011-06-20 12:53:29:  0: INFO: Received response: Ready
2011-06-20 12:53:29:  0: INFO: Finished Lightwave Rendering Phase
2011-06-20 12:53:41:  0: Task timeout is disabled.
2011-06-20 12:53:41:  0: Plugin rendering frame(s): 29
2011-06-20 12:53:46:  0: INFO: Starting Lightwave Rendering Phase...
2011-06-20 12:53:46:  0: INFO: Sending command: render
29 29 1
2011-06-20 12:53:49:  0: STDOUT: sendack: Ready
2011-06-20 12:53:49:  0: STDOUT: LightWave command: render.
2011-06-20 12:53:49:  0: STDOUT: sendack: Rendering frame 29
2011-06-20 12:53:49:  0: STDOUT: Allocating frame buffers.
2011-06-20 12:53:49:  0: STDOUT: Allocating segment buffers.
2011-06-20 12:53:49:  0: STDOUT: Frame: 29.
2011-06-20 12:53:49:  0: STDOUT: Segment: 1/1.
2011-06-20 12:53:49:  0: STDOUT: Pass: 1/1.
2011-06-20 12:53:49:  0: STDOUT: Updating geometry.
2011-06-20 12:53:49:  0: STDOUT: Moving WCLookLike_stackedgrid.
2011-06-20 12:53:49:  0: STDOUT: Moving WCLookLike_nosg2_allinone.
2011-06-20 12:53:49:  0: STDOUT: Moving L-Ex_WMap.
2011-06-20 12:53:49:  0: STDOUT: Moving R-Ex_WMap.
2011-06-20 12:53:49:  0: INFO: Received response: Rendering frame 29
2011-06-20 12:53:49:  0: INFO: Sending command: wait
2011-06-20 12:53:49:  0: STDOUT: 2011-06-20 12:53:48.810 ScreamerNet[78018:903] *** __NSAutoreleaseNoPool(): Object 0xf24e700 of class NSCFArray autoreleased with no pool in place - just leaking
2011-06-20 12:53:49:  0: STDOUT: 2011-06-20 12:53:48.811 ScreamerNet[78018:903] *** __NSAutoreleaseNoPool(): Object 0x337ab20 of class NSCFNumber autoreleased with no pool in place - just leaking
2011-06-20 12:53:49:  0: STDOUT: 2011-06-20 12:53:48.812 ScreamerNet[78018:903] *** __NSAutoreleaseNoPool(): Object 0xed78250 of class NSConcreteMutableData autoreleased with no pool in place - just leaking
2011-06-20 12:53:49:  0: STDOUT: 2011-06-20 12:53:48.813 ScreamerNet[78018:903] *** __NSAutoreleaseNoPool(): Object 0xf29e9a0 of class NSCFArray autoreleased with no pool in place - just leaking
2011-06-20 12:54:20:  0: STDOUT: 2011-06-20 12:54:20.391 ScreamerNet[78018:903] *** __NSAutoreleaseNoPool(): Object 0xf2a3830 of class NSCFArray autoreleased with no pool in place - just leaking
2011-06-20 12:54:20:  0: STDOUT: 2011-06-20 12:54:20.391 ScreamerNet[78018:903] *** __NSAutoreleaseNoPool(): Object 0x337ab20 of class NSCFNumber autoreleased with no pool in place - just leaking
2011-06-20 12:54:20:  0: STDOUT: Rendering frame 29, segment 1/1, pass 1/1.

Hmm, do you always see those ScreamerNet “__NSAutoreleaseNoPool” errors during rendering? I’m just wondering if LW’s leaking caused the system to run out of resources, which then prevented Deadline from starting up another process.

I would expect that restarting the machine would “fix” the problem, but it might be temporary if the system eventually runs out of resources again. If you continue to have this problem with a specific job, maybe try disabling the ScreamerNet option for it. With this option off, Deadline will just be performing normal LW command line renders, which means that LW is shut down and restarted between tasks. Yes, this adds some additional overhead to the rendering time, but it also means that all of LW’s memory is flushed between renders.
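
One way to test the "out of resources" idea before rebooting is to see how close the slave's user account is to its process limit. The gdb line in the crash output, "fork: Resource temporarily unavailable", is the EAGAIN error, which fork returns when the process limit has been reached or memory is too short to create a new process. The following is only a rough sketch, not anything Deadline ships: the function names are made up for illustration, and it only looks at the per-user process count, not at memory.

```python
import os
import resource
import subprocess


def process_headroom(current, soft_limit):
    """Return how many more processes fit under soft_limit, or None if unlimited."""
    if soft_limit == resource.RLIM_INFINITY:
        return None
    return max(soft_limit - current, 0)


def count_user_processes():
    """Count processes owned by the current user using the POSIX ps options."""
    # '-o pid=' prints one PID per line with no header row.
    # The count includes the ps process itself, so it is off by one at most.
    out = subprocess.check_output(["ps", "-U", str(os.getuid()), "-o", "pid="])
    return len(out.split())


if __name__ == "__main__":
    # RLIMIT_NPROC is the per-user process limit that fork checks.
    soft, _hard = resource.getrlimit(resource.RLIMIT_NPROC)
    running = count_user_processes()
    print("running:", running, "soft limit:", soft,
          "headroom:", process_headroom(running, soft))
```

Run it as the same user that runs the slave. If the headroom is near zero while a render is in progress, the process limit is a likely suspect; if there is plenty of headroom, memory is the next thing to look at.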

Cheers,
- Ryan

I have a suspicion that the __NSAutoreleaseNoPool errors are coming from Vue xStream 9.5. The rendernode system was fairly badly busted in the shipping release, and their latest EEF has only just got Vue 9.5 xStream working on my farm. The other nodes have been spitting out the same errors, but no other machine has bailed out yet. I’ll keep an eye on it.
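
For keeping an eye on it, a small script can tally the warnings in a slave log by process name and leaked class, which makes it easier to compare jobs that use Vue against ones that don't. This is a rough sketch that assumes the log lines look like the ones pasted above; the function name and the log path passed on the command line are placeholders.

```python
import re
import sys
from collections import Counter

# Matches e.g. "ScreamerNet[78018:903] *** __NSAutoreleaseNoPool(): Object
# 0xf24e700 of class NSCFArray autoreleased ..." and captures the process
# name and the class of the leaked object.
NOPOOL = re.compile(
    r"(\w+)\[\d+:\w+\] \*\*\* __NSAutoreleaseNoPool\(\): "
    r"Object 0x[0-9a-fA-F]+ of class (\w+) autoreleased"
)


def tally_nopool(lines):
    """Count NoPool warnings per (process name, class) pair."""
    counts = Counter()
    for line in lines:
        match = NOPOOL.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    # Usage: python tally_nopool.py <path to slave log>
    with open(sys.argv[1], errors="replace") as log:
        for (proc, cls), n in tally_nopool(log).most_common():
            print(n, proc, cls)
```

The process name will read ScreamerNet whichever plugin is responsible, so the tally cannot pin the leak on Vue by itself. It only shows whether the warning count climbs on jobs that load it.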