AWS Thinkbox Discussion Forums

slave crash

This is a new one, slave has been saying that its rendering for 4+ days:

Capture.PNG

Current log is full of this:

Listener Thread - fe80::e837:bb50:7f0:f9f7%18 has connected
Listener Thread - Received message: SlaveStillRunning
Listener Thread - Responded with: Success: Yes
Listener Thread - fe80::e837:bb50:7f0:f9f7%18 has connected
Listener Thread - Received message: SlaveStillRunning
Listener Thread - Responded with: Success: Yes
Listener Thread - fe80::e837:bb50:7f0:f9f7%18 has connected
Listener Thread - Received message: SlaveStillRunning
Listener Thread - Responded with: Success: Yes
Listener Thread - fe80::e837:bb50:7f0:f9f7%18 has connected
Listener Thread - Received message: SlaveStillRunning
Listener Thread - Responded with: Success: Yes
Listener Thread - fe80::e837:bb50:7f0:f9f7%18 has connected
Listener Thread - Received message: SlaveStillRunning
Listener Thread - Responded with: Success: Yes
Listener Thread - fe80::e837:bb50:7f0:f9f7%18 has connected
Listener Thread - Received message: SlaveStillRunning
Listener Thread - Responded with: Success: Yes
Listener Thread - fe80::e837:bb50:7f0:f9f7%18 has connected
Listener Thread - Received message: SlaveStillRunning
Listener Thread - Responded with: Success: Yes
Listener Thread - fe80::e837:bb50:7f0:f9f7%18 has connected
Listener Thread - Received message: SlaveStillRunning
Listener Thread - Responded with: Success: Yes
Connecting to slave log: LAPRO1652
Listener Thread - fe80::e837:bb50:7f0:f9f7%18 has connected
Listener Thread - Received message: SlaveStillRunning
Listener Thread - Responded with: Success: Yes
Listener Thread - fe80::e837:bb50:7f0:f9f7%18 has connected
Listener Thread - Received message: StreamLog
Listener Thread - Responded with: Success
Listener Thread - fe80::e837:bb50:7f0:f9f7%18 has connected
Listener Thread - Received message: SlaveStillRunning

Crash log from a couple of days ago:

2014-10-10 15:22:34:  0: WARNING: Monitored managed process 3dsmaxProcess is no longer running
2014-10-10 15:22:36:  0: An exception occurred: Error in RenderTasks: RenderTask: Unexpected exception (Monitored managed process "3dsmaxProcess" has exited or been terminated.
2014-10-10 15:22:36:   current Cache. Totally Processed 1663 Buckets of 1943, Skipped:,Processed Elements: 0 of 498735273
2014-10-10 15:22:36:  2014/10/10 15:22:28 DBG: [Flowline-DEBUG] ImplicitElement_CascadeSpray|ImplicitSurface : Block: KDTree 5 Processing 85.537827 Percent of current Cache. Totally Processed 1663 Buckets of 1943, Skipped:,Processed Elements: 0 of 498735273
2014-10-10 15:22:36:  2014/10/10 15:22:28 DBG: [Flowline-DEBUG] ImplicitElement_CascadeSpray|ImplicitSurface : Block: KDTree 0 Processing 85.537827 Percent of current Cache. Totally Processed 1663 Buckets of 1943, Skipped:,Processed Elements: 0 of 498735273
2014-10-10 15:22:36:  2014/10/10 15:22:28 DBG: [Flowline-DEBUG] ImplicitElement_CascadeSpray|ImplicitSurface : Block: KDTree 4 Processing 85.537827 Percent of current Cache. Totally Processed 1663 Buckets of 1943, Skipped:,Processed Elements: 0 of 498735273
2014-10-10 15:22:36:  2014/10/10 15:22:28 DBG: [Flowline-DEBUG] ImplicitElement_CascadeSpray|ImplicitSurface : Block: KDTree 3 Processing 85.537827 Percent of current Cache. Totally Processed 1663 Buckets of 1943, Skipped:,Processed Elements: 0 of 498735273
2014-10-10 15:22:36:  2014/10/10 15:22:28 DBG: [Flowline-DEBUG] ImplicitElement_CascadeSpray|ImplicitSurface : Block: KDTree 1 Processing 85.537827 Percent of current Cache. Totally Processed 1663 Buckets of 1943, Skipped:,Processed Elements: 0 of 498735273
2014-10-10 15:22:36:  2014/10/10 15:22:28 DBG: [Flowline-DEBUG] ImplicitElement_CascadeSpray|ImplicitSurface : Block: KDTree 10 Processing 85.537827 Percent of current Cache. Totally Processed 1663 Buckets of 1943, Skipped:,Processed Elements: 0 of 498735273
2014-10-10 15:22:36:  2014/10/10 15:22:29 ERR: [Flowline-ERROR] Element_CasSpray|Elements : Could not get enough memory for elements, probably fragmented memory
2014-10-10 15:22:36:  2014/10/10 15:22:32 ERR: An unexpected exception has occurred in the network renderer and it is terminating.
2014-10-10 15:22:36:  )
2014-10-10 15:22:36:     at Deadline.Plugins.ScriptPlugin.RenderTasks(String taskId, Int32 startFrame, Int32 endFrame, String& outMessage, AbortLevel& abortLevel) (Deadline.Plugins.RenderPluginException)
2014-10-10 15:22:36:  0: Unloading plugin: 3dsmax
2014-10-10 15:22:37:  Scheduler Thread - Render Thread 0 threw a major error: 
2014-10-10 15:22:37:  >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2014-10-10 15:22:37:  Exception Details
2014-10-10 15:22:37:  RenderPluginException -- Error in RenderTasks: RenderTask: Unexpected exception (Monitored managed process "3dsmaxProcess" has exited or been terminated.
2014-10-10 15:22:37:   current Cache. Totally Processed 1663 Buckets of 1943, Skipped:,Processed Elements: 0 of 498735273
2014-10-10 15:22:37:  2014/10/10 15:22:28 DBG: [Flowline-DEBUG] ImplicitElement_CascadeSpray|ImplicitSurface : Block: KDTree 5 Processing 85.537827 Percent of current Cache. Totally Processed 1663 Buckets of 1943, Skipped:,Processed Elements: 0 of 498735273
2014-10-10 15:22:37:  2014/10/10 15:22:28 DBG: [Flowline-DEBUG] ImplicitElement_CascadeSpray|ImplicitSurface : Block: KDTree 0 Processing 85.537827 Percent of current Cache. Totally Processed 1663 Buckets of 1943, Skipped:,Processed Elements: 0 of 498735273
2014-10-10 15:22:37:  2014/10/10 15:22:28 DBG: [Flowline-DEBUG] ImplicitElement_CascadeSpray|ImplicitSurface : Block: KDTree 4 Processing 85.537827 Percent of current Cache. Totally Processed 1663 Buckets of 1943, Skipped:,Processed Elements: 0 of 498735273
2014-10-10 15:22:37:  2014/10/10 15:22:28 DBG: [Flowline-DEBUG] ImplicitElement_CascadeSpray|ImplicitSurface : Block: KDTree 3 Processing 85.537827 Percent of current Cache. Totally Processed 1663 Buckets of 1943, Skipped:,Processed Elements: 0 of 498735273
2014-10-10 15:22:37:  2014/10/10 15:22:28 DBG: [Flowline-DEBUG] ImplicitElement_CascadeSpray|ImplicitSurface : Block: KDTree 1 Processing 85.537827 Percent of current Cache. Totally Processed 1663 Buckets of 1943, Skipped:,Processed Elements: 0 of 498735273
2014-10-10 15:22:37:  2014/10/10 15:22:28 DBG: [Flowline-DEBUG] ImplicitElement_CascadeSpray|ImplicitSurface : Block: KDTree 10 Processing 85.537827 Percent of current Cache. Totally Processed 1663 Buckets of 1943, Skipped:,Processed Elements: 0 of 498735273
2014-10-10 15:22:37:  2014/10/10 15:22:29 ERR: [Flowline-ERROR] Element_CasSpray|Elements : Could not get enough memory for elements, probably fragmented memory
2014-10-10 15:22:37:  2014/10/10 15:22:32 ERR: An unexpected exception has occurred in the network renderer and it is terminating.
2014-10-10 15:22:37:  )
2014-10-10 15:22:37:     at Deadline.Plugins.ScriptPlugin.RenderTasks(String taskId, Int32 startFrame, Int32 endFrame, String& outMessage, AbortLevel& abortLevel)
2014-10-10 15:22:37:  RenderPluginException.Cause: JobError (2)
2014-10-10 15:22:37:  RenderPluginException.Level: Major (1)
2014-10-10 15:22:37:  RenderPluginException.HasSlaveLog: True
2014-10-10 15:22:37:  Exception.Data: ( )
2014-10-10 15:22:37:  Exception.TargetSite: Void RenderTask(System.String, Int32, Int32)
2014-10-10 15:22:37:  Exception.Source: deadline
2014-10-10 15:22:37:  Exception.HResult: -2146233088
2014-10-10 15:22:37:    Exception.StackTrace: 
2014-10-10 15:22:37:     at Deadline.Plugins.Plugin.RenderTask(String taskId, Int32 startFrame, Int32 endFrame)
2014-10-10 15:22:37:     at Deadline.Slaves.SlaveRenderThread.RenderCurrentTask(TaskLogWriter tlw)
2014-10-10 15:22:37:  <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2014-10-10 15:22:39:  Scheduler Thread - Seconds before next job scan: 2
2014-10-10 15:22:44:  Scheduler Thread - exception occurred:

Seems like there was a max crash, then the scheduled had an exception tooā€¦
Stalled slave detection didnā€™t catch it.

Clicking ok on that exception window made the slave instantly disappear.

Looks like the slaveā€™s info thread was still running, which was why it didnā€™t appear as stalled (the info thread is still reporting the slaveā€™s state on a regular basis). Looks like the slaveā€™s memory may have become corrupt.

I wonder if we could have the info thread check the state of the other threads, and try to clean things up and report a status of stalled. Hard to say if that would even help in this case if the slaveā€™s memory is indeed corruptā€¦

Might add another layer of robustness, we have found a couple of cases so far where this could happen (some threads frozen, some are active, slave never gets ā€˜stalledā€™).

A note about these particular crashes:
If i sort the slaves by status, anything thats over 10hrs is most likely in this state, currently about 30 or so machines. As i was remoting into these boxes, i found that some have run out of resources and/or were rendering jobs with rather high ram usage.
Sadly, thatā€™s not something we can fix, those ā€œstray jobsā€ will always happen.

The problem is that deadline is expected to be ā€˜rock solidā€™, so a machine running low on ram should not affect its behavior. I could be wrong, but my guess is that a managed programming language like c# will never be a good foundation for this expectation, since you are not in direct control of memory management.

Out of curiosity, on the machines that in this state, what was typically the last thing the slave did in the log before the crash? Was it try to report a render error like in this case, or after a task completes, or is it basically random?

The reason I ask is that there is a known issue in Deadline 6 where the slave can use up a lot of RAM when creating a job report for a task with a very long render log. This can cause the slave to balloon up in size temporarily, and if the systemā€™s resources are already strained, can result in this memory problem. We have fixed this issue in Deadline 7, so the slave will no longer eat up lots of RAM when saving the reports.

C# has a class that can be used to help avoid memory exceptions:
msdn.microsoft.com/en-us/library ā€¦ 00%29.aspx

That could be something we can consider looking at when doing the Python sandboxing for Deadline 8. The nice thing about this sandboxing is that most of the memory fluctuation will now be in the sandboxed process instead of the main Slave process.

Cheers,
Ryan

The cases i looked at were all right after max crashes. Whether its the log writing though, its hard to tell.

Here is another example (its weird that the slave is still interactive, i can open logs, use the menus etcā€¦ till i click ok on that ā€œA new guard page etcā€ error).
In this case it looks like its some mongo communication:

2014-10-12 22:45:37:  0: INFO: Building motion blur raycast accelerator...: done [00:00:00.7]
2014-10-12 22:45:37:  0: INFO: Preparing direct light manager...
2014-10-12 22:45:37:  0: INFO: Preparing global light manager...
2014-10-12 23:44:14:  0: WARNING: Monitored managed process 3dsmaxProcess is no longer running
2014-10-12 23:44:26:  0: An exception occurred: Error in RenderTasks: RenderTask: Unexpected exception (Monitored managed process "3dsmaxProcess" has exited or been terminated.
2014-10-12 23:44:26:   current Cache. Totally Processed 591 Buckets of 591, Skipped:,Processed Elements: 0 of 38416983
2014-10-12 23:44:26:  2014/10/12 23:08:07 DBG: [Flowline-DEBUG] ImplicitElement_CascadeMist|ImplicitSurface : Block: KDTree 18 Processing 99.830795 Percent of current Cache. Totally Processed 591 Buckets of 591, Skipped:,Processed Elements: 0 of 38416983
2014-10-12 23:44:26:  2014/10/12 23:08:07 DBG: [Flowline-DEBUG] ImplicitElement_CascadeMist|ImplicitSurface : Block: KDTree 16 Processing 99.830795 Percent of current Cache. Totally Processed 591 Buckets of 591, Skipped:,Processed Elements: 0 of 38416983
2014-10-12 23:44:26:  2014/10/12 23:08:07 DBG: [Flowline-DEBUG] ImplicitElement_CascadeMist|ImplicitSurface : Block: KDTree 19 Processing 99.830795 Percent of current Cache. Totally Processed 591 Buckets of 591, Skipped:,Processed Elements: 0 of 38416983
2014-10-12 23:44:26:  2014/10/12 23:08:07 DBG: [Flowline-DEBUG] ImplicitElement_CascadeMist|ImplicitSurface : Block: KDTree 10 Processing 99.830795 Percent of current Cache. Totally Processed 591 Buckets of 591, Skipped:,Processed Elements: 0 of 38416983
2014-10-12 23:44:26:  2014/10/12 23:08:07 INF: [Flowline-TRACE] Element_CasMist|172.18.0.85 SC_FileAccessTCP_endConnection_Send: Failed to send data!
2014-10-12 23:44:26:  2014/10/12 23:08:07 INF: [Flowline-TRACE] Element_CasMist|172.18.0.85 SC_FileAccessTCP_endConnection_Send: Failed to send data!
2014-10-12 23:44:26:  2014/10/12 23:08:17 DBG: [Flowline-DEBUG] ImportMesh_WaveShellSharp|Mesh : ImplicitBlockData invalid for this frame
2014-10-12 23:44:26:  2014/10/12 23:08:17 DBG: [Flowline-DEBUG] ImportMesh_WaveShellSharp|Mesh : BeforeGetIsoPolygon, this is synced
2014-10-12 23:44:26:  2014/10/12 23:08:17 DBG: [Flowline-DEBUG] ImportMesh_WaveShellSharp|Mesh : IsoPolyValidTime is not Valid
2014-10-12 23:44:26:  2014/10/12 23:08:17 DBG: [Flowline-DEBUG] ImplicitShadow_Fill|ImplicitSurface_Modified : Creating Implicit Shadow Structure
2014-10-12 23:44:26:  2014/10/12 23:08:17 DBG: [Flowline-DEBUG] ImplicitShadow_Fill|ImplicitSurface_Modified : Creating Implicit Shadow Structure
2014-10-12 23:44:26:  2014/10/12 23:08:17 DBG: [Flowline-DEBUG] ImplicitShadow_Fill|ImplicitSurface_Modified : Creating Implicit Shadow Structure
2014-10-12 23:44:26:  2014/10/12 23:44:13 ERR: [V-Ray] UNHANDLED EXCEPTION: Rendering region (1344,784)-(1352,792) Last marker is at ./src/globillummap.cpp, line 1526: GlobalIllumMap::buildLightMap() {1} 
2014-10-12 23:44:26:  )
2014-10-12 23:44:26:     at Deadline.Plugins.ScriptPlugin.RenderTasks(String taskId, Int32 startFrame, Int32 endFrame, String& outMessage, AbortLevel& abortLevel) (Deadline.Plugins.RenderPluginException)
2014-10-12 23:44:32:  0: Unloading plugin: 3dsmax
2014-10-12 23:44:34:  Scheduler Thread - Render Thread 0 threw a major error: 
2014-10-12 23:44:34:  >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2014-10-12 23:44:37:  Exception Details
2014-10-12 23:44:37:  RenderPluginException -- Error in RenderTasks: RenderTask: Unexpected exception (Monitored managed process "3dsmaxProcess" has exited or been terminated.
2014-10-12 23:44:37:   current Cache. Totally Processed 591 Buckets of 591, Skipped:,Processed Elements: 0 of 38416983
2014-10-12 23:44:37:  2014/10/12 23:08:07 DBG: [Flowline-DEBUG] ImplicitElement_CascadeMist|ImplicitSurface : Block: KDTree 18 Processing 99.830795 Percent of current Cache. Totally Processed 591 Buckets of 591, Skipped:,Processed Elements: 0 of 38416983
2014-10-12 23:44:37:  2014/10/12 23:08:07 DBG: [Flowline-DEBUG] ImplicitElement_CascadeMist|ImplicitSurface : Block: KDTree 16 Processing 99.830795 Percent of current Cache. Totally Processed 591 Buckets of 591, Skipped:,Processed Elements: 0 of 38416983
2014-10-12 23:44:37:  2014/10/12 23:08:07 DBG: [Flowline-DEBUG] ImplicitElement_CascadeMist|ImplicitSurface : Block: KDTree 19 Processing 99.830795 Percent of current Cache. Totally Processed 591 Buckets of 591, Skipped:,Processed Elements: 0 of 38416983
2014-10-12 23:44:37:  2014/10/12 23:08:07 DBG: [Flowline-DEBUG] ImplicitElement_CascadeMist|ImplicitSurface : Block: KDTree 10 Processing 99.830795 Percent of current Cache. Totally Processed 591 Buckets of 591, Skipped:,Processed Elements: 0 of 38416983
2014-10-12 23:44:37:  2014/10/12 23:08:07 INF: [Flowline-TRACE] Element_CasMist|172.18.0.85 SC_FileAccessTCP_endConnection_Send: Failed to send data!
2014-10-12 23:44:37:  2014/10/12 23:08:07 INF: [Flowline-TRACE] Element_CasMist|172.18.0.85 SC_FileAccessTCP_endConnection_Send: Failed to send data!
2014-10-12 23:44:37:  2014/10/12 23:08:17 DBG: [Flowline-DEBUG] ImportMesh_WaveShellSharp|Mesh : ImplicitBlockData invalid for this frame
2014-10-12 23:44:37:  2014/10/12 23:08:17 DBG: [Flowline-DEBUG] ImportMesh_WaveShellSharp|Mesh : BeforeGetIsoPolygon, this is synced
2014-10-12 23:44:37:  2014/10/12 23:08:17 DBG: [Flowline-DEBUG] ImportMesh_WaveShellSharp|Mesh : IsoPolyValidTime is not Valid
2014-10-12 23:44:37:  2014/10/12 23:08:17 DBG: [Flowline-DEBUG] ImplicitShadow_Fill|ImplicitSurface_Modified : Creating Implicit Shadow Structure
2014-10-12 23:44:37:  2014/10/12 23:08:17 DBG: [Flowline-DEBUG] ImplicitShadow_Fill|ImplicitSurface_Modified : Creating Implicit Shadow Structure
2014-10-12 23:44:37:  2014/10/12 23:08:17 DBG: [Flowline-DEBUG] ImplicitShadow_Fill|ImplicitSurface_Modified : Creating Implicit Shadow Structure
2014-10-12 23:44:37:  2014/10/12 23:44:13 ERR: [V-Ray] UNHANDLED EXCEPTION: Rendering region (1344,784)-(1352,792) Last marker is at ./src/globillummap.cpp, line 1526: GlobalIllumMap::buildLightMap() {1} 
2014-10-12 23:44:37:  )
2014-10-12 23:44:37:     at Deadline.Plugins.ScriptPlugin.RenderTasks(String taskId, Int32 startFrame, Int32 endFrame, String& outMessage, AbortLevel& abortLevel)
2014-10-12 23:44:37:  RenderPluginException.Cause: JobError (2)
2014-10-12 23:44:37:  RenderPluginException.Level: Major (1)
2014-10-12 23:44:37:  RenderPluginException.HasSlaveLog: True
2014-10-12 23:44:37:  Exception.Data: ( )
2014-10-12 23:44:37:  Exception.TargetSite: Void RenderTask(System.String, Int32, Int32)
2014-10-12 23:44:37:  Exception.Source: deadline
2014-10-12 23:44:37:  Exception.HResult: -2146233088
2014-10-12 23:44:37:    Exception.StackTrace: 
2014-10-12 23:44:37:     at Deadline.Plugins.Plugin.RenderTask(String taskId, Int32 startFrame, Int32 endFrame)
2014-10-12 23:44:37:     at Deadline.Slaves.SlaveRenderThread.RenderCurrentTask(TaskLogWriter tlw)
2014-10-12 23:44:37:  <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2014-10-12 23:44:45:  An error occurred while saving job report: An unexpected error occurred while interacting with the database (deadline01.scanlinevfxla.com:27017,deadline.scanlinevfxla.com:27017,deadline03.scanlinevfxla.com:27017):
2014-10-12 23:44:45:  Exception of type 'System.OutOfMemoryException' was thrown. (System.OutOfMemoryException)
2014-10-12 23:44:45:     at MongoDB.Driver.Internal.MongoUpdateMessage.WriteHeaderTo(BsonBuffer buffer)
2014-10-12 23:44:45:     at MongoDB.Driver.Operations.UpdateOpcodeOperation.Execute(MongoConnection connection)
2014-10-12 23:44:45:     at MongoDB.Driver.MongoCollection.Update(IMongoQuery query, IMongoUpdate update, MongoUpdateOptions options)
2014-10-12 23:44:45:     at Deadline.StorageDB.MongoDB.MongoJobStorage.Internal_AddJobReport(String jobID, Report report, String reportLog) (FranticX.Database.DatabaseConnectionException)
2014-10-12 23:44:48:  Scheduler Thread - Seconds before next job scan: 2
2014-10-12 23:44:53:  Scheduler Thread - exception occurred:

Thanks! This definitely indicates that itā€™s due to the additional memory that the slave uses when saving out the job reports. We suspect that this shouldnā€™t be an issue with Deadline 7, but itā€™s definitely something weā€™ll keep our eye on.

We have a small render farm with
Windows 10.1809,
3ds Max 2019 update 3,
Corona Renderer 4 update 1,
Deadline 10.0.27.3.
Some nodes with 32Gb RAM experience such problem. Sometimes they donā€™t render and show message:

A new guard page for the stack cannot be created

Am I right that the reason is ā€œnot enough RAMā€?

1 Like

Has this issue been resolved or has any information came to light about it? I have ran across this same issue with a few of our High end cpu nodes on our farm that coincidentally have 32gb of ram. I am however submitting my job with vray and the memory allocation is set to use whats available.

Privacy | Site terms | Cookie preferences