stalled machines not detected?

LaszloSebo · December 5, 2016, 12:55pm

We repeatedly find machines that are stalled out (not rendering, hanging etc), but deadline never detects that. For example:

Lapro1466 has crashed on this job yesterday around 9.42am, and even reported the error to deadline. The task got picked up by another machine, lapro0900 and the job eventually finished. But lapro1466 is still hanging there, it still says its working on the job, and is never ever marked as stalled.

[attachment=0]Capture.PNG[/attachment]

LaszloSebo · December 5, 2016, 12:59pm

When i remoted into the machine, it had a deadline sandbox process ‘application error’ window. As soon as i logged in, our popup handler detected it and clicked OK (weird that it wouldnt see it without me logging in…). Note that we dont have sandboxes enabled on the farm yet.

LaszloSebo · December 5, 2016, 1:02pm

ANother machine in a similar situation had the following popup:

Once i clicked OK on that error message, deadline slave disappeared. Otherwise the machine has been hanging like that for almost a day, and it was not marked stalled.

ryangagnon · December 5, 2016, 3:16pm

Hey Laszlo,

The sandbox error is likely the event sandbox, since event callbacks are still sandboxed even if sandboxing is disabled for plugins. It looks to me like this is the same error you reported earlier (forums.thinkboxsoftware.com/vie … 10&t=14927). There is an unhandled exception occuring in the render thread, and that is likely preventing the slave from doing anything. Would you be able to try upgrading to 8.0.12 and enable plugin sandboxing?

LaszloSebo · December 5, 2016, 5:28pm

Here is another one, where it seems that max crashed, but somehow the slave doesnt handle the crash properly (the job is since finished, the slave never gets marked as stalled):

[attachment=0]Capture.PNG[/attachment]

LaszloSebo · December 5, 2016, 5:31pm

Yeah we are in the process of integrating it and then doing our internal tests with it on our staging repo. Actual live rollout will hopefully happen within a week or two (things go slower due to risk to deliveries)

To me it seems like the “lifeline” thread of the slave is completely decoupled from the rendering process. So even though the rendering threads can totally crash out, the lifeline keeps pinging the db, so pulse never detects this machine as stalled. It may be better if the lifeline process was somehow dependent on the health of the critical threads the slave is running.

ryangagnon · December 5, 2016, 6:18pm

Yeah the slave scheduling thread is probably being killed by that unhandled exception in the render thread, so the only threads that remain are the UI thread and the thread that updates the slaves info. This scenario is one of the reasons we implemented sandboxed plugins, hopefully the upgrade resolves the issue for you.

LaszloSebo · December 5, 2016, 8:56pm

Since the ‘slave info’ update is used as a method to detect if the slave is still healthy, maybe it could be coupled to the other threads somehow? Say if it doesnt get updates from the other 2 threads, it stops reporting to mongo as well? Kinda like the canary in the mines?
Deadline’s canary is so robust, the mine explodes and the bird is still there chirping

ryangagnon · December 5, 2016, 8:58pm

Yeah ideally if the scheduler thread goes down we should just have the slave go down, and log what happened. I’ll add that request to the wishlist.

LaszloSebo · December 7, 2016, 8:39am

Here is another one, maybe these stact traces help make it more robust, even without sandboxing. In this particular case, the physical ram ran out temporarily (one would think in the age of 64bit, thats only going to make things slow, but not crash anything, but .net is magical )
The nuke process is already gone, the slave’s been hanging about 5 hours so far:

2016-12-06 19:44:55:  0: STDOUT: Writing S:/ncs/inferno2/projects/gate/scenes/NXN_017_0420/images/render2d/testcomp-test/v0036_jwl_test/beauty_2150x1134x1_linear/NXN_017_0420_testcomp-test_beauty_v0036.1017.exr took 249.59 seconds
2016-12-06 19:44:58:  0: STDOUT: Frame 1017 (1 of 2)
2016-12-06 19:46:15:  0: STDOUT: insertColor Iporting make
2016-12-06 19:46:15:  0: STDOUT: .9.9
2016-12-06 19:46:15:  0: INFO: Process exit code: -1073741819
2016-12-06 19:46:15:  0: An exception occurred: Error: Error: Renderer returned non-zero error code, -1073741819, which represents a Memory Access Violation. There may not be enough memory, or the memory has become corrupt.
2016-12-06 19:46:15:     at Deadline.Plugins.PluginWrapper.RenderTasks(String taskId, Int32 startFrame, Int32 endFrame, String& outMessage, AbortLevel& abortLevel) (Deadline.Plugins.RenderPluginException)
2016-12-06 19:46:15:  0: Unloading plugin: Nuke
2016-12-06 19:46:15:  Scheduler Thread - Render Thread 0 threw a major error: 
2016-12-06 19:46:15:  >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2016-12-06 19:46:15:  Exception Details
2016-12-06 19:46:15:  RenderPluginException -- Error: Error: Renderer returned non-zero error code, -1073741819, which represents a Memory Access Violation. There may not be enough memory, or the memory has become corrupt.
2016-12-06 19:46:15:     at Deadline.Plugins.PluginWrapper.RenderTasks(String taskId, Int32 startFrame, Int32 endFrame, String& outMessage, AbortLevel& abortLevel)
2016-12-06 19:46:15:  RenderPluginException.Cause: JobError (2)
2016-12-06 19:46:15:  RenderPluginException.Level: Major (1)
2016-12-06 19:46:15:  RenderPluginException.HasSlaveLog: True
2016-12-06 19:46:15:  RenderPluginException.SlaveLogFileName: C:\ProgramData\Thinkbox\Deadline8\logs\deadlineslave_secondary_renderthread_0-LAPRO1518-0000.log
2016-12-06 19:46:15:  Exception.Data: ( )
2016-12-06 19:46:15:  Exception.TargetSite: Void RenderTask(System.String, Int32, Int32)
2016-12-06 19:46:15:  Exception.Source: deadline
2016-12-06 19:46:15:  Exception.HResult: -2146233088
2016-12-06 19:46:15:    Exception.StackTrace: 
2016-12-06 19:46:15:     at Deadline.Plugins.Plugin.RenderTask(String taskId, Int32 startFrame, Int32 endFrame)
2016-12-06 19:46:15:     at Deadline.Slaves.SlaveRenderThread.a(TaskLogWriter A_0)
2016-12-06 19:46:15:  <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2016-12-06 19:46:16:  Error occurred while writing report log:
2016-12-06 19:46:16:  An error occurred while saving job report: Exception of type 'System.OutOfMemoryException' was thrown. (System.OutOfMemoryException)
2016-12-06 19:46:16:  System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
2016-12-06 19:46:16:     at System.Threading.Thread.StartInternal(IPrincipal principal, StackCrawlMark& stackMark)
2016-12-06 19:46:16:     at System.Threading.Thread.Start(StackCrawlMark& stackMark)
2016-12-06 19:46:16:     at System.Threading.Thread.Start(Object parameter)
2016-12-06 19:46:16:     at Deadline.Slaves.CommandListener.b()
2016-12-06 19:46:16:     at Deadline.Slaves.CommandListener.c()
2016-12-06 19:46:16:     at Deadline.Slaves.CommandListener..ctor(Int32 commandPort)
2016-12-06 19:46:16:     at a.a(String[] A_0)
2016-12-07 00:28:11:  Listener Thread - ::ffff:172.18.8.91 has connected
2016-12-07 00:28:11:  Listener Thread - Received message: StreamLog
2016-12-07 00:36:42:  Connecting to Slave log: LAPRO1518-secondary

LaszloSebo · December 7, 2016, 8:45am

Could something like this make any difference?
https://msdn.microsoft.com/en-us/library/hh285054(v=vs.110).aspx

LaszloSebo · December 7, 2016, 8:55am

Or maybe some IDisposable objects are not being disposed on exception?

LaszloSebo · December 7, 2016, 3:22pm

I’m investigating wrapping deadline into windows JobObjects, to ensure the rendering processes dont overflow. Looks pretty useful, might be something to consider as a core feature? Lets you limit cpu usage/ram usage etc

https://msdn.microsoft.com/en-us/library/ms684161(VS.85).aspx

MikeOwen · December 7, 2016, 3:39pm

Looks pretty useful, might be something to consider as a core feature?

We did. We call it “sandboxing”.

LaszloSebo · December 8, 2016, 11:07am

Ah awesome, so sandboxing uses windows JobObjects under the hood? Would it be possible to expose its ram usage limits?
Specifically the JobMemoryLimit from the JOBOBJECT_EXTENDED_LIMIT_INFORMATION struct. Are all deadline processes and subprocesses part of the same JobObject (even if there are multiple slaves running)?

We would like to maximize all the rendering processes to only ever go to 90% of the total physical ram in the machine.

MikeOwen · December 8, 2016, 11:21am

Yup.

Jon wrote a blog post on the subject just the other week, which I believe answers some of your questions (to save me re-typing):
deadline.thinkboxsoftware.com/fe … ow-you-had

As per Jon’s last paragraph in his blog post, I will give him a poke to carry on this conversation with you.

LaszloSebo · December 8, 2016, 2:34pm

Cool, thanks Mike!

It seems that each subprocess (nuke/max processes etc) is in a separateJobObject, so their limits are not shared:

Would it be possible to override this behavior, so all deadline spawned subprocesses (nuke, max, maya etc), independent of what slave they came from would be in the same JobObject? That way we could have global ram limits etc.

Sadly, on win7 a process can only belong to a single job, so we could not currently be able to create our own jobobject that ensures that these render processes dont over-use system resources. We could only control them separately by overwriting deadline’s jobobject’s settings, which is not flexible enough.

LaszloSebo · December 8, 2016, 2:35pm

Here is an example of a shared limit spanning across multiple processes:

jgaudet · December 8, 2016, 6:15pm

So we currently do not use Windows JobObjects in Deadline to limit machine resources.

We use it mostly to keep track of the PIDs that Deadline is responsible for, either directly (ie, we started them) or indirectly (a sub-process of the processes that we started). We have the JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE flag set, which should ensure that rendering processes are tied to the lifetime of the Sandbox they run in.

For those purposes, it made more sense to have one Job Object per render thread. If we do go the route of limiting machine resources, it would definitely make sense to have these re-used on a slave or machine-level. I’m pretty sure it’s rather trivial to do as well, it basically just comes down to naming the Job Objects something predictable, if I recall correctly. Might be something for us to look at as more people get on Windows 8+, which supports nested Job Objects

LaszloSebo · December 8, 2016, 11:56pm

Haha, good luck with that

Thanks for the additional info!