Hey guys,
great job on beta14! Out of 240 nodes, I found only 1 that had crashed this morning!! HUGE improvement
Attached are the logs for that one slave, starting from the time it updated itself.
The slave was reported as stalled, but the slave process was not running on the machine. The launcher was still there.
cheers,
laszlo
deadlinelauncher-LAPRO0226-2013-03-07-0000.zip (122 KB)
Thanks for the update! It looks like we’ve addressed the hanging problem, but the slave can still completely die on occasion. Unfortunately, there isn’t anything in the logs explaining what happened, so it was probably memory corruption in the slave process.
Was it just Deadline running on this machine, or was assfreezer running on it too? There shouldn’t be a conflict, but just curious.
Also, was this in nogui mode or not?
Was there a Windows crash dialog on the machine? If so, can you generate the dump, zip it up, and post it?
You could also check the event log on the system for any information about the crash.
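If it helps, something along these lines would pull the relevant entries out of the Application log. This is just a minimal C# sketch for illustration (it assumes the crash shows up under the ".NET Runtime" source, as it usually does for an unhandled .NET exception), not part of Deadline:

using System;
using System.Diagnostics;

class CheckDotNetCrashes
{
    static void Main()
    {
        // Scan the Windows Application log for ".NET Runtime" errors,
        // which is where an unhandled exception in deadlineslave.exe
        // would normally be recorded.
        EventLog appLog = new EventLog("Application");
        foreach (EventLogEntry entry in appLog.Entries)
        {
            if (entry.Source == ".NET Runtime" &&
                entry.EntryType == EventLogEntryType.Error)
            {
                Console.WriteLine("{0}: {1}", entry.TimeGenerated, entry.Message);
            }
        }
    }
}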
Thanks!
There was no crash dump / dialog, sadly…
However, the Event Viewer was a BINGO:
"Event Type: Error
Event Source: .NET Runtime
Event Category: None
Event ID: 1026
Date: 3/8/2013
Time: 5:50:52 AM
User: N/A
Computer: LAPRO0226
Description:
Application: deadlineslave.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: System.OutOfMemoryException
Stack:
at FranticX.Utils.ExceptionUtils.ToString(System.Exception)
at Deadline.Slaves.SlaveInfoThread.d()
at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
at System.Threading.ThreadHelper.ThreadStart()
For more information, see Help and Support Center at go.microsoft.com/fwlink/events.asp.
"
The launcher was still running; both slave & launcher were in GUI mode.
It did have assfreezer running, potentially rendering very memory-intensive jobs.
The machine had no groups / pools assigned, so while it was an active slave, it had no jobs to work on. That’s basically how we ‘turned off’ Deadline on parts of our farm while still putting load on the MongoDB / repository for testing purposes, and potentially catching hanging / crashing issues with the slave that are independent of the render process.
That could be it. We’re going to try adding a special exception handler for out-of-memory exceptions that will attempt to restart the slave when something like this happens.
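Roughly, the idea is something along these lines. This is just a sketch with hypothetical names to show the shape of it, not the actual slave code:

using System;
using System.Diagnostics;

static class SlaveOomGuard
{
    // Wrap a worker thread body so an OutOfMemoryException escaping it
    // restarts the slave instead of silently killing the process.
    public static void RunGuarded(Action threadBody)
    {
        try
        {
            threadBody();
        }
        catch (OutOfMemoryException)
        {
            // Avoid building large log strings here; formatting the original
            // exception is exactly where the crash above ran out of memory.
            Console.Error.WriteLine("Out of memory - restarting slave.");
            RestartSlave();
        }
    }

    static void RestartSlave()
    {
        // Start a fresh slave process, then exit the broken one.
        Process.Start("deadlineslave.exe");
        Environment.Exit(1);
    }
}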
Cheers,