AWS Thinkbox Discussion Forums

slave crash, job is stuck

It seems housecleaning never caught this crash. Sadly, because of job interdependencies, it held up 20-30 other jobs overnight.

The slave tried to dequeue job 54377c7c0294981c04a81a6b, but seems to have crashed while doing so. It then held on to the stub until the artist noticed and notified me. When I checked the limits, lapro1455 was listed in the ‘holder’ field.

The log of lapro1455:

...
2014-10-10 05:59:54:  Scheduler - Returning 5436af319154b51534ca7bb6
2014-10-10 05:59:54:  Scheduler - Returning 5436c389cf71592f7037f39d
2014-10-10 05:59:54:  Scheduler - Returning 5436c3decf71591650910550
2014-10-10 05:59:54:  Scheduler - Returning 5436c3e1cf715918507a73f1
2014-10-10 05:59:54:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:54:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:54:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:54:  Scheduler - Returning 5436cdcac5f29c1a700671b7
2014-10-10 05:59:54:  Scheduler - Returning 5436cdccc5f29c1614af27e1
2014-10-10 05:59:54:  Scheduler - Returning 5436cdcfc5f29c0d6c434267
2014-10-10 05:59:54:  Scheduler - Returning 5436dfc12702df168cf95acb
2014-10-10 05:59:54:  Scheduler - Slave has been marked bad for job 5436dfce2702df2208777789, skipping this job.
2014-10-10 05:59:54:  Scheduler - Returning 5436dfe82702df1564a9c8a8
2014-10-10 05:59:54:  Scheduler - Returning 5436ea97b25c5a23146a5c10
2014-10-10 05:59:54:  Scheduler - Returning 54370d05c5f29c085cc6a8ca
2014-10-10 05:59:54:  Scheduler - Returning 54370d07c5f29c14e86f5ef4
2014-10-10 05:59:54:  Scheduler - Returning 54370d0ac5f29c1858cd5fff
2014-10-10 05:59:54:  Scheduler - Returning 54370d0cc5f29c14ac50bb9a
2014-10-10 05:59:54:  Scheduler - Preliminary check: The 543713789154b5134809a77d limit is maxed out.
2014-10-10 05:59:54:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:54:  Scheduler - Returning 54372e4f6882d01448c1a2a1
2014-10-10 05:59:54:  Scheduler - Returning 54372e516882d018a0219209
2014-10-10 05:59:54:  Scheduler - Returning 5437315d01f73e20d0072a07
2014-10-10 05:59:54:  Scheduler - Returning 543732e901f73e2700c5c18e
2014-10-10 05:59:54:  Scheduler - Returning 5437331601f73e1a9ca050fc
2014-10-10 05:59:54:  Scheduler - Returning 5437331901f73e1ac04b4884
2014-10-10 05:59:54:  Scheduler - Returning 54373f37162dfe18741caf09
2014-10-10 05:59:54:  Scheduler - Returning 54373f39162dfe01182b7a11
2014-10-10 05:59:54:  Scheduler - Returning 54373f3b162dfe2a3031e5a3
2014-10-10 05:59:54:  Scheduler - Returning 54373f3d162dfe17b0217433
2014-10-10 05:59:54:  Scheduler - Returning 54373f40162dfe2464f1c64a
2014-10-10 05:59:55:  Scheduler - Returning 54373f42162dfe0638f831dc
2014-10-10 05:59:55:  Scheduler - Returning 54373f44162dfe1fc8f01fe2
2014-10-10 05:59:55:  Scheduler - Returning 54373f46162dfe1dc40d6876
2014-10-10 05:59:55:  Scheduler - Returning 54373f49162dfe2b480126c1
2014-10-10 05:59:55:  Scheduler - Returning 54373f4b162dfe20ccf4e50b
2014-10-10 05:59:55:  Scheduler - Returning 54373f4d162dfe20bcecd311
2014-10-10 05:59:55:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:55:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:55:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:55:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:55:  Scheduler - Returning 543752bd74376811fc80df2f
2014-10-10 05:59:55:  Scheduler - Returning 543752c074376823f8b31492
2014-10-10 05:59:55:  Scheduler - Returning 543752c3743768211cdcea80
2014-10-10 05:59:55:  Scheduler - Returning 543752c674376820200f1fe3
2014-10-10 05:59:55:  Scheduler - Returning 5437586d743768046c9b4bb6
2014-10-10 05:59:55:  Scheduler - Returning 54375870743768227888ea12
2014-10-10 05:59:55:  Scheduler - Returning 543758737437680c74b2bfff
2014-10-10 05:59:55:  Scheduler - Returning 543758767437682384ed54d8
2014-10-10 05:59:55:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:55:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:55:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:55:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:55:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:55:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:55:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:55:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:55:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:55:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:55:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:55:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:55:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:55:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:55:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:55:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:55:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:55:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:55:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:55:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:55:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:55:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:55:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:55:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:55:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:55:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:55:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:55:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:55:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:55:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:55:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:55:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:55:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:55:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:55:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:55:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:55:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:55:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:55:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:55:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:55:  Scheduler - Preliminary check: Slave is not whitelisted for mx_128gb limit.  (from previous attempt)
2014-10-10 05:59:55:  Scheduler - Preliminary check: The 54377c6e0294983d549a6235 limit is maxed out.
2014-10-10 05:59:55:  Scheduler - Preliminary check: The 54377c700294980270668492 limit is maxed out.
2014-10-10 05:59:55:  Scheduler - Preliminary check: The 54377c750294981f34053a65 limit is maxed out.
2014-10-10 05:59:55:  Scheduler - Successfully dequeued 1 task(s).  Returning.
2014-10-10 05:59:55:  0: Shutdown
2014-10-10 05:59:56:  0: Shutdown
2014-10-10 05:59:56:  0: Exited ThreadMain(), cleaning up...
2014-10-10 05:59:57:  Scheduler Thread - Unexpected Error Occured
2014-10-10 05:59:57:  >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2014-10-10 05:59:57:  Exception Details
2014-10-10 05:59:57:  OutOfMemoryException -- Exception of type 'System.OutOfMemoryException' was thrown.
2014-10-10 05:59:57:  Exception.Data: ( )
2014-10-10 05:59:57:  Exception.TargetSite: Void StartInternal(System.Security.Principal.IPrincipal, System.Threading.StackCrawlMark ByRef)
2014-10-10 05:59:57:  Exception.Source: mscorlib
2014-10-10 05:59:57:    Exception.StackTrace: 
2014-10-10 05:59:57:     at System.Threading.Thread.StartInternal(IPrincipal principal, StackCrawlMark& stackMark)
2014-10-10 05:59:57:     at System.Threading.Thread.Start()
2014-10-10 05:59:57:     at Deadline.Slaves.SlaveRenderThread.Initialize(Int32 threadID)
2014-10-10 05:59:57:     at Deadline.Slaves.SlaveSchedulerThread.ThreadMain()
2014-10-10 05:59:57:  <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2014-10-10 11:05:10:  Connecting to slave log: LAPRO1455
2014-10-10 11:05:12:  Listener Thread - fe80::2cab:17ea:53ed:ff64%13 has connected
2014-10-10 11:05:12:  Listener Thread - Received message: StreamLog
2014-10-10 11:05:12:  Listener Thread - Responded with: Success
2014-10-10 11:05:26:  Slave - slave shutdown: normal
2014-10-10 11:05:26:  Listener Thread - OnConnect: Listener Socket has been closed.
2014-10-10 11:05:26:  Info Thread - requesting slave info thread quit.
2014-10-10 11:05:26:  Info Thread - shutdown complete

The slave was in a state where it wasn't doing anything, just sitting there, although I could still open and connect to the log. I had to restart it manually, and then it returned its stub.
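For anyone who wants to spot this pattern in their own slave logs, here is a minimal sketch. It assumes the log line layout shown above (timestamp, colon, message) and simply looks for a "Scheduler Thread - Unexpected Error" entry that follows a "Successfully dequeued" entry. The function name `find_scheduler_crashes` is made up for this example and is not part of Deadline.

```python
import re
from datetime import datetime

# Matches lines such as:
# "2014-10-10 05:59:55:  Scheduler - Successfully dequeued 1 task(s).  Returning."
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}):\s+(.*)$")


def find_scheduler_crashes(lines):
    """Return (dequeue_time, error_time) pairs where a scheduler thread
    error is logged after a successful dequeue."""
    crashes = []
    last_dequeue = None
    for raw in lines:
        match = LINE_RE.match(raw.rstrip("\n"))
        if not match:
            continue  # skip lines that do not follow the assumed layout
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        message = match.group(2)
        if message.startswith("Scheduler - Successfully dequeued"):
            last_dequeue = stamp
        elif (message.startswith("Scheduler Thread - Unexpected Error")
              and last_dequeue is not None):
            crashes.append((last_dequeue, stamp))
            last_dequeue = None
    return crashes


# Three lines taken from the log above.
sample = [
    "2014-10-10 05:59:55:  Scheduler - Successfully dequeued 1 task(s).  Returning.",
    "2014-10-10 05:59:55:  0: Shutdown",
    "2014-10-10 05:59:57:  Scheduler Thread - Unexpected Error Occured",
]
crashes = find_scheduler_crashes(sample)
```

This only flags candidates; a match does not prove the slave is still holding a stub, so the limit's holder list would still need to be checked by hand.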

Was the task stuck in the Waiting To Start state? This is something that is now handled better in Deadline 7.

The task was actually queued. The job is a sim job, so it had a machine limit of 1. With this machine hung like that, no other slave could pick the job up.

I think crashes like this one should be handled differently. It seems that slaves can get into a state where they are still running and pulsing, but have in fact crashed while holding on to job stubs. Housecleaning never finds them because they aren't ‘stalled’, even though they are actually hung.
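To make the suggestion concrete, here is a rough sketch of the kind of check I mean. It is not how Deadline's housecleaning works; the `SlaveState` record, its fields, and the two thresholds are all hypothetical. The idea is to flag a slave that is still pulsing and holds a stub but has shown no task activity for some time.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class SlaveState:
    # Hypothetical record; the field names are assumptions for this sketch.
    name: str
    last_pulse: datetime
    stub_acquired: Optional[datetime]       # None if no stub is held
    last_task_activity: Optional[datetime]  # None if no task has started


def find_hung_holders(slaves, now, stall_after, idle_after):
    """Return names of slaves that still pulse and hold a stub,
    but have shown no task activity for at least idle_after."""
    hung = []
    for slave in slaves:
        if slave.stub_acquired is None:
            continue  # holds nothing, so it cannot block a limit
        if now - slave.last_pulse >= stall_after:
            continue  # not pulsing: left to ordinary stalled-slave handling
        last_sign = slave.stub_acquired
        if (slave.last_task_activity is not None
                and slave.last_task_activity > last_sign):
            last_sign = slave.last_task_activity
        if now - last_sign >= idle_after:
            hung.append(slave.name)
    return hung


now = datetime(2014, 10, 10, 11, 0, 0)
slaves = [
    # Pulsing, holds a stub since the crash, never started a task.
    SlaveState("lapro1455", datetime(2014, 10, 10, 10, 59, 30),
               datetime(2014, 10, 10, 5, 59, 55), None),
    # Pulsing, holds a stub, and is actively rendering.
    SlaveState("healthy01", datetime(2014, 10, 10, 10, 59, 40),
               datetime(2014, 10, 10, 9, 0, 0),
               datetime(2014, 10, 10, 10, 58, 0)),
    # Stopped pulsing hours ago: an ordinary stalled slave.
    SlaveState("stalled01", datetime(2014, 10, 10, 6, 0, 0),
               datetime(2014, 10, 10, 5, 0, 0), None),
]
hung = find_hung_holders(slaves, now,
                         stall_after=timedelta(minutes=10),
                         idle_after=timedelta(minutes=30))
```

With the made-up data above, only lapro1455 is flagged. The healthy slave has recent task activity, and the one that stopped pulsing is left to whatever already handles stalled slaves.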

The situation with limit stubs is already handled properly in Deadline 7 as well. It was part of the refactoring required to have different stub types (per slave, per machine, and per task).
