
machine limit 1, stuck stubs

With Deadline 8, we are running into the issue again that was resolved in Deadline 7, where a job with a machine limit of 1 suddenly stops rendering or queuing any frames because, apparently, all of its limit stubs are spent. To work around it we have to raise the job's machine limit to 2, and then one machine picks it up.
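
For reference, the workaround is literally just raising the job's machine limit from 1 to 2. A rough sketch of scripting that same bump, assuming the RepositoryUtils.GetJob/SaveJob calls and the JobMachineLimit property behave as described in the Deadline scripting docs (the job ID is a placeholder):

# Rough sketch only -- assumes RepositoryUtils.GetJob/SaveJob and the
# JobMachineLimit property behave as in the Deadline scripting docs.
from Deadline.Scripting import RepositoryUtils

def bump_machine_limit(job_id, new_limit=2):
    # job_id is a placeholder, e.g. "573cc10ae869c241b061522f"
    job = RepositoryUtils.GetJob(job_id, True)  # True = don't use a cached copy
    if job is None:
        return
    job.JobMachineLimit = new_limit             # raise the limit so a slave can dequeue again
    RepositoryUtils.SaveJob(job)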

Did the fix for that maybe not make it into Deadline 8?

Hmmm, everything in 7 should be merged into 8.0, but I can double-check to make sure.

We did make some changes to Limits in 8.0, though, so it’s possible the merged change got clobbered, or that this is just a straight regression. Do you have a reference handy for when we fixed the bug? That should help me find the specific commit that fixed it; if not, I’m sure I’ll be able to dig it up.

Darn… I can’t find those mails/posts. :\

Some additional info from one slave we found in this state this morning (we get maybe 10 a day).

LAPRO0865 picked up the job and tried to process it. It executed an OnSlaveStartingJob event, then crashed.

Last log of the slave:

2016-05-18 12:25:19:  Scanline: OnSlaveStartingJob sSlaveName LAPRO0865
2016-05-18 12:25:19:  Scanline: OnSlaveStartingJob sJobID 573cc10ae869c241b061522f
2016-05-18 12:25:19:  Scanline: RemoveJobFolderContents event called LAPRO0865
2016-05-18 12:25:19:  Scanline: cleaning sJobsDataFolder: C:\Users\scanlinevfx\AppData\Local\Thinkbox\Deadline8\slave\LAPRO0865\jobsData
2016-05-18 12:25:19:  Scanline: removing : C:\Users\scanlinevfx\AppData\Local\Thinkbox\Deadline8\slave\LAPRO0865\jobsData\573ca6911186a225241329ec
2016-05-18 12:25:21:  Scheduler Thread - Synchronization time for job files: 46.797 ms
2016-05-18 12:25:21:  Syncronizing Plugin Python from \\inferno2\deadline\repository8\plugins\Python
2016-05-18 12:25:22:  Syncronization complete. Took: 0 seconds
2016-05-18 12:25:22:  0: Got task!
2016-05-18 12:25:22:  0: Plugin will be reloaded because a new job has been loaded, or one of the job files or plugin files has been modified
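
The “Scanline:” lines above come from our own event plugin. Purely for context, a trimmed-down sketch of that kind of OnSlaveStartingJob listener, following the documented DeadlineEventListener pattern, would look roughly like this (the path is the one from the log above, and the cleanup logic is illustrative rather than our production code):

# Illustrative sketch of a Deadline event plugin with an OnSlaveStartingJob hook.
# The cleanup logic is a simplified stand-in, not our production code.
import os
import shutil

from Deadline.Events import DeadlineEventListener
from Deadline.Scripting import ClientUtils

def GetDeadlineEventListener():
    return RemoveJobFolderContentsListener()

def CleanupDeadlineEventListener(eventListener):
    eventListener.Cleanup()

class RemoveJobFolderContentsListener(DeadlineEventListener):
    def __init__(self):
        self.OnSlaveStartingJobCallback += self.OnSlaveStartingJob

    def Cleanup(self):
        del self.OnSlaveStartingJobCallback

    def OnSlaveStartingJob(self, slaveName, job):
        ClientUtils.LogText("Scanline: OnSlaveStartingJob sSlaveName %s" % slaveName)
        ClientUtils.LogText("Scanline: OnSlaveStartingJob sJobID %s" % job.JobId)

        # Path as it appears in the slave log above; in practice this would be
        # resolved per machine rather than hard-coded.
        jobs_data = r"C:\Users\scanlinevfx\AppData\Local\Thinkbox\Deadline8\slave\%s\jobsData" % slaveName
        for entry in os.listdir(jobs_data):
            path = os.path.join(jobs_data, entry)
            ClientUtils.LogText("Scanline: removing : %s" % path)
            if os.path.isdir(path):
                shutil.rmtree(path, ignore_errors=True)
            else:
                os.remove(path)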

In fact, it crashed twice in a row:

Application: deadlinesandbox.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: System.IO.IOException
Stack:
   at System.Net.Sockets.NetworkStream.EndRead(System.IAsyncResult)
   at Deadline.Net.DeadlineMessageUtils+a.d(System.IAsyncResult)
   at System.Net.LazyAsyncResult.Complete(IntPtr)
   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
   at System.Net.ContextAwareResult.Complete(IntPtr)
   at System.Net.Sockets.BaseOverlappedAsyncResult.CompletionPortCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)
   at System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)

Immediately followed by another sandbox crash:

Application: deadlinesandbox.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: System.IO.IOException
Stack:
   at System.Net.Sockets.NetworkStream.EndRead(System.IAsyncResult)
   at Deadline.Net.DeadlineMessageUtils+a.d(System.IAsyncResult)
   at System.Net.LazyAsyncResult.Complete(IntPtr)
   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
   at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
   at System.Net.ContextAwareResult.Complete(IntPtr)
   at System.Net.Sockets.BaseOverlappedAsyncResult.CompletionPortCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)
   at System.Threading._IOCompletionCallback.PerformIOCompletionCallback(UInt32, UInt32, System.Threading.NativeOverlapped*)

The slave still thinks it’s rendering. Nothing in the logs for 22 hours now (the job has a 20-hour timeout, which did not get handled).

However, in the Monitor it’s clearly unassigned (the task is just queued). The Pulse logs show LAPRO0865 as no longer rendering, but the stub is not returned:

2016-05-18 12:31:07:  Performing repository repair
2016-05-18 12:31:07:  Performing Orphaned Task Scan...
2016-05-18 12:31:07:      Orphaned Task Scan - Loading rendering jobs
2016-05-18 12:31:07:      Orphaned Task Scan - Loaded 271 rendering jobs in 25.284 ms
2016-05-18 12:31:07:      Orphaned Task Scan - Scanning for orphaned tasks
2016-05-18 12:31:07:      Orphaned Task Scan - Separated jobs into 3 lists of 100
2016-05-18 12:31:07:      Orphaned Task Scan - Scanning job list 1 of 3 (100 jobs)
2016-05-18 12:31:08:      Orphaned Task Scan - Scanning job list 2 of 3 (100 jobs)
2016-05-18 12:31:08:      Orphaned Task Scan - Scanning job list 3 of 3 (71 jobs)
2016-05-18 12:31:08:      Orphaned Task Scan - Requeuing orphaned task '1' for LAPRO1806: the task is in the rendering state, but the slave is no longer rendering this task
2016-05-18 12:31:08:      Orphaned Task Scan - Requeuing orphaned task '0' for LAPRO0865: the task is in the rendering state, but the slave is no longer rendering this task
2016-05-18 12:31:08:      Orphaned Task Scan - Cleaned up 2 orphaned tasks in 1.478 s
2016-05-18 12:31:08:      Orphaned Task Scan - Done.
2016-05-18 12:31:08:  Performing Orphaned Limit Stub Scan...
2016-05-18 12:31:08:      Orphaned Limit Stub Scan - Loading limits
2016-05-18 12:31:08:      Orphaned Limit Stub Scan - Loaded 3908 limits in 81.304 ms
2016-05-18 12:31:08:      Orphaned Limit Stub Scan - Loading slave states
2016-05-18 12:31:08:      Orphaned Limit Stub Scan - Loaded 2298 slave states in 130.438 ms
2016-05-18 12:31:09:      Orphaned Limit Stub Scan - Offsetting usage count for limit 'nuke' by -2 to match holder count.
2016-05-18 12:31:09:      Orphaned Limit Stub Scan - Removing holder 'lapro3093' from limit '573ba6df37a32d36685f0670': no associated stub was found.
2016-05-18 12:31:09:      Orphaned Limit Stub Scan - Removing holder 'lapro3093' from limit '573cbf93c47c0210487f5767': no associated stub was found.
2016-05-18 12:31:09:      Orphaned Limit Stub Scan - Removing holder 'lapro0768-secondary' from limit '573cbf1f909e6623544cf96f': no associated stub was found.
2016-05-18 12:31:09:      Orphaned Limit Stub Scan - Removing holder 'lapro0610-secondary' from limit '573cc0e5021bc81c3036c1aa': no associated stub was found.
2016-05-18 12:31:09:      Orphaned Limit Stub Scan - Returned 0 orphaned limit stubs in 536.230 ms
2016-05-18 12:31:09:      Orphaned Limit Stub Scan - Done.
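
My reading of what that limit-stub scan is doing, sketched in plain Python purely as illustration (this is not Deadline’s actual code): it offsets each limit’s usage count to match its recorded holders, then drops holders that no longer have a stub backing them:

# Conceptual sketch only (plain Python, not Deadline's code) of what the
# Orphaned Limit Stub Scan above appears to do for each limit.
def reconcile_limit(limit, live_stub_holders):
    # 'limit' is a dict like {"name": "nuke", "usage_count": 5, "holders": [...]};
    # 'live_stub_holders' is the set of slave names that still hold a real stub
    # for this limit (derived from the slave states the scan loads).

    # 1. Offset the usage count so it matches the number of recorded holders.
    offset = len(limit["holders"]) - limit["usage_count"]
    if offset != 0:
        print("Offsetting usage count for limit '%s' by %+d to match holder count."
              % (limit["name"], offset))
        limit["usage_count"] += offset

    # 2. Remove holders that no longer have an associated stub.
    for holder in list(limit["holders"]):
        if holder not in live_stub_holders:
            print("Removing holder '%s' from limit '%s': no associated stub was found."
                  % (holder, limit["name"]))
            limit["holders"].remove(holder)
            limit["usage_count"] -= 1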

By the way, is this sort of thing in the Pulse log normal?

2016-05-18 12:24:51:  Cannot send notification because the email address list is empty.
2016-05-18 12:24:51:  Slave 'LAPRO0934-secondary' has stalled because it has not updated its state in 10.836 m. Performing house cleaning...
2016-05-18 12:24:51:  Could not find associated job class though.
2016-05-18 12:24:52:  Cannot send notification because the email address list is empty.
2016-05-18 12:24:52:  Slave 'LAPRO1867-secondary' has stalled because it has not updated its state in 10.700 m. Performing house cleaning...
2016-05-18 12:24:52:  Could not find associated job class though.
2016-05-18 12:24:52:  Cannot send notification because the email address list is empty.
2016-05-18 12:24:52:  Slave 'LAPRO1831-secondary' has stalled because it has not updated its state in 10.904 m. Performing house cleaning...
2016-05-18 12:24:52:  Could not find associated job class though.
2016-05-18 12:24:52:  Cannot send notification because the email address list is empty.
2016-05-18 12:24:52:  Slave 'LAPRO0522' has stalled because it has not updated its state in 11.069 m. Performing house cleaning...

I was wrong about “we get maybe 10 a day”. We currently have 97 slaves stuck in this state. :-\

If I click “Close” on the slave, it starts printing to the log again:

2016-05-19 10:49:15:  BEGIN - LAPRO1861\scanlinevfx
2016-05-19 10:49:15:  Slave - slave shutdown: normal
2016-05-19 10:49:15:  Info Thread - requesting slave info thread quit.
2016-05-19 10:49:15:  0: Shutdown
2016-05-19 10:49:15:  Info Thread - shutdown complete
2016-05-19 10:49:15:  Scheduler Thread - Seconds before next job scan: 2
2016-05-19 10:49:16:  Scheduler - Returning all limit stubs.
2016-05-19 10:49:16:  Scheduler -   returning mx_128gb
2016-05-19 10:49:16:  Scheduler -   returning vray3
2016-05-19 10:49:16:  Scheduler -   returning 573bb82bdbebb8343442161d

But then it freezes and does not actually quit.

Hey Laszlo, we’re looking into these Sandbox errors. In the meantime, you may want to disable sandboxing for your slaves and see if that prevents the errors you’re seeing.

Do you think this is related to sandboxing?

We could try turning it off, but in our joy that we finally don’t have to litter all our scripts with “reload(modulename)” calls, we have actually removed some of that code to clean up our code base. Somewhat prematurely, maybe, hah.
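
For context, this is the sort of boilerplate we used to sprinkle through our scripts so that edits to shared modules were picked up without restarting anything (Python 2 style reload, as quoted above; the module name is just a stand-in):

# Example of the reload() boilerplate we used to need; "scanline_utils" is a
# stand-in for one of our shared studio modules.
import scanline_utils
reload(scanline_utils)   # force a re-import so in-place edits take effect immediately

scanline_utils.do_setup()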

I think it may be, along with forums.thinkboxsoftware.com/vie … hp?p=65230. It’s probably worth trying. We will look into fixing the Sandbox error, and also into handling Sandbox failures better.

Do the slaves need to be restarted after this setting is applied?

Yes they do.
