
2 machines on the same job, even though it has a single machine limit

This comes up once in a while with our sim jobs, which can only have a single machine processing them at a time:

Looking at the logs reveals that the first machine (lapro1810) had already crashed at 13:15 when the next machine picked up the next frame at 13:18. But it seems like the lapro1810 slave also stalled out, somehow leaving the task still assigned to itself (while already having given back the limit token? Otherwise how did lapro0841 pick up the next frame?).

This causes 2 problems:

  • Confused artists: they think 2 machines are working on their sim job, causing simming conflicts
  • Out-of-order frame assignment, since the first frame actually assigned will now be task 1 instead of task 0…

The last lines of the slave log of 1810:

2016-09-20 13:15:00:  Network log file does not exist: %TEMP%\VRayLog.txt)
2016-09-20 13:15:00:     at Deadline.Plugins.PluginWrapper.RenderTasks(String taskId, Int32 startFrame, Int32 endFrame, String& outMessage, AbortLevel& abortLevel)
2016-09-20 13:15:00:  RenderPluginException.Cause: JobError (2)
2016-09-20 13:15:00:  RenderPluginException.Level: Major (1)
2016-09-20 13:15:00:  RenderPluginException.HasSlaveLog: True
2016-09-20 13:15:00:  RenderPluginException.SlaveLogFileName: C:\ProgramData\Thinkbox\Deadline8\logs\deadlineslave_renderthread_0-LAPRO1810-0000.log
2016-09-20 13:15:00:  Exception.Data: ( )
2016-09-20 13:15:00:  Exception.TargetSite: Void RenderTask(System.String, Int32, Int32)
2016-09-20 13:15:00:  Exception.Source: deadline
2016-09-20 13:15:00:  Exception.HResult: -2146233088
2016-09-20 13:15:00:    Exception.StackTrace: 
2016-09-20 13:15:00:     at Deadline.Plugins.Plugin.RenderTask(String taskId, Int32 startFrame, Int32 endFrame)
2016-09-20 13:15:00:     at Deadline.Slaves.SlaveRenderThread.a(TaskLogWriter A_0)
2016-09-20 13:15:00:  <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2016-09-20 13:15:00:  Scheduler Thread - Seconds before next job scan: 2
2016-09-20 13:15:02:  Scheduler - Performing Job scan on Primary Pools with scheduling order Pool, Weighted, Balanced
2016-09-20 13:15:02:  Scheduler - Using enhanced scheduler balancing
2016-09-20 13:15:02:  Scheduler - Preliminary check: The 57e1946bcb2bf34f80b99c64 limit is maxed out.
2016-09-20 13:15:02:  Scheduler - Preliminary check: The 57e19493cb2bf34738d44b2a limit is maxed out.
2016-09-20 13:15:02:  Scheduler - Preliminary check: The 57e1948ccb2bf35594f91222 limit is maxed out.
2016-09-20 13:15:02:  Scheduler - Preliminary check: The 57dc9027cb2bf31dbc853abc limit is maxed out.
2016-09-20 13:15:02:  Scheduler - Preliminary check: The 57dc9079cb2bf31750311738 limit is maxed out.
2016-09-20 13:15:02:  Scheduler - Preliminary check: The 57bb3e6815314a2f70dafc6b limit is maxed out.
2016-09-20 13:15:03:  Scheduler - Preliminary check: The 57e198890294999a70b28bda limit is maxed out.
2016-09-20 13:15:03:  Scheduler - Successfully dequeued 1 task(s).  Returning.
2016-09-20 13:15:03:  0: Shutdown
2016-09-20 13:15:03:  0: Exited ThreadMain(), cleaning up...
2016-09-20 13:15:04:  Scheduler - Returning limit stubs not in use.
2016-09-20 13:15:04:  Scheduler -   returning mx_render
2016-09-20 13:15:04:  Scheduler -   returning 57e180800294998da423e5d9
2016-09-20 13:15:04:  Scheduler -   returning mx_flowlinerender
2016-09-20 13:15:04:  Scheduler -   returning vray3
2016-09-20 13:15:04:  Scheduler Thread - Unexpected Error Occurred
2016-09-20 13:15:04:  >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2016-09-20 13:15:04:  Exception Details
2016-09-20 13:15:04:  InvalidProgramException -- Sandbox Process is not available, may be disposed already.
2016-09-20 13:15:04:  Exception.Data: ( )
2016-09-20 13:15:04:  Exception.TargetSite: Void OnSlaveStartingJob(System.String, Deadline.Jobs.Job, Deadline.Controllers.DataController)
2016-09-20 13:15:04:  Exception.Source: deadline
2016-09-20 13:15:04:  Exception.HResult: -2146233030
2016-09-20 13:15:04:    Exception.StackTrace: 
2016-09-20 13:15:04:     at Deadline.Events.SandboxedEventManager.OnSlaveStartingJob(String slaveName, Job job, DataController dataController)
2016-09-20 13:15:04:     at Deadline.Slaves.SlaveSchedulerThread.a(SlaveStatus slaveStatus)
2016-09-20 13:15:04:     at Deadline.Slaves.SlaveSchedulerThread.f()
2016-09-20 13:15:04:     at Deadline.Slaves.SlaveSchedulerThread.h()
2016-09-20 13:15:04:  <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2016-09-20 13:15:04:  Scheduler Thread - Unexpected Error Occurred While Handling Exception (task requeue)
2016-09-20 13:15:04:  >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2016-09-20 13:15:04:  Exception Details
2016-09-20 13:15:04:  InvalidProgramException -- Sandbox Process is not available, may be disposed already.
2016-09-20 13:15:04:  Exception.Data: ( )
2016-09-20 13:15:04:  Exception.TargetSite: Void OnJobError(Deadline.Jobs.Job, System.String[], Deadline.Jobs.Task, Deadline.Reports.Report, Deadline.Controllers.DataController)
2016-09-20 13:15:04:  Exception.Source: deadline
2016-09-20 13:15:04:  Exception.HResult: -2146233030
2016-09-20 13:15:04:    Exception.StackTrace: 
2016-09-20 13:15:04:     at Deadline.Events.SandboxedEventManager.OnJobError(Job job, String[] auxiliaryFilenames, Task task, Report errorReport, DataController dataController)
2016-09-20 13:15:04:     at Deadline.Controllers.DataController.ReportError(Exception e, Slave slave, IPlugin plugin, Job job, Task task, TimeSpan taskTimeElapsed)
2016-09-20 13:15:04:     at Deadline.Slaves.SlaveSchedulerThread.a(Exception A_0)
2016-09-20 13:15:04:  <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2016-09-20 13:15:04:  Scheduler Thread - exception occurred:
2016-09-20 13:15:04:  Scheduler Thread - Unexpected Error Occurred
2016-09-20 13:15:04:  >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2016-09-20 13:15:04:  Exception Details
2016-09-20 13:15:04:  InvalidProgramException -- Sandbox Process is not available, may be disposed already.
2016-09-20 13:15:04:  Exception.Data: ( )
2016-09-20 13:15:04:  Exception.TargetSite: Void OnJobError(Deadline.Jobs.Job, System.String[], Deadline.Jobs.Task, Deadline.Reports.Report, Deadline.Controllers.DataController)
2016-09-20 13:15:04:  Exception.Source: deadline
2016-09-20 13:15:04:  Exception.HResult: -2146233030
2016-09-20 13:15:04:    Exception.StackTrace: 
2016-09-20 13:15:04:     at Deadline.Events.SandboxedEventManager.OnJobError(Job job, String[] auxiliaryFilenames, Task task, Report errorReport, DataController dataController)
2016-09-20 13:15:04:     at Deadline.Controllers.DataController.ReportError(Exception e, Slave slave, IPlugin plugin, Job job, Task task, TimeSpan taskTimeElapsed)
2016-09-20 13:15:04:     at Deadline.Slaves.SlaveSchedulerThread.a(Exception A_0)
2016-09-20 13:15:04:     at Deadline.Slaves.SlaveSchedulerThread.h()
2016-09-20 13:15:04:  <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2016-09-20 13:15:04:  Scheduler Thread - Unexpected Error Occurred While Handling Exception (task requeue)
2016-09-20 13:15:04:  >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2016-09-20 13:15:04:  Exception Details
2016-09-20 13:15:04:  InvalidProgramException -- Sandbox Process is not available, may be disposed already.
2016-09-20 13:15:04:  Exception.Data: ( )
2016-09-20 13:15:04:  Exception.TargetSite: Void OnJobError(Deadline.Jobs.Job, System.String[], Deadline.Jobs.Task, Deadline.Reports.Report, Deadline.Controllers.DataController)
2016-09-20 13:15:04:  Exception.Source: deadline
2016-09-20 13:15:04:  Exception.HResult: -2146233030
2016-09-20 13:15:04:    Exception.StackTrace: 
2016-09-20 13:15:04:     at Deadline.Events.SandboxedEventManager.OnJobError(Job job, String[] auxiliaryFilenames, Task task, Report errorReport, DataController dataController)
2016-09-20 13:15:04:     at Deadline.Controllers.DataController.ReportError(Exception e, Slave slave, IPlugin plugin, Job job, Task task, TimeSpan taskTimeElapsed)
2016-09-20 13:15:04:     at Deadline.Slaves.SlaveSchedulerThread.a(Exception A_0)
2016-09-20 13:15:04:  <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2016-09-20 13:15:04:  An unhandled exception occurred: Sandbox Process is not available, may be disposed already. (System.InvalidProgramException)
2016-09-20 13:15:04:     at Deadline.Events.SandboxedEventManager.OnJobError(Job job, String[] auxiliaryFilenames, Task task, Report errorReport, DataController dataController)
2016-09-20 13:15:04:     at Deadline.Controllers.DataController.ReportError(Exception e, Slave slave, IPlugin plugin, Job job, Task task, TimeSpan taskTimeElapsed)
2016-09-20 13:15:04:     at Deadline.Slaves.SlaveSchedulerThread.a(Exception A_0)
2016-09-20 13:15:04:     at Deadline.Slaves.SlaveSchedulerThread.h()
2016-09-20 13:15:04:     at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
2016-09-20 13:15:04:     at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
2016-09-20 13:15:04:     at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
2016-09-20 13:15:04:     at System.Threading.ThreadHelper.ThreadStart()

After that, the slave went down. Our slave restart mechanism restarted it 3 minutes later:

2016-09-20 13:18:48:  BEGIN - LAPRO1810\scanlinevfx
2016-09-20 13:18:48:  Deadline Slave 8.0 [v8.0.5.1 Release (846388743)]
2016-09-20 13:18:51:  Auto Configuration: A ruleset has been received from Pulse
2016-09-20 13:18:52:  Plugin sandboxing will not be used because it is disabled in the Repository Options.
2016-09-20 13:18:52:  Info Thread - Created.
2016-09-20 13:18:52:  Slave 'LAPRO1810' has stalled because of an unexpected exit. Performing house cleaning...
2016-09-20 13:18:52:  Could not find associated job for this slave.
2016-09-20 13:18:53:  Cannot send notification because the email address list is empty.

lapro0841 picks up the next frame a few seconds later.

2016-09-20 13:18:57:  BEGIN - LAPRO0841\scanlinevfx
2016-09-20 13:18:57:  0: Plugin will be reloaded: Auxiliary files for the current job have been modified.
2016-09-20 13:18:57:  0: Loaded plugin 3dsmax (\\inferno2\deadline\repository8\plugins\3dsmax)
2016-09-20 13:18:57:  0: INFO: Executing plugin script 'C:\Users\scanlinevfx\AppData\Local\Thinkbox\Deadline8\slave\lapro0841\plugins\57e1988d029499946841df5f\3dsmax.py'
2016-09-20 13:18:57:  0: INFO: About: 3dsmax Plugin for Deadline
2016-09-20 13:18:57:  0: INFO: The job's environment will be merged with the current environment before rendering
2016-09-20 13:18:57:  0: Start Job timeout is disabled.
2016-09-20 13:18:57:  0: Task timeout is 24000 seconds (Regular Task Timeout)
2016-09-20 13:18:57:  0: Loaded job: [MEG] MM_015_0000_v0132_str_UpdatedSim_cache_flowline_HappyOcean_4 (57e1988d029499946841df5f)
2016-09-20 13:18:57:  0: INFO: Executing job preload script 'C:\Users\scanlinevfx\AppData\Local\Thinkbox\Deadline8\slave\lapro0841\plugins\57e1988d029499946841df5f\JobPreLoad.py'
2016-09-20 13:18:57:  0: INFO: JobPreLoad.__main__ version: 1.59

But why isn't lapro0841 picking up task 0? Is the housecleaning somehow giving back the job token first, and only then the task token?

The problem seems to be with this line in the stalled Slave’s log:

2016-09-20 13:18:52:  Could not find associated job for this slave.

Since the slave cannot find the job (the Slave must not have saved the assigned Job ID or Task ID before crashing), the Task becomes orphaned. Normally, if the Slave knows which Job it was working on when it died, it will attempt to resume that Job.

Repository Repair should eventually detect this problem. The issue is not that Limits are not being respected; it’s that the Task State was not updated when the slave crashed. Therefore, when a new Slave picks up that Job, the first Task still thinks it’s being rendered by the original Slave, even though the original Slave is not rendering it and does not think it is. Repository Repair will detect that a Task has become orphaned once it has been rendering for at least 5 minutes.
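
To illustrate the check Repository Repair is doing, here is a rough sketch of the logic (made-up task/slave objects, not the actual implementation):

from datetime import datetime, timedelta

ORPHANED_TASK_THRESHOLD = timedelta(minutes=5)  # the "at least 5 minutes" rule

def find_orphaned_tasks(tasks, slaves):
    # A task is orphaned when it is still marked Rendering and assigned to a
    # Slave, but that Slave is not actually rendering it any more (crashed,
    # stalled, or moved on) and the task has been "rendering" long enough.
    orphaned = []
    for task in tasks:
        if task.status != "Rendering":
            continue
        slave = slaves.get(task.assigned_slave)
        slave_really_on_it = (
            slave is not None
            and slave.is_rendering
            and slave.current_task_id == task.id
        )
        old_enough = datetime.utcnow() - task.start_time >= ORPHANED_TASK_THRESHOLD
        if not slave_really_on_it and old_enough:
            orphaned.append(task)  # these get requeued by housecleaning
    return orphaned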

Is this an issue you see often? It may help to increase how often a Slave updates its information by decreasing the interval found in the Monitor at “Tools->Configure Repository Options->Performance Settings->Slave Intervals->Number of seconds between Slave information updates”.

Shouldn’t the slave’s own housecleaning upon restart clean this up, though?
It seems like it resets the job’s limit properly, because the next task is only picked up after the first slave restarts and does its housecleaning, but the task remains assigned there…

We previously had another issue where the limits were in fact not respected and 2 machines would sometimes pick up the same job; I believe that was fixed in one of the point releases. This new issue comes up often enough that people put in multiple reports about it, thinking it’s still the same issue. It is confusing, and also incorrect behavior, since it starts the sim out of order…

Normally, yes, it should. However, this line from the stalled Slave’s log:

2016-09-20 13:18:52:  Could not find associated job for this slave.

shows that the Slave does not know which Job it was working on when it crashed, so it cannot know to correct that Job’s counts. It also does not know which Task it was working on, which would prevent it from knowing which Task to adjust even if it did know the Job. The limits are getting properly cleaned up because the Slave simply returns all of its limit stubs when it comes back to life.
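
In other words, the startup housecleaning effectively does the following (again just a sketch, with a made-up "repo" object standing in for the repository):

def startup_housecleaning(slave, repo):
    # Returning limit stubs does not require knowing the Job: the Slave simply
    # hands back every stub it is still holding, which is why the limit frees up.
    for stub in slave.held_limit_stubs:
        repo.return_limit_stub(stub)

    # Requeuing the orphaned Task does require knowing what the Slave was doing.
    # If the Slave never recorded the Job/Task before it crashed, this is where
    # it gives up ("Could not find associated job for this slave").
    if slave.last_job_id is None or slave.last_task_id is None:
        print("Could not find associated job for this slave.")
        return

    repo.requeue_task(slave.last_job_id, slave.last_task_id)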

Also, as I mentioned, adjusting how often the Slaves report their information may help with this problem, since the Slave would be more likely to have recorded which Job it was working on before it crashed.

Is “Enforce Sequential Rendering” set for that Job? That would prevent the next Task from being picked up by another Slave until the previous Task was actually finished. If the order the Tasks render in is important, then that flag would definitely help.
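
(If you submit these jobs through scripts, I believe this corresponds to the Sequential key in the job info file; a trimmed-down example with placeholder values:)

# Sketch of a scripted submission with sequential rendering turned on.
# Job name, frame range and file names are placeholders.
job_info_lines = [
    "Plugin=3dsmax",
    "Name=MM_015_sim_example",
    "Frames=0-100",
    "MachineLimit=1",
    "Sequential=True",  # next Task is not handed out until the previous one finishes
]

with open("example_job_info.job", "w") as f:
    f.write("\n".join(job_info_lines) + "\n")

# Submitted alongside a plugin info file, e.g.:
#   deadlinecommand example_job_info.job example_plugin_info.job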

Doesn’t the limit stub that it’s returning have the job ID in it? I thought it did… all the stubs are essentially job IDs like “57e1988d029499946841df5f”.

This feels like it would just reduce the likelihood of this happening, but not eliminate it. Correct?

Yes, it’s turned on; all our flowline sims need that setting to be turned on.

Yes, you are correct, it does. We will look into using that to properly free up the orphaned task when the slave starts back up.
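
Extending the earlier sketch, the housecleaning could walk the stubs it is already returning and requeue anything still assigned to the Slave (again, the "repo" object is made up):

def startup_housecleaning_using_stubs(slave, repo):
    for stub in slave.held_limit_stubs:
        repo.return_limit_stub(stub)

        # Job-limit stubs are keyed by job ID (e.g. "57e1988d029499946841df5f"),
        # so even when the Slave lost track of what it was rendering, the stub
        # tells us which Job to inspect. Stubs for plain limit groups such as
        # "vray3" or "mx_render" will not resolve to a Job and are skipped.
        job = repo.find_job(stub.name)
        if job is None:
            continue

        for task in repo.get_tasks(job):
            # A Task still marked as rendering on *this* Slave is the orphan
            # left behind by the crash; requeue it right away instead of
            # waiting for Repository Repair to notice it.
            if task.status == "Rendering" and task.assigned_slave == slave.name:
                repo.requeue_task(job.id, task.id)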

It turns out that one of the first things the Slave does is report that information, so changing the interval should not actually matter at all. The fact that the Slave dies completely when the sandbox goes down is problematic; we will look into addressing that as well.

As for the sequential rendering, that should definitely prevent this behavior, so there may be something wrong with that as well.

Hi Ryan,

I’ve noticed this in the 8.0.10 release notes:

  • When the slave starts back up after stalling, it now tries harder to find what task it was previously working on.

Sounds like a fix for this issue, am I correct? :slight_smile:

cheers
l

Should be fixed now, yes :slight_smile:

Awesome, thanks! :slight_smile:
