Render nodes won't pick new tasks - error in scheduler thread

Jerome_Samson · January 30, 2024, 10:53pm

Hello,

We’re having some trouble since few days with our render nodes. Some of them will regularly display an error message in their log (see below) and won’t be able to start any new task.

It can last for few hours ou maybe few days (it’s hard to monitor) but the machine stay in Idle state rather than turning in to Stalled.

2024-01-30 23:47:39:  ERROR: Scheduler Thread - Unexpected Error Occurred
2024-01-30 23:47:39:  Scheduler Thread - Object reference not set to an instance of an object. (System.NullReferenceException)
2024-01-30 23:48:14:  ERROR: Scheduler Thread - Unexpected Error Occurred
2024-01-30 23:48:14:  Scheduler Thread - Object reference not set to an instance of an object. (System.NullReferenceException)

Restarting or even reinstalling the deadline slave service won’t solve the issue.
Eventually some nodes will be able to pick up a new task and will render fine again until it breaks again.

Overall we’re loosing a huge amount of rendering cycles, any pointers to help solve this issue ?
Any ideas how we could get more details on this, maybe raising the log level on the nodes ? or checking some deadline internals ?

Thanks for any help

Jerome_Samson · January 30, 2024, 11:15pm

After enabling the Verbose logging option for the workers, I get the following additionnal details:

2024-01-31 00:08:22:  Scheduler - Scheduler - Acquired limit '65b7af4d04342f53d7551727'
2024-01-31 00:08:22:  Scheduler - Job scan acquired 0 tasks in 0s after evaluating 134 different jobs.
2024-01-31 00:08:22:  Scheduler - Limits held after Job scan: []
2024-01-31 00:08:22:  ERROR: Scheduler Thread - Unexpected Error Occurred
2024-01-31 00:08:22:  >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2024-01-31 00:08:22:  Exception Details
2024-01-31 00:08:22:  NullReferenceException -- Object reference not set to an instance of an object.
2024-01-31 00:08:22:  Exception.TargetSite: Deadline.Jobs.Task DequeueNextTask(System.String, Int32, System.String, System.DateTime, Boolean ByRef, Boolean ByRef)
2024-01-31 00:08:22:  Exception.Data: ( )
2024-01-31 00:08:22:  Exception.Source: deadline
2024-01-31 00:08:22:  Exception.HResult: -2147467261
2024-01-31 00:08:22:    Exception.StackTrace: 
2024-01-31 00:08:22:     at Deadline.StorageDB.MongoDB.MongoJobStorage.DequeueNextTask(String jobID, Int32 preferredTaskID, String slaveName, DateTime networkDateTime, Boolean& triggerStartEvent, Boolean& triggerFailedEvent)
2024-01-31 00:08:22:     at Deadline.Controllers.JobQueueController.DequeueNextTask(String jobID, Int32 preferredTaskID, String slaveName, DateTime networkDateTime, Boolean& triggerStartEvent, Boolean& triggerFailedEvent)
2024-01-31 00:08:22:     at Deadline.Controllers.DataController.AssignTasks(TaskAssignmentParameters parameters, DeadlineNetworkSettings networkSettings, SlaveSettings workerSettings, Action`2 loggingDelegate)
2024-01-31 00:08:22:     at Deadline.Controllers.DataController.DequeueTasks(SlaveState slaveState, SlaveSettings slaveSettings, Boolean verboseLogging)
2024-01-31 00:08:22:     at Deadline.Scheduling.SchedulerUtils.DequeueTasks(DataController dataController, SlaveState slaveState, SlaveSettings slaveSettings, Boolean verbose)
2024-01-31 00:08:22:     at Deadline.Slaves.SlaveSchedulerThread.DequeueMaxTasks()
2024-01-31 00:08:22:     at Deadline.Slaves.SlaveSchedulerThread.ThreadMain()
2024-01-31 00:08:22:  <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

mois.moshev · January 31, 2024, 8:19am

I think there might be additional logs. It’s hard to say anything about this info.
Could you try recreating your limits? Not sure if related, but the limit is acquired, and then no limits are held, which may be suspicious.

Jerome_Samson · February 1, 2024, 4:13pm

We finally found out our issue.
We have a workflow where we submit 3 jobs for one render sequence. Each one treats a part of the whole sequence but will be scheduled differently from the others.

The error is that we actually duplicate a job after submission and then modify its 2 “copies”.
We fail to properly copy the post job task which is created by deadline at submission BUT we kept the post job script attribute and the total number of tasks (which also counts the post task).
Thus the job representation in DB is incorrect.

Workers eventually identify this job to render, they start dequeuing tasks but when reaching the very last task, the scheduler thread raises an error as the index of the “last” task actually doesn’t return anything from the DB for this job.
Unfortunately this errors isn’t raised as a task error or a worker error.
After some time, all workers will try to pick up this job and dequeue the last task and fail indefinitely…

While we shouldn’t break the integrity of the database when submitting jobs, I believe an error showing up in the worker scheduler thread should not go unnoticed like this.
Could this be detected as a failure related to the task and trigger a failure of the job ?

mois.moshev · February 1, 2024, 9:55pm

Yeah pre/post scripts are tricky. I don’t believe you can make it count as a normal failure.
As I said there might be more logs e.g. on Windows $ProgramData/Thinkbox/Deadline10/Logs. There are deadlineslave* logs and also something like render-thread*.

eamsler · February 8, 2024, 7:35pm

Woah! Does this kill the SchedulerThread entirely?! That’s both neat and would maybe explain a long standing “Zombie Worker” problem where it’ll happily stay alive but never dequeue anything ever again.

I actually just reported a bug where if you set a job property to “None” it could crash during an equality check in job.JobProperties (a property which programmers should never need to see). It basically dies within Deadline.Jobs.Job.Equals().

I’m going to throw this on that ticket.

Last time this was reported it went quiet so thanks for not only figuring it out by reported on it! I broke things by setting job.ExtraInfo0 = None but I do not recommend that at all.

Slave STDOUT handler