I had a long chat with Mike Owen about this yesterday, and I was hoping that we just didn’t have the latest patch for v7 installed, but…
We’re running 7.2.4.0, and we’ve been having a problem for a while now where slaves get a task assigned, sit there for a while, and then drop the task without rendering (and without generating an error) and move on to something else.
I updated the farm to 7.2.4.0 on April 21st, and it made no difference at all.
I’ve attached a log file for one of the slaves from a day that had a lot of problems.
I’ve also attached a screenshot of the slave report for that slave. It shows the problem pretty well: every time there is a FontSync, there should be a task following it, but as you can see there are sometimes 8 or 9 FontSync entries with no tasks between them.
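(To be clear about the pattern I mean, something like the quick check below would pick it up. This is just a sketch: it assumes the relevant report lines contain "FontSync" and that a following task shows up as a line containing "Task"; those markers would need adjusting to whatever the report actually prints.)

# Sketch: count runs of consecutive FontSync entries with no task in between.
# The "FontSync"/"Task" markers are assumptions about how the entries are labelled.
def count_fontsync_runs(path):
    run = 0
    runs = []
    with open(path, errors="ignore") as report:
        for line in report:
            if "FontSync" in line:
                run += 1
            elif "Task" in line and run:
                runs.append(run)
                run = 0
    if run:
        runs.append(run)
    return runs

runs = count_fontsync_runs("deadlineslave-schimpanse25-2016-05-11-0000.log")
print("FontSync runs with no task in between:", [r for r in runs if r > 1])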
This affects all of the programs we are running on the farm: Max, Nuke, C4D, After Effects.
If you need any more information let me know.
Dave
deadlineslave-schimpanse25-2016-05-11-0000.log (722 KB)
It looks like there are a lot of tasks being requeued mid-render, which is likely the cause of the issue. This probably also explains the lack of logs, because the jobs aren’t really completing or failing. Do you know why they are being put back into queued status? The job history would be helpful in seeing if someone is doing this.
They’re not being requeued mid-render; they just sit there far too long saying “Waiting to render…”, and then they are no longer assigned to the task. The render never actually starts.
As to why they are being put back into queued status, that’s exactly why I posted here. It’s definitely not someone else requeuing the tasks; this is being done by Deadline for some reason.
If you tell me exactly which log files etc. you need, I can sort them out for you tomorrow.
Cheers,
Dave
Hello David,
In several cases the tasks were requeued after only a few seconds, and I can’t think of anything in Deadline that would do this. If you can pull out the Job History (detailed at docs.thinkboxsoftware.com/produc … nd-history) for one or two jobs that had this issue and send it over, we can try to see what might have been triggering the requeues in the time frame shown in the history.
You’ve got me confused now: there is nothing related to my problem in the Job History, because the problem I’m having doesn’t create an entry there. It doesn’t show up in the Job History at all, only in the individual slave logs.
Just about every job here has the problem to some degree, but the problem you can see in the slave log has never shown up in the job history.
By the way, it may be that the tasks are sometimes being requeued after a few seconds, but that isn’t the norm. The whole “Waiting to render…” thing usually lasts anywhere between 3 and 10 minutes.
In any case, I’ve attached the logfiles for one of the jobs that had the problem as shown in the slave log in the first post.
If you need anything else, just let me know.
Thanks,
Dave
Edit: The job I’ve attached the log files for has got two requeue entries, but I did that manually because I needed the machines for something else.
Reports_Job.zip (1.39 MB)
Hi Dave,
Sorry for the confusion. Dwight is referring to the job log reports, which you posted. Thanks!
I took a look through your job log reports and everything looks just peachy there. If you haven’t already got verbose logging enabled, can you go to Monitor -> Super-User -> Tools -> Configure Repo Options -> Application Data and enable all the verbose logging checkboxes?
With the above enabled, can you wait for a job to display this issue, then navigate to a couple of the slaves that show it (remote desktop, etc.) and grab the logs from the location shown here:
docs.thinkboxsoftware.com/produc … ation-logs
If you’re using the defaults, it should be:
C:\ProgramData\Thinkbox\Deadline7\logs
Can you zip up each machine’s logs and send them over?
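(If it saves some clicking, something along these lines run on each node would do it. Just a sketch, assuming the default Deadline 7 log path and that Python is available on the machine; the output file name is only an example.)

# Sketch: zip up the local Deadline slave logs so they can be attached here.
# LOG_DIR is the default Deadline 7 location; adjust it if yours differs.
import os
import socket
import zipfile

LOG_DIR = r"C:\ProgramData\Thinkbox\Deadline7\logs"
OUT_ZIP = "deadline-logs-%s.zip" % socket.gethostname()

with zipfile.ZipFile(OUT_ZIP, "w", zipfile.ZIP_DEFLATED) as archive:
    for name in os.listdir(LOG_DIR):
        if name.lower().endswith(".log"):
            archive.write(os.path.join(LOG_DIR, name), arcname=name)

print("Wrote", OUT_ZIP)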
Just a stab in the dark, but could you confirm the operating system being used on the MongoDB machine and the Deadline Repository machine (if they happen to be on different machines)?
Thanks!
Mike
OK, verbose logging is now on.
As soon as I have a good example of the error, I’ll post it here.
Cheers,
Dave
Hi Mike,
I finally had time to get a few log files sorted out.
The problem wasn’t so bad the past few weeks, but this week it’s been really bad.
The log files go from 11 AM to 4 PM today, and each of them has around 20 re-queues.
You asked about the OS: Mongo and the Repository are on the same machine, and it’s currently running Windows Server 2008 R2.
An update to Windows Server 2016 is on the cards, but I’ve got no idea when it’s actually going to happen.
Let me know if you need any more information,
Dave
deadlineslave-schimpanse55-2016-06-02-0003.zip (92.7 KB)
deadlineslave-schimpanse30-2016-06-02-0002.zip (88.3 KB)
Hi Dave,
Thanks for the logs! The OS versions are fine; no need for an update to resolve this issue.
Some initial questions that come to mind:
- Is every Deadline slave (and Pulse) running the same Deadline version? A slight version mismatch might cause odd issues here. The “Slave” panel and “Pulse” panel both have a “Version” column. Is it at all possible you have an old Deadline 6 Pulse running somewhere on the network, maybe as a service that everyone forgot about until now?
- Are there any stalled slave reports for the slaves that this happens on?
- Just checking, since I did see a few manual requeues by users in the logs you provided: all the other requeues (~20, split between the 2 reports) were definitely not carried out by a human?
Thanks for taking a look, Mike.
Good call on the Pulse version; our Pulse was still running 7.2.0.18.
That has now been updated, and everything is on 7.2.4.0.
There are no stalled slave reports; Deadline doesn’t seem to register anything as being wrong when this happens.
I’m certain that they are not being requeued by users.
Some of the requeues in the logs were done by hand only because the jobs would never have been rendered in time otherwise.
Even if there are just 10 requeues per log file once you subtract the manual ones, that only covers 5 hours on two machines. Scale that up to 24 hours on 80 machines and the constant requeuing adds up to a huge amount of wasted time.
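(Back-of-envelope only, using the 3–10 minutes the “Waiting to render…” state usually lasts here and assuming roughly 10 unexplained requeues per slave per 5-hour window:)

# Rough estimate of machine time wasted by tasks stuck in "Waiting to render...".
# Assumptions: ~10 unexplained requeues per slave per 5 hours, 3-10 minutes lost each.
requeues_per_slave_per_hour = 10 / 5.0
slaves = 80
hours = 24
for minutes_lost in (3, 10):
    wasted = requeues_per_slave_per_hour * slaves * hours * minutes_lost / 60.0
    print("At %d min per requeue: ~%.0f machine-hours wasted per day" % (minutes_lost, wasted))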
Now that Pulse has been updated I’ll keep an eye on things to see if that was the problem.
Thanks,
Dave
Awesome! Cheers Dave. Yeah, my money’s on the old version of Pulse causing the trouble here.
Let us know how it goes in a while sir!
It looks like this was down to the old version of Pulse running.
We haven’t been rendering that much recently, but the problem seems to have stopped.
Thanks for helping out!
Dave
Awesome! Thanks for the feedback Dave. Happy rendering!