Slave failure detection

I have a machine that’s not being blacklisted after failing 20 times (my setting) on a job. This worked in the past, so I’m unsure what is going on now. None of my settings have changed. The tasks keep failing on this slave, and Deadline keeps queueing tasks to it until the job gets marked as failed. Are there any bug reports for this?

The tasks failed because the proper version of our software wasn’t installed on the slave, so they failed pretty quickly.

Thanks
slavefailureDetection2.JPG

Hello James,

Is the slave ever being marked bad? With your settings it should definitely not be staying alive after 20 errors.

No, it’s never marked as bad. This used to work but I’m not sure when it stopped working.

Hello James,

Can you verify the full Deadline version you are using? Can you also send over a full day’s slave log, either here or to the support email? Thanks.

There was a bug back in the Deadline 6 days where we weren’t handling that properly.

I believe we calculate the ‘errors in a row’ based on the job report data.

Just for fun, what version of Deadline are you using? I might go digging into the source code there.
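
To illustrate the ‘errors in a row’ idea mentioned above, here is a minimal Python sketch. It is not Deadline’s actual implementation: the `Report` record, the `consecutive_errors` and `should_mark_bad` helpers, and the assumption that reports arrive ordered oldest to newest are all hypothetical, and the limit of 20 is simply the setting quoted in this thread.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Report:
    """Hypothetical stand-in for one job report entry."""
    slave: str
    is_error: bool


def consecutive_errors(reports: Iterable[Report], slave: str) -> int:
    """Count the current run of errors for one slave.

    Assumes reports are ordered oldest to newest. Any successful
    report from the slave resets the streak to zero.
    """
    streak = 0
    for report in reports:
        if report.slave != slave:
            continue  # reports from other slaves do not affect the streak
        streak = streak + 1 if report.is_error else 0
    return streak


def should_mark_bad(reports: Iterable[Report], slave: str, limit: int = 20) -> bool:
    """Return True once the slave has reached `limit` errors in a row."""
    return consecutive_errors(reports, slave) >= limit
```

With a check like this, a slave that fails 20 tasks back to back would be flagged, while a single success in between would restart the count.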

Hi Dwight and Edwin. I’ve attached more than two days of logs from the slave that was causing the issues. The version of Deadline we’re using is:

Deadline Version: 7.2.1.10 R (b8ff445b3)
CAWS5_logs.rar (1.04 MB)

Hey James,

I noticed at the top of those logs that the version listed is 7.0.2.3, not 7.2.1.10. Is there any chance you could upgrade the node and see if this problem persists? It’s always good practice to make sure every machine in your farm is on the same version as your repo.
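
A quick way to spot this kind of mismatch across a farm is to compare each node’s reported version against the repository version. The sketch below is plain Python and does not use any Deadline API. The `parse_version` and `find_outdated` helpers are hypothetical, and it assumes you have already collected the version strings yourself, for example from the top of each slave log.

```python
from typing import Dict, List, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a dotted version such as '7.2.1.10' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def find_outdated(repo_version: str, slave_versions: Dict[str, str]) -> List[str]:
    """Return the names of slaves whose version differs from the repository's."""
    target = parse_version(repo_version)
    return sorted(
        name
        for name, version in slave_versions.items()
        if parse_version(version) != target
    )


# Example with the two versions from this thread; the node names are made up.
versions = {"node01": "7.0.2.3", "node02": "7.2.1.10"}
print(find_outdated("7.2.1.10", versions))
```

Comparing parsed tuples rather than raw strings avoids false mismatches from stray whitespace.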

Whoa! Good catch. That machine is indeed running 7.0.2.3. I’m not sure why it hasn’t updated because we’ve been on 7.2 since it was released. I have Deadline set to distribute updated versions. Is there any reason why a slave wouldn’t grab the update?

Yeah… 7.0 isn’t compatible with the auto-upgrader because of some changes we needed to make to the Repository (restructured the ‘bin’ folder).

Good to know. Thanks guys!