Hi,
We have an issue where slaves are not being added to the bad slaves list after multiple errors. This only happens with slaves that generate StartJob errors.
We have our Bad Slave detection set to 3 errors, but we’re generating many times that number and the slaves are never marked as Bad.
Deadline Version: 6.2.0.22 R
I just tested with 6.2.0.24, and I can’t reproduce this problem. A couple of questions:
- What do you have set for the Frequency setting in Repository Options -> Job Settings -> Failure Detection -> Slave Failure Detection? If this is anything but 0, a slave will reattempt bad jobs if there are no other good jobs.
- In this job’s properties, under Failure Detection, is Ignore Bad Slave Error Limit enabled? If so, that would also explain this behavior.
- Did you just recently set the limit to 3 errors? If so, it can take up to 10 minutes for those settings to propagate to the slaves.
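To make the interaction between these three settings concrete, here is a minimal sketch of the decision logic they describe. This is purely illustrative, not Deadline’s actual source; the function and parameter names are made up for this example:

```python
# Hypothetical sketch (not actual Deadline code) of how the settings above
# could interact when deciding whether to mark a slave as Bad for a job.

def should_mark_bad(error_count, error_limit, ignore_error_limit,
                    minutes_since_settings_change):
    """Return True if the slave should be added to the job's bad slave list.

    error_count: StartJob errors this slave has generated for the job
    error_limit: the repository's Bad Slave error limit (e.g. 3)
    ignore_error_limit: the job's "Ignore Bad Slave Error Limit" property
    minutes_since_settings_change: slaves can take up to ~10 minutes to
        pick up changed repository settings
    """
    if ignore_error_limit:
        # The job opts out of bad slave detection entirely.
        return False
    if minutes_since_settings_change < 10:
        # The slave may still be running with the old settings.
        return False
    return error_count >= error_limit

# Example: 5 errors against a limit of 3, settings changed an hour ago.
print(should_mark_bad(5, 3, False, 60))  # True
```

Note that the Frequency setting from the first question is a separate mechanism: even a slave that *is* marked Bad can reattempt the job if Frequency is non-zero and no other good jobs are available.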
Cheers,
Ryan
Interesting. What happens if you restart the slaves? Maybe the slaves aren’t re-loading these settings in properly while they’re already running…
I’ve restarted the slaves and they’re still not being added to the bad list.
Weird. We’re going to take a look through the code and see if we can find anything that might explain how this could happen.
Cheers,
Ryan
I think we’ve tracked this down. The fix we’re implementing will be included in the next beta release, so you’ll have to test it and let us know if you continue to see this problem.
Cheers,
Ryan
Awesome, thanks. We will let you know.
-A
Hi,
We’re on 6.2.0.26 R and still having this issue.
Thanks,
A
Hmm, we tested this quite a bit before 6.2.0.26 was released, and it seemed to be working just fine. Maybe the issue we fixed is different from the one you’re seeing. I could ask you to upgrade to the release version of 6.2, but I doubt any changes made between then and now would make a difference.
Just to confirm, are all slaves running 6.2.0.26? You can tell in the slave list in the Monitor by looking at the Version column. Also, were the jobs affected by this submitted before or after upgrading to 6.2.0.26? Not sure if it matters, but good to know nonetheless.
Finally, can you upload an example of a StartJob error message from a job that is affected by this issue?
Thanks!
Ryan
Ahh, it does look like a few of the slaves failed to update! I think that was the issue then. Thanks!