Hi,
We have an issue where slaves are not being added to the bad slaves list after multiple errors. This only happens with slaves that generate StartJob errors.
We have our Bad Slave detection set to 3 errors, but we’re generating many times that number and the slaves are never marked as Bad.
Deadline Version: 6.2.0.22 R
I just tested with 6.2.0.24, and I can’t reproduce this problem. A couple of questions:
- What do you have set for the Frequency setting in Repository Options -> Job Settings -> Failure Detection -> Slave Failure Detection? If this is anything but 0, a slave will reattempt bad jobs if there are no other good jobs.
- In this job’s properties, under Failure Detection, is Ignore Bad Slave Error Limit enabled? If so, that would also explain this behavior.
- Did you just recently set the limit to 3 errors? If so, it can take up to 10 minutes for those settings to propagate to the slaves.
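To make the interaction between these three settings concrete, here is a minimal sketch of the decision logic they describe. This is purely illustrative, not Deadline’s actual source; the function and parameter names are made up for this example:

```python
# Hypothetical sketch (not actual Deadline code) of how the settings above
# could interact when deciding whether to mark a slave as Bad for a job.

def should_mark_bad(error_count, error_limit, ignore_error_limit,
                    minutes_since_settings_change):
    """Return True if the slave should be added to the job's bad slave list.

    error_count: StartJob errors this slave has generated for the job
    error_limit: the repository's Bad Slave error limit (e.g. 3)
    ignore_error_limit: the job's "Ignore Bad Slave Error Limit" property
    minutes_since_settings_change: slaves can take up to ~10 minutes to
        pick up changed repository settings
    """
    if ignore_error_limit:
        # The job opts out of bad slave detection entirely.
        return False
    if minutes_since_settings_change < 10:
        # The slave may still be running with the old settings.
        return False
    return error_count >= error_limit

# Example: 5 errors against a limit of 3, settings changed an hour ago.
print(should_mark_bad(5, 3, False, 60))  # True
```

Note that the Frequency setting from the first question is a separate mechanism: even a slave that *is* marked Bad can reattempt the job if Frequency is non-zero and no other good jobs are available.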
Cheers,
Ryan
Interesting. What happens if you restart the slaves? Maybe the slaves aren’t re-loading these settings in properly while they’re already running…
I’ve restarted the slaves and they’re still not being added to the bad list.
Weird. We’re going to take a look through the code and see if we can find anything that might explain how this could happen.
Cheers,
Ryan
I think we’ve tracked this down. The fix we’re implementing will be included in the next beta release, so you’ll have to test it and let us know if you continue to see this problem.
Cheers,
Ryan
Awesome, thanks. We will let you know.
-A
Hi,
We’re on 6.2.0.26 R and still having this issue.
Thanks,
A
Hmm, we tested this quite a bit before 6.2.0.26 was released, and it seemed to be working just fine. Maybe the issue we fixed is different from the one you’re seeing. I could ask you to upgrade to the release version of 6.2, but I doubt any changes made between then and now would make a difference.
Just to confirm, are all slaves running 6.2.0.26? You can tell in the slave list in the Monitor by looking at the Version column. Also, were the jobs affected by this submitted before or after upgrading to 6.2.0.26? Not sure if it matters, but good to know nonetheless.
Finally, can you upload an example of a StartJob error message from a job that is affected by this issue?
Thanks!
Ryan
Ahh, it does look like a few of the slaves failed to update! I think that was the issue then. Thanks!