StartJob Errors not adding to Bad Slaves

Hi,

We have an issue where slaves are not being added to the Bad Slaves list after multiple errors. This only happens on slaves that generate StartJob errors.

We have our Bad Slave detection limit set to 3 errors, but we’re generating many times that number of errors and the slaves are never marked as Bad.

Deadline Version: 6.2.0.22 R

[Attachment: 2014-04-24_11_01_56-Job Reports - Errors.png (screenshot of the Job Reports - Errors panel)]

I just tested with 6.2.0.24, and I can’t reproduce this problem. Couple of questions:

  1. What do you have set for the Frequency setting in Repository Options -> Job Settings -> Failure Detection -> Slave Failure Detection? If this is anything but 0, a slave will reattempt bad jobs if there are no other good jobs (see the sketch after this list for how these settings interact).
  2. In this job’s properties, under Failure Detection, is Ignore Bad Slave Error Limit enabled? If so, that would also explain this behavior.
  3. Did you just recently set the limit to 3 errors? If so, it can take up to 10 minutes for those settings to propagate to the slaves.
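
To make it easier to see how these settings interact, here is a minimal Python sketch of the behaviour described above. This is not Deadline’s actual code; the names (JobFailureModel, report_error, may_pick_up) are invented for illustration, and the real Slave/Monitor logic is more involved.

    # Illustrative sketch only -- not Deadline's implementation.
    from collections import defaultdict

    class JobFailureModel:
        def __init__(self, bad_slave_error_limit=3,
                     ignore_bad_slave_error_limit=False,
                     reattempt_frequency=0):
            self.bad_slave_error_limit = bad_slave_error_limit                  # repository-wide limit
            self.ignore_bad_slave_error_limit = ignore_bad_slave_error_limit    # per-job override
            self.reattempt_frequency = reattempt_frequency                      # 0 = bad slaves never reattempt
            self.errors_per_slave = defaultdict(int)
            self.bad_slaves = set()

        def report_error(self, slave_name):
            """Record an error (e.g. a StartJob error) and update the bad slave list."""
            self.errors_per_slave[slave_name] += 1
            if (not self.ignore_bad_slave_error_limit
                    and self.errors_per_slave[slave_name] >= self.bad_slave_error_limit):
                self.bad_slaves.add(slave_name)

        def may_pick_up(self, slave_name, other_good_jobs_available=True):
            """A bad slave only reattempts the job if Frequency is non-zero
            and there are no other good jobs for it to render."""
            if slave_name not in self.bad_slaves:
                return True
            return self.reattempt_frequency > 0 and not other_good_jobs_available

    job = JobFailureModel(bad_slave_error_limit=3)
    for _ in range(4):
        job.report_error("render-node-07")
    print(job.bad_slaves)                       # {'render-node-07'}
    print(job.may_pick_up("render-node-07"))    # False, since Frequency is 0

With an error limit of 3, Frequency at 0, and the per-job override unchecked, the last two lines show the intended outcome: the slave lands on the bad list after its third error and never picks the job up again.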

Cheers,
Ryan

Hi,

  1. Our Frequency is set to 0%, and had been for at least 12 hours before the errors in the image. I changed it to 0% hoping it would fix the issue, but it hasn’t.

  2. The jobs do not have Ignore Bad Slave Error Limit checked.

  3. The bad slave detection limit has been set to 3 for at least a day. It was set to 5 prior to that, though.

Thanks,
A

Interesting. What happens if you restart the slaves? Maybe the slaves aren’t reloading these settings properly while they’re already running…

I’ve restarted the slaves and they still aren’t being added to the bad list.

Weird. We’re going to take a look through the code and see if we can find anything that might explain how this could happen.

Cheers,
Ryan

I think we’ve tracked this down. The fix we’re implementing will be included in the next beta release, so you’ll have to test it and let us know if you continue to see this problem.

Cheers,
Ryan

Awesome, thanks. We will let you know.

-A

Hi,

We’re on 6.2.0.26 R and still having this issue.

Thanks,
A

Hmm, we tested this quite a bit before 6.2.0.26 was released, and it seemed to be working just fine. Maybe the issue we fixed is different than the one you’re seeing. I could ask you to upgrade to the release version of 6.2, but I doubt any changes made between then and now would make a difference.

Just to confirm, are all slaves running 6.2.0.26? You can tell in the slave list in the Monitor by looking at the Version column. Also, were the jobs affected by this submitted before or after upgrading to 6.2.0.26? Not sure if it matters, but good to know nonetheless.
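
If you have a lot of slaves, a quick way to spot stragglers is to copy the Name and Version columns out of the Monitor’s slave list into a CSV and run something like the sketch below. The file name, column headers, and version string here are assumptions; adjust them to match what you actually export.

    # Flags slaves whose Version column doesn't mention the expected build.
    # Assumes a CSV named slave_list.csv with "Name" and "Version" headers.
    import csv

    EXPECTED = "6.2.0.26"

    with open("slave_list.csv", newline="") as f:
        outdated = [row["Name"] for row in csv.DictReader(f)
                    if EXPECTED not in row["Version"]]

    if outdated:
        print("Slaves not reporting %s:" % EXPECTED)
        for name in sorted(outdated):
            print("  " + name)
    else:
        print("All slaves report %s." % EXPECTED)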

Finally, can you upload an example of a StartJob error message from a job that is affected by this issue?

Thanks!
Ryan

Ahh, it does look like a few of the slaves failed to update! I think that was the issue then. Thanks!
