Thoughts on retry "fuzzing" for slaves

Right now, if a non-fatal error occurs on a task, the task is re-queued and the same slave retries it immediately. It would be really nice if there were some logic to prevent the same slave from hammering the same task repeatedly. It could be as simple as a sleep delay between tries (to give another slave a chance to jump in) and/or making the slave jump to the next available task (if any are available, and if that job still qualifies as the next candidate for dequeueing). Something like the sketch below is roughly what I'm picturing.
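Rough sketch in Python of the behaviour I mean; the queue/task structures and names here are made up for illustration, not the actual repository or slave API:

```python
# Sketch of "retry fuzzing": after a non-fatal error, the slave sleeps a fuzzed
# delay and prefers a different task on its next dequeue. All names are hypothetical.
import random
import time

RETRY_DELAY_BASE = 5.0   # base back-off in seconds (assumed tunable)
MAX_TASK_ERRORS = 4      # errors before a task is marked Failed (matches our settings)

class NonFatalError(Exception):
    pass

def pick_task(queue, avoid_task_id=None):
    """Prefer a task other than the one this slave just errored on, if possible."""
    for task in queue:
        if task["id"] != avoid_task_id:
            return task
    return queue[0] if queue else None

def work_loop(queue, run_task):
    last_errored = None
    while queue:
        task = pick_task(queue, avoid_task_id=last_errored)
        try:
            run_task(task)
            queue.remove(task)
            last_errored = None
        except NonFatalError:
            task["errors"] += 1
            if task["errors"] >= MAX_TASK_ERRORS:
                task["status"] = "Failed"
                queue.remove(task)
                continue
            # Fuzzed sleep gives another idle slave a chance to grab the task first.
            time.sleep(RETRY_DELAY_BASE * random.uniform(0.5, 1.5))
            last_errored = task["id"]
```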

We have our repository settings such that a task is marked as Failed after 4 errors, and a slave is marked as “bad” after 10 consecutive errors for the same job. This means that a slave that goes “rogue” can fail 2 tasks completely without giving any other machine a chance to touch them, since two full task failures only account for 8 of the 10 consecutive errors needed to flag the slave.

Thoughts on this idea?

Hello Nathan,

Thanks for the suggestion. I will run it by the devs and see what they think, but I think it sounds like a decent feature.

Cheers,