[feature request] slave failure throttling

We have a custom system in place that auto-disables slaves that fail consecutive tasks (for any reason). I figured I would post the idea here, as it may be useful for others too. Basically, each slave has a counter for consecutive task failures that is reset whenever a task completes successfully. When the counter reaches 10, a warning email is sent to the wranglers, and when it reaches 20, the slave is auto-disabled and a comment about the auto-disable is added to the slave’s description.
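
For anyone who wants to try something similar, here is a rough sketch of the counter logic in Python. It is only an illustration of the idea described above, not our actual implementation: the class name `FailureThrottle` and the `send_warning` / `disable_slave` hooks are hypothetical placeholders for whatever your farm uses to send mail and disable a slave, and they are not part of any render manager's API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

WARN_THRESHOLD = 10     # consecutive failures before the wranglers are emailed
DISABLE_THRESHOLD = 20  # consecutive failures before the slave is auto-disabled


@dataclass
class FailureThrottle:
    # Hypothetical hooks (assumptions): plug in your own mail and disable logic.
    send_warning: Callable[[str, int], None]   # (slave name, failure count)
    disable_slave: Callable[[str, str], None]  # (slave name, description comment)
    counters: Dict[str, int] = field(default_factory=dict)

    def task_succeeded(self, slave: str) -> None:
        # Any successful task resets the consecutive-failure counter.
        self.counters[slave] = 0

    def task_failed(self, slave: str) -> None:
        count = self.counters.get(slave, 0) + 1
        self.counters[slave] = count
        # Compare with == so each action fires once per failure streak.
        if count == WARN_THRESHOLD:
            self.send_warning(slave, count)
        elif count == DISABLE_THRESHOLD:
            comment = f"Auto-disabled after {count} consecutive task failures"
            self.disable_slave(slave, comment)
```

You would call `task_failed` and `task_succeeded` from wherever task results are reported. The two thresholds are the values we use; they would probably need tuning for farms with very short or very long tasks.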

This allows us to catch rogue machines very quickly, and it has also helped us keep jobs clean in the rare event of larger infrastructure problems (the farm essentially shuts itself down instead of failing every single job).

Hey Laszlo,

I’ll add this to the list of core feature requests.