I would definitely be interested in executing code on slaves with X consecutive failures.
Events.DeadlineEventListener’s OnJobFailedCallback seems closest to what I need, except that by the time it fires the job is already dead, which is too late to act on the problem. I can track what failed and how often myself, but I would need an OnJobErrorCallback instead.
The perfect use case would be a single slave lighting up the queue with red: it would take an environment dump and a current usage stat dump, check whether the machine can reach network resources, try to resolve the issues, and finally reboot if needed. If it continues to fail, we can disable it programmatically.
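To make the tracking part concrete, here is a rough sketch of the bookkeeping I would do on my side. It is plain Python and does not call any Deadline API: the class name, the thresholds, and the returned action strings are all made up for illustration, and the assumption is that some per-error callback (the OnJobErrorCallback I am asking for) would feed it the slave name.

```python
from collections import defaultdict


class ConsecutiveFailureTracker:
    """Hypothetical helper: counts consecutive task errors per slave.

    Not part of Deadline. The caller is expected to invoke record_error()
    from a per-error event and record_success() when a task completes.
    """

    def __init__(self, diagnose_after=3, disable_after=6):
        if diagnose_after < 1 or disable_after <= diagnose_after:
            raise ValueError("need 1 <= diagnose_after < disable_after")
        self.diagnose_after = diagnose_after
        self.disable_after = disable_after
        self._streaks = defaultdict(int)

    def record_success(self, slave_name):
        # Any successful task breaks the streak for that slave.
        self._streaks.pop(slave_name, None)

    def record_error(self, slave_name):
        """Register one error and return the suggested action.

        "diagnose" -> dump environment/usage, check network, try a fix or reboot
        "disable"  -> the slave kept failing after diagnosis; take it offline
        "none"     -> nothing to do yet
        """
        self._streaks[slave_name] += 1
        streak = self._streaks[slave_name]
        if streak >= self.disable_after:
            return "disable"
        if streak == self.diagnose_after:
            return "diagnose"
        return "none"
```

The event plugin would then map "diagnose" and "disable" onto whatever the scripting API actually offers for running diagnostics, rebooting, and disabling a slave. Those calls are deliberately left out here because they depend on the callback being available.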