At previous places I’ve worked, we’ve had hosts suddenly go bad (hard drive, network, or memory failure) and chew through jobs, marking them all as errored within a few seconds. I see Deadline has an option to mark a slave as bad for a job after a number of errors, but I’d like to disable the slave with a comment (and email us) once a set number of consecutive errors, from any jobs, have occurred on a single host.
The way it was implemented previously was with the equivalent of a post-task script that ran on the host after every task, whether it succeeded or errored. On error it would touch a temp file; if more than N temp files existed on that host, it would shut the host down and flag the admins. On success it would delete any temp files. This system worked well because it was decentralized, and it coped fairly well with multiple jobs running on the same host.
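To make the temp-file logic concrete, here is a minimal Python sketch of the idea. It is an illustration under assumptions, not the original script: the function name `record_result`, the `task_error_` file prefix, the marker directory, and the `on_too_many_errors` hook are all hypothetical, and nothing here calls the Deadline API. The hook is where disabling the slave and emailing the admins would go.

```python
import os
import uuid

MARKER_PREFIX = "task_error_"  # hypothetical naming scheme for marker files


def record_result(succeeded, marker_dir, max_errors, on_too_many_errors):
    """Update the per-host error markers after a task finishes.

    Returns the number of error markers left on the host.
    """
    os.makedirs(marker_dir, exist_ok=True)

    if succeeded:
        # Any success on this host clears the error streak.
        for name in os.listdir(marker_dir):
            if name.startswith(MARKER_PREFIX):
                try:
                    os.remove(os.path.join(marker_dir, name))
                except FileNotFoundError:
                    pass  # another task on this host removed it first
        return 0

    # One uniquely named marker per failure, so concurrent tasks
    # on the same host never overwrite each other's markers.
    marker = os.path.join(marker_dir, MARKER_PREFIX + uuid.uuid4().hex)
    open(marker, "w").close()

    count = sum(
        1 for name in os.listdir(marker_dir) if name.startswith(MARKER_PREFIX)
    )
    if count > max_errors:
        # Placeholder hook: disable the host and notify the admins here.
        on_too_many_errors(count)
    return count
```

Because the state lives in files on the host itself, no central service has to track error counts, and a single success from any job resets the streak.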
I have that written, but I’m not sure how to trigger it. Post Task Scripts only seem to run on success, and OnJobFailedCallback only runs once for the job, not for each task. There were threads about adding OnTask callbacks to Deadline, but there were concerns about performance (which isn’t a concern for me).
Any suggestions?