disable slave on consecutive job errors

pfranz · September 22, 2016, 1:15am

At previous places I’ve worked, we’ve had hosts suddenly go bad (hard drive, network, or memory failure), and a bad host will chew through jobs, marking them all as errored within a few seconds. I see Deadline has an option where slaves will be marked bad for a job after a number of errors, but I’d like to disable the slave with a comment (and email us) after a number of consecutive errors on a single host, across any jobs.

The way it was implemented previously was with the equivalent of a post-task script that ran on the host after either success or error. On error, it would touch a temp file; if more than N temp files existed on that host, the host would close itself down and flag the admins. On success, it would delete any temp files. This system worked well because it was decentralized, and it coped fairly well with multiple jobs on the same host.
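For anyone reading along, the temp-file counter described above could be sketched roughly like this. It is plain Python with no Deadline calls, and the names (`MARKER_DIR`, `MAX_CONSECUTIVE_ERRORS`, `record_error`, `record_success`) are made up for illustration. What to do when the limit is hit (disabling the slave, notifying admins) is left to the caller.

```python
import os
import tempfile
import uuid

# Hypothetical defaults for illustration; pick whatever suits your farm.
MARKER_DIR = os.path.join(tempfile.gettempdir(), "consecutive_task_errors")
MAX_CONSECUTIVE_ERRORS = 5


def record_error(marker_dir=MARKER_DIR, limit=MAX_CONSECUTIVE_ERRORS):
    """Touch one marker file for this error.

    Returns True once more than `limit` marker files exist on this host.
    """
    try:
        os.makedirs(marker_dir)
    except OSError:
        # The directory may already exist; anything else is a real error.
        if not os.path.isdir(marker_dir):
            raise
    # A unique name per error, so concurrent tasks never share a marker.
    marker = os.path.join(marker_dir, uuid.uuid4().hex + ".err")
    with open(marker, "w"):
        pass
    return len(os.listdir(marker_dir)) > limit


def record_success(marker_dir=MARKER_DIR):
    """Delete all marker files, resetting the consecutive-error count."""
    if not os.path.isdir(marker_dir):
        return
    for name in os.listdir(marker_dir):
        try:
            os.remove(os.path.join(marker_dir, name))
        except OSError:
            pass  # another task on this host may have removed it first
```

The error path would call `record_error()` and act when it returns True; the success path would call `record_success()`. Because each error writes its own uniquely named file, several tasks running on the same host can record errors at the same time without clobbering each other.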

I have that written, but I’m not sure how to trigger it. Post Task Scripts only seem to run on success, and OnJobFailedCallback only runs once for the job, not for each task. There were threads about adding OnTask callbacks to Deadline, but there were concerns about performance (which isn’t a concern for me).

Any suggestions?

panze · September 22, 2016, 5:34am

Maybe OnJobErrorCallback?

eamsler · September 22, 2016, 5:51pm

+1 for the OnJobErrorCallback.

That will trigger whenever a task generates an error. It’s possible for Deadline to send e-mails on errors, so coding that might be a bit redundant. Adding up the errors and failing the job should be good.

For state, you can use RepositoryUtils.GetSlaveReports().GetSlaveReports() and see if they’ve generated X bad reports in a row to disable the Slave. I believe the ReportError on the log will be a good way to see if the report was a success or not.
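The "X bad reports in a row" check itself is simple once the report outcomes are in hand. A minimal sketch, assuming the outcomes have already been pulled out of the slave reports into a list of booleans ordered oldest first (True meaning the report was an error); the function names are hypothetical, and how to read that flag from a Deadline report is not shown here:

```python
def trailing_errors(outcomes):
    """Count consecutive errors at the end of `outcomes` (oldest first)."""
    count = 0
    for is_error in reversed(outcomes):
        if not is_error:
            break  # the most recent success ends the streak
        count += 1
    return count


def should_disable(outcomes, limit):
    """True when the latest `limit` or more reports were all errors."""
    return trailing_errors(outcomes) >= limit
```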

docs.thinkboxsoftware.com/produc … d794a6b4ed
docs.thinkboxsoftware.com/produc … 673e73ac9f
docs.thinkboxsoftware.com/produc … 997cc31365

pfranz · September 26, 2016, 11:00pm

Thanks for the responses. I was able to set it up by doing a post-task script looking for successes and OnJobError() for failures.

I went with temp files on the local box for simplicity and to distribute things. If I had every single task try to generate a report of the last ‘n’ jobs on a host, I’d be worried about the undue load.