I have job failure detection settings configured to mark a job as failed after M errors, and to mark a task as failed after N errors. However, Deadline does not seem to count all errors toward these limits.
The specific example I have is a missing plugin:
Could not find plugin named "XXXX", path does not exist: repo_path/plugins/XXXX
When this error occurs, slaves keep hammering away at the same job forever rather than respecting the failure thresholds defined in the repository settings. I don’t know if there are other types of errors that ignore the failure detection settings, but I will add them here if I run across them.
Thanks for reporting this! I have a feeling it’s because of the timing of this error. Just to confirm, when this error happens, do error reports get generated for the job? Also, does the job’s error count increase?
We just had a job that ran up over 48,000 errors like this over the weekend and prevented any other work in that pool from getting done. It seems we’re going to need one or more cron jobs to keep an eye on Deadline itself.
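For anyone wanting a stopgap, the kind of cron check I have in mind looks roughly like the sketch below. It only shows the threshold logic: `JobStatus`, `jobs_over_limit`, and the sample data are made-up names for illustration, not Deadline API calls. How you actually fetch the job list and error counts, and how you suspend or fail the offending jobs, depends on your Deadline version and is left out here.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class JobStatus:
    """Hypothetical record for one job; not a Deadline type."""
    job_id: str
    pool: str
    error_count: int


def jobs_over_limit(jobs: Iterable[JobStatus], max_errors: int) -> List[JobStatus]:
    """Return the jobs whose error count has reached the limit."""
    return [job for job in jobs if job.error_count >= max_errors]


if __name__ == "__main__":
    # Sample data standing in for whatever the farm query would return.
    sample = [
        JobStatus(job_id="job-a", pool="render", error_count=3),
        JobStatus(job_id="job-b", pool="render", error_count=48000),
    ]
    for job in jobs_over_limit(sample, max_errors=100):
        # A real watchdog would suspend or fail the job here, or send an alert.
        print(f"{job.job_id} in pool {job.pool} has {job.error_count} errors")
```

Run from cron every few minutes, something like this would at least flag a runaway job before it starves the rest of the pool.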
Well, the good news is that this bug has already been fixed in Deadline 7. The beta just started yesterday if you want to play with the initial build. Our beta application process has changed a bit, so you’ll have to reapply by sending an email to beta@thinkboxsoftware.com.
Is there any chance the fix for this could be backported to 6.2.1? We had an extended power loss over the weekend that took out one of our main farm switches, and even though things came back, Deadline did not handle it at all. We had multiple jobs with 20,000+ of these errors, and the farm was completely useless until the launcher and slaves were all manually killed and restarted this morning. Luckily we have MCollective available, otherwise I might have lost my mind…