I'm having jobs fail because one slave has a problem. I never had this issue with BB, and it's very frustrating.
Is there a setting somewhere that will tell Deadline to not fail and stop rendering the job simply because one slave has a problem? Isn’t there a way to alert me to the problem with a slave without stopping the other slaves from rendering? Should I be bumping up the error limit to some high number or something?
Normally, if one slave fails N times (default is 5 I think), it should be marked as bad and stop trying.
Also, after 50 errors the job should send you an email (if you have email notification enabled) warning that it is having issues.
By default it should fail after 100 errors (and also send you an email about it).
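The thresholds described above (slave marked bad after 5 errors, warning email at 50, job failed at 100) can be summarized in a small sketch. This is purely illustrative pseudologic, not Deadline's actual API; the function and field names here are assumptions for clarity.

```python
# Illustrative sketch of the default failure-detection policy described above.
# Names and data structures are assumptions, NOT Deadline's real API.

BAD_SLAVE_ERRORS = 5    # slave is marked bad for the job after this many errors
WARN_EMAIL_ERRORS = 50  # job sends a warning email at this many total errors
FAIL_JOB_ERRORS = 100   # job is marked failed at this many total errors

def send_email(job, message):
    # Stand-in for Deadline's email notification.
    print(f"[email] {job['name']}: {message}")

def handle_error(job, slave):
    """Record one render error and apply the three thresholds."""
    job["errors"] = job.get("errors", 0) + 1
    slave["errors"] = slave.get("errors", 0) + 1

    if slave["errors"] >= BAD_SLAVE_ERRORS:
        slave["bad"] = True  # this slave stops picking up this job

    if job["errors"] == WARN_EMAIL_ERRORS:
        send_email(job, "job is generating errors")  # warning only

    if job["errors"] >= FAIL_JOB_ERRORS:
        job["failed"] = True  # the whole job stops rendering
        send_email(job, "job marked as failed")
```

The key point the reply is making: a single bad slave hits its per-slave limit and drops out long before the job-wide limit fails the whole job, so the other slaves keep rendering.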
You can completely override the error detection per job, or in SMTD under the Job Failure Detection rollout, which was added recently. The SMTD overrides should be sticky between sessions (but that can be changed if needed).
When you have a farm with 250+ nodes and thousands of jobs going through per day, you don't want a job that is consistently erroring out to generate thousands of errors and keep the slaves occupied instead of letting them move on. That's why we have these defaults. But you should be able to customize it all…
Monitor>Tools>Configure Repository Options>Job Settings>Failure Detection>Mark a job as failed after it has generated this many errors
With that option turned off, the job is no longer failed because of a single problem slave.
The slave on my workstation (which has successfully rendered Deadline jobs before) errored because it couldn't load MAX2010_64 for some reason. I rebooted the system and tried again, to no avail. I rebooted again and then it worked. I've attached the error screenshot.