Automatically retrying a job/task after it fails


#1

Hello there,

Since I am trying to set up Deadline to do our bidding, I am also wondering how to properly set up jobs so that they are retried at least once after failing. We currently have the Job Override set to 1 and the Task Override set to 2.

But when tasks fail, they are not requeuing; they just fail. Is there another setting that does that?

Thank you


#2

We don’t usually need the job-specific overrides. You can set up your global failure settings from Monitor --> Tools --> Configure Repository Options --> Job Settings --> Failure Detection

https://docs.thinkboxsoftware.com/products/deadline/10.0/1_User%20Manual/manual/failure-detection.html#job-failure-detection

Ours are currently set up like this.


#3

Yes, but this only sets when a job should fail. What I want is for a job that fails to be automatically requeued at least once.


#4

@Panayiotis

In @panze's image, you want to change the setting “Mark a task as failed if it has generated this many errors” to 2.
This would allow each task to run twice before it is marked as failed.

You can also set this at the job level, instead of letting the repository setting apply to everything:

Hope this helps.

edit: pasted the wrong image… corrected


#5

Hello,

Thanks for the help. So now we have the Override Task Error Limit at 2, which, as I understand it, should retry the job at least once. Please do correct me if I am wrong.

But when a job is running and fails, I don’t see the job requeuing when I look at the history. Is Deadline just retrying without putting that in a log? Shouldn’t I see a retry from Deadline in the history or the job logs?


#6

I don’t see much benefit in requeuing a job over giving a task a few extra tries. In fact, with a requeue you’d also be re-rendering all the successful tasks, which I think is a potentially considerable waste of resources. What are you hoping to achieve with a requeue that you don’t get from a few extra tries after a task error?

I don’t think there’s native functionality for what you want. But you could easily build it with an OnJobFailedCallback event script, if needed.


#7

Hello, I am sorry for being unclear. I don’t want the whole job to get requeued, just the failing tasks. Also, just to clarify: when tasks retry, do they leave a footprint? I mean, can I see in the logs that the task was retried? The reason we want to requeue some tasks is that we want to make sure the failure is not a machine issue but something else. We want tasks to fail only after being tried by a couple of different machines. This saves us the time of going through and checking every failed task each time.

So the idea is that a task will stay failed only if it has been retried at least once or twice.

So far, every time I go to a failed task’s reports and history, I see that it was only tried once. Nothing is logged as having been retried or requeued, even though we have the task error override set to 2.

Hope this makes sense.


#8

Ah, OK. That clarifies it a lot. Basically, a task only fails after it has used up its allotted number of errors (you can see the number of error reports for that task).

But what you want is to mark the slave as bad for that job (see the image I posted earlier) and set that number lower than the number of errors it takes to mark a task as failed.

So in our case a task can error out 5 times before it is marked as failed, but a slave can only fail 4 times (in a row) before it is blocked from rendering that job. Another slave then gets one go at the task, and only when that also fails is the task marked as failed.
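To make the interaction between the two limits concrete, here is a small stand-alone simulation. It is only a sketch and does not use Deadline itself: the function name, the worker names, and the assumption that the same machine keeps picking up the task until it is blocked are all mine, chosen to show the worst case.

```python
def simulate_failing_task(workers, task_error_limit, worker_bad_limit):
    """Return, in order, the workers that attempt a task which always errors.

    Assumption (worst case, not Deadline's real scheduler): the first worker
    that is not yet blocked for the job always picks the task up again.
    """
    attempts = []
    errors_per_worker = {name: 0 for name in workers}
    while len(attempts) < task_error_limit:
        # Workers that have reached the bad limit are blocked for this job.
        available = [w for w in workers if errors_per_worker[w] < worker_bad_limit]
        if not available:
            break  # every worker is blocked, nobody is left to try
        worker = available[0]
        attempts.append(worker)
        errors_per_worker[worker] += 1
    return attempts


# Task limit 5, worker limit 4: a second machine gets exactly one try.
print(simulate_failing_task(["A", "B"], 5, 4))  # ['A', 'A', 'A', 'A', 'B']

# Task limit 2, worker limit 5: both tries can land on the same machine.
print(simulate_failing_task(["A", "B"], 2, 5))  # ['A', 'A']
```

The second call illustrates why a task error limit of 2 on its own does not guarantee that a different machine ever sees the task.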


#9

And if that is still not what you want, then I’d look into event scripts and build an event that collects all the failed tasks from a job and requeues them. You’d probably also want some sort of tolerance cut-off built in, so that you don’t end up re-rendering a job that is simply faulty to begin with.
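A rough sketch of the decision logic such an event could use is below. It is plain Python with no Deadline API calls: `plan_requeue`, its parameters, and the state strings are hypothetical names for illustration, and the actual requeue would have to go through Deadline's scripting API inside the event plugin, which is not shown here.

```python
def plan_requeue(task_states, requeue_counts, max_requeues=2, max_failed_fraction=0.5):
    """Return the IDs of failed tasks that should be requeued.

    task_states:    dict of task ID -> state string, e.g. "Completed" or "Failed"
    requeue_counts: dict of task ID -> how often this script already requeued it
    Both thresholds are illustrative defaults, not Deadline settings.
    """
    failed = [task_id for task_id, state in task_states.items() if state == "Failed"]
    if not failed:
        return []
    # Tolerance cut-off: if most of the job failed, the job itself is probably
    # faulty, so requeuing would only waste render time.
    if len(failed) / len(task_states) > max_failed_fraction:
        return []
    # Only requeue tasks that still have retry budget left.
    return [t for t in failed if requeue_counts.get(t, 0) < max_requeues]


states = {0: "Completed", 1: "Failed", 2: "Completed", 3: "Failed"}
print(plan_requeue(states, {1: 2}))  # [3]  (task 1 has used up its budget)
```

Keeping the decision separate from the requeue call makes it easy to test the cut-off values before wiring them into a real event.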


#10

Thank you, that actually helps a lot.

Just to clarify, though.

As you said, a task only fails once the number of retries has been reached.
Where can I actually see the number of retries? Will the job report show that the task has been tried 3 or 4 times, for example?


#11

You can see it from the number of errors (and error reports) the task has generated. One place to view that is the job's task list window in the Monitor. I’m not sure whether it’s visible anywhere else as well.