The Interrupt System conflicts with Failure Detection

freed_kenneth · December 13, 2019, 3:16am

In short, we have implemented both the Interrupt System and Failure Detection at the same time.
The issue is that when a job gets too many errors from a slave and marks that slave as Bad, the job still tries to get that slave to render, so it interrupts other jobs. Once the job has interrupted another job and obtained the slave, it releases the slave again because the slave is marked as Bad.

The job keeps looping between getting the slave and releasing it. This is an infinite loop.

Is this a bug in Deadline, or is something wrong in my settings?

My Deadline version is 10.0.27.3

Justin_B · December 13, 2019, 3:29pm

Based on your explanation, it sounds like you don’t have any job failure detection set up, or your error count may be a little too high.

Ideally, that job should be failed once it generates too many errors, which would stop the loop.

Feel free to share what your settings look like if I’m off the mark here.

freed_kenneth · December 17, 2019, 4:15am

Sorry for my misleading wording. What I mean is “Slave Failure Detection”, and I have set it to 5.

Let me write you an example of the issue:
Suppose Job A has priority 99 and Job B has priority 50.

1. Slave A is rendering Job A. Slave A then gets 5 errors on Job A, so Job A marks Slave A as a Bad slave and releases it.
2. Slave A searches for an available job and finds Job B.
3. As soon as Slave A starts on Job B, Job A interrupts Slave A on Job B because Job A has the higher priority.
4. Having interrupted Slave A, Job A sees that it is a Bad slave, so Job A does not take Slave A.
5. Slave A repeats steps 2 to 4 until Job A is finished.
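
To make the loop easier to follow, here is a small toy simulation of the steps above. It is only a sketch of the behaviour I am seeing, not Deadline's actual scheduler or API: the `Job` class, `report_error`, `simulate`, and the `interrupt_checks_bad_list` flag are all names I made up for illustration. The flag shows what I would expect to happen if the interrupt decision looked at the job's Bad slave list before pulling the slave away; whether Deadline can do this is an assumption on my part.

```python
from dataclasses import dataclass, field

SLAVE_ERROR_LIMIT = 5  # the "Slave Failure Detection" value from my settings


@dataclass
class Job:
    """Hypothetical stand-in for a job; not a Deadline class."""
    name: str
    priority: int
    errors: dict = field(default_factory=dict)   # slave name -> error count
    bad_slaves: set = field(default_factory=set)


def report_error(job, slave):
    """Count one error and mark the slave as Bad once the limit is reached."""
    job.errors[slave] = job.errors.get(slave, 0) + 1
    if job.errors[slave] >= SLAVE_ERROR_LIMIT:
        job.bad_slaves.add(slave)


def simulate(cycles, interrupt_checks_bad_list):
    """Replay the steps for at most `cycles` rounds and return an event log."""
    job_a = Job("Job A", 99)
    job_b = Job("Job B", 50)
    slave = "Slave A"
    log = []

    # Step 1: Slave A errors 5 times on Job A and is marked Bad.
    for _ in range(SLAVE_ERROR_LIMIT):
        report_error(job_a, slave)
    log.append("Job A releases Slave A (marked bad)")

    for _ in range(cycles):
        # Step 2: Slave A picks up the lower-priority Job B.
        log.append("Slave A starts Job B")

        # Step 3: Job A wants to interrupt because it has higher priority.
        wants_interrupt = job_a.priority > job_b.priority
        if interrupt_checks_bad_list and slave in job_a.bad_slaves:
            wants_interrupt = False  # assumed fix: skip slaves already marked Bad
        if not wants_interrupt:
            log.append("Slave A keeps rendering Job B")
            break
        log.append("Job A interrupts Slave A on Job B")

        # Step 4: Job A notices the slave is Bad and does not take it.
        if slave in job_a.bad_slaves:
            log.append("Job A rejects Slave A (bad slave)")
    return log


if __name__ == "__main__":
    for line in simulate(cycles=3, interrupt_checks_bad_list=False):
        print(line)
```

With `interrupt_checks_bad_list=False` the start/interrupt/reject sequence repeats for every cycle, which matches what I see on the farm. With it set to `True`, Slave A simply stays on Job B.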

I hope you can understand the issue.
Thank you.

freed_kenneth · February 10, 2020, 11:07am

Sorry to bump this thread.
Is there any news about my issue?