I'm running a render on 10 nodes with Max2008 X64 and the Vray plugins.
The render starts without any problem.
On one of these nodes, Max2008 fails to load the Vray plugin.
Deadline 3.0 then stops all of the other renders on the other nodes, with this issue in the slave log:
Info Thread - Cancelling task because task filename "\\deadlinebeta\DeadlineRepositoryBeta\jobs\999_080_999_0aeec3ac\Rendering\999_080_999_0aeec3ac_00017_18-18.Blade266001" could not be found, it was likely requeued sending cancel task command to plugin
It seems that once a node fails, all of the files in the repository under JOBID\Rendering… are deleted.
A few questions:
- Does this problem only happen after that one render node fails on a job because it can't load the Vray plugin?
- If you blacklist this one machine, do the jobs still have this problem?
- When you check the job folder in the repository, can you check the other task status folders (i.e. Complete, Queued, Failed, etc.) to see if those task files have moved somewhere else?
- When you refresh the job in the Monitor, what do you see?
- Finally, do you only see this problem with Max/Vray jobs, or is it with all of your jobs?
The only thing that immediately comes to mind is that the one bad slave is causing the job to fail, which moves the tasks out of the queued/rendering state to the failed state. If this isn’t the case, any additional information you can provide would be helpful.
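To illustrate what I mean, here's a rough sketch (not Deadline's actual code, just the behavior I'm describing) of how failing a job pulls every unfinished task out of the Queued/Rendering states at once:

```python
# Sketch only: one job-level failure fails every task that hasn't
# finished, regardless of which slave caused the error.

class Task:
    def __init__(self, task_id):
        self.task_id = task_id
        self.status = "Queued"   # Queued -> Rendering -> Complete/Failed

class Job:
    def __init__(self, num_tasks):
        self.status = "Active"
        self.tasks = [Task(i) for i in range(num_tasks)]

    def fail(self):
        # When the job itself is marked Failed, every task that hasn't
        # completed is failed along with it -- which would explain the
        # task files all leaving the Rendering folder together.
        self.status = "Failed"
        for task in self.tasks:
            if task.status in ("Queued", "Rendering"):
                task.status = "Failed"

job = Job(num_tasks=10)
job.tasks[0].status = "Rendering"
job.fail()                            # one bad slave triggers this
print(job.status)                     # Failed
print({t.status for t in job.tasks})  # {'Failed'}
```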
I resolved my plugin issue, and have now tested it another way. I started the same render again, and it started fine. I waited a few minutes and then killed the Max process on one node. Same result: every render was stopped.
Then I renamed a Vray DLL on one node to simulate a Vray plugin issue and started my render on 9 nodes, with the "failing node" disabled. After those 9 nodes had each rendered 1 task successfully, I enabled my "failing node": same issue, all tasks stopped.
During the render, all of the "lock" files are in the Queued folder, and the frames being rendered are in the Rendering folder. Once a node fails, all of the "lock" files move to the Failed folder.
The job changes status from Active to Failed, and the same goes for its tasks.
OK, so at least we know that the job is failing, which is a normal Deadline operation. Now we just need to figure out if the job should be failing in the first place. In the Monitor, enter super user mode and select Tools -> Configure Repository Options. Can you let me know what the values are for the five settings under the Failure Detection section?
Yup, you’ll definitely want to change these settings. The way it’s currently configured, every job will fail after a single error. The default settings are:
0
5
0
0
0
If you want to use failure detection, you should increase these values. Here at Frantic, we use:
100
5
0
50
0
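To show why the limits matter, here is a hypothetical illustration of how a job-error threshold would behave. The variable names are my own, not Deadline's actual setting names, and it assumes a limit of 0 disables that check while a limit of N fails the job once N errors have accumulated:

```python
# Assumption: 0 disables failure detection; N > 0 marks the job as
# Failed once it has generated N errors.

def should_fail_job(error_count, job_error_limit):
    """Return True if the job should be marked Failed."""
    if job_error_limit == 0:
        return False          # 0 = this failure check is disabled
    return error_count >= job_error_limit

# With a limit of 1, the first error from a single bad slave fails
# the whole job and cancels every other node's task:
print(should_fail_job(error_count=1, job_error_limit=1))    # True

# With a higher limit like 100, a few errors from one node don't
# take the rest of the farm's tasks down with them:
print(should_fail_job(error_count=1, job_error_limit=100))  # False
```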