Hi,
I have a very frustrating issue where some of the slaves refuse to render and cancel their tasks. The errors say that they are stalled.
It’s a mix of powerful and weak machines, so I don’t think it’s a performance issue. I can also access them just fine on my network at any time. The error occurs on different jobs as well.
The machines in question are 5 of the 18 slaves. I attached two of the logs, since they all show the same error:
" 2012-06-09 06:21:13: Scheduler Thread - Cancelling task because task filename “\server\Dead Line\jobs\999_100_999_1311ad30\tasks\999_100_999_1311ad30_00014_300-309.Rendering.Farm4” could not be found, it was likely requeued
2012-06-09 06:21:13: sending cancel task command to plugin"
Also, for a farm of 18 slaves in total, including user machines, is it necessary to activate Pulse? Keep in mind that the user and farm machines are rarely all rendering at the same time.
Slave_Logs.zip (1.99 MB)
Can you post a few stalled slave reports from the jobs that this problem occurred for? We can look to see why they are getting marked as stalled.
One thing to check is that all of your machines’ dates/times are in sync (including time zones). If they are not, this can cause false positives in stalled slave detection. The stalled slave reports can help confirm whether this is the case, because if it is, the amount of time the slave has been “stalled” for is almost exactly X hours, where X is the time difference.
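As a rough illustration of that check, here is a minimal Python sketch that tests whether a stalled duration read from a report is close to a whole number of hours. The function name `looks_like_clock_offset` and the 5-minute tolerance are assumptions made for this example; nothing here is part of Deadline, and the duration has to be copied from the report by hand.

```python
from datetime import timedelta

def looks_like_clock_offset(stalled_for, tolerance=timedelta(minutes=5)):
    """Return True if stalled_for is within tolerance of a whole number of hours (>= 1).

    Hypothetical helper: a duration this close to N hours hints at a
    clock or time zone mismatch rather than a real stall.
    """
    seconds = stalled_for.total_seconds()
    nearest_hours = round(seconds / 3600)
    if nearest_hours < 1:
        # Shorter than roughly an hour: not explained by a whole-hour offset.
        return False
    return abs(seconds - nearest_hours * 3600) <= tolerance.total_seconds()

# Example: a slave reported as stalled for 3 hours and 2 minutes.
print(looks_like_clock_offset(timedelta(hours=3, minutes=2)))
```

If most of the reports give `True` with the same number of hours, a date/time or time zone mismatch is the more likely cause. If the durations are scattered, the slaves are probably being marked as stalled for some other reason.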
We are working on fixing this for Deadline 6, so that stalled times aren’t affected by machines with different date/times.
Cheers,
Interesting… Will check this and get back to you.
Thanks
I checked all the machines and their date/time settings are all good and synced.
I attached some of the log files, please check them.
After I enabled local rendering, the stalled error is now limited to 2 farm machines.
But overall, the stalled error on all the farm machines comes from this:
Cancelling task because task filename “\server\Dead Line\jobs\999_100_999_71ae831c\tasks\999_100_999_71ae831c_00005_655-655.Rendering.Farm1” could not be found, it was likely requeued
2012-06-13 16:06:29: sending cancel task command to plugin
//server/ is where the repository lives for us.
Slave_Logs.zip (108 KB)
That’s interesting. It sounds like you might be having a network load problem. By enabling local rendering, that would have greatly reduced the impact on your network because now the rendered images are only sent over the network when they’re finished, rather than being written on the fly over the network.
This issue you’re seeing can occur when network load prevents the slave from being able to access its current task file on the repository. We have made improvements to this system in 5.1, and I noticed you are running 5.0, so upgrading will likely help here. If it doesn’t, at least more verbose information is printed out when the problem occurs.
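If you want to see whether the repository share becomes slow or unreachable under load, a small probe run from one of the affected machines can help. This is only a diagnostic sketch in Python, not how the slave itself checks its task file. The function name `probe_path`, the number of attempts, and the delay are all assumptions for illustration, and the path to test is whatever file or folder on your repository you choose.

```python
import os
import time

def probe_path(path, attempts=5, delay=1.0):
    """Check `path` several times and record (found, seconds_taken) for each try.

    Hypothetical diagnostic helper: slow or failed checks during a render
    suggest the share is struggling under network load.
    """
    results = []
    for i in range(attempts):
        start = time.monotonic()
        found = os.path.exists(path)
        elapsed = time.monotonic() - start
        results.append((found, elapsed))
        if i < attempts - 1:
            time.sleep(delay)  # pause between checks, but not after the last one
    return results

if __name__ == "__main__":
    import sys
    # Pass a repository path on the command line; defaults to the current folder.
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    for found, elapsed in probe_path(target, attempts=3, delay=0.5):
        print(f"found={found} took={elapsed:.3f}s")
```

Running it while the farm is rendering and again while it is idle gives a rough comparison. Checks that take much longer or return `found=False` during renders would be consistent with a network load problem.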
I should also mention that this problem should no longer be an issue in Deadline 6, which we’re working on internally. We’re changing the way slaves keep track of their tasks.
Cheers,