This is pretty bad, as it won’t allow the entire job to be released, and thus the concurrent jobs won’t start.
I don’t really see any reason why Deadline should report that the tasks are still running.
I even had Pulse shutting down slaves after the predefined idle time (via power options), but a job was still reporting some of the machines as rendering!
Hi Lukas,
Hanging tasks/frames is just the nature of network processing. Any one of a million variables could be causing the ‘hang’.
Are you using “auto task timeout”? If not, it’s awesome for solving these little problems. You can wire this functionality up to be enabled by default for particular plugin types, then globally control the settings, such as a time multiplier of x3, and only have it kick in when the job is at 90% completion. We never get hung frames overnight anymore! That said, I’d urge you to always check the re-queue log reports for excessive re-queuing, as that means something else isn’t quite right with whatever plugin/job you may be processing.
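To make the idea concrete, here’s a rough, illustrative sketch of the heuristic I mean (this is not Deadline’s actual code or API; the function name and numbers are just example values):

```python
# Illustrative sketch only -- not Deadline's implementation.
# Once a job is ~90% complete, treat any task running longer than
# (multiplier x the average completed-task time) as hung.

from statistics import mean

def auto_task_timeout_minutes(completed_task_minutes, completion_ratio,
                              multiplier=3.0, min_completion=0.90):
    """Return a timeout in minutes, or None if the rule shouldn't apply yet."""
    if completion_ratio < min_completion or not completed_task_minutes:
        return None  # not enough of the job finished to trust the average
    return multiplier * mean(completed_task_minutes)

# Example: 9 of 10 tasks done, averaging ~4 minutes each -> ~12 minute timeout.
print(auto_task_timeout_minutes([3.5, 4.0, 4.5, 4.0, 3.8, 4.2, 4.1, 3.9, 4.0], 0.9))
```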
Finally, I don’t really like the pre-defined slave restart/shutdown schedule solution, as it introduces inefficiencies for network processing, but I totally understand why you are using it to solve the above-mentioned issues… I’m just wondering if there’s a better way to get the same result.
Mike
Thanks for the tips, I’ll check the auto task timeout options.
But, what do you mean by “pre-defined slave restart/shutdown schedule”?
I use Power Management to shut down idle slaves (after 3 hours) to save power. Nothing else. I don’t use it to “solve” any issues in Deadline, only to save power, as my render farm isn’t running at 100% all the time.
Cool.
Ah, sorry, my misunderstanding.
I thought you were forcibly restarting your farm at set times to ensure you don’t get any ‘stuck frames’, which of course could mean restarting a machine while it’s processing a job and wasting time.
Yeah, power management rocks. WOL has single-handedly saved us big bucks over the years. We like to have two schedules for WOL: weekday core business hours (2-hour shutdown policy) and out of hours, i.e. evenings/weekends (30-minute shutdown policy).
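If it helps, the policy boils down to something like this (a minimal sketch with assumed business hours and our example thresholds, nothing built into Deadline):

```python
# Rough illustration of a two-schedule idle-shutdown policy.
from datetime import datetime

def idle_shutdown_minutes(now: datetime) -> int:
    """Minutes a slave may sit idle before power management shuts it down."""
    weekday = now.weekday() < 5          # Mon-Fri
    business_hours = 9 <= now.hour < 18  # assumed core business hours
    if weekday and business_hours:
        return 120   # 2-hour policy during the working day
    return 30        # 30-minute policy evenings/weekends

print(idle_shutdown_minutes(datetime(2011, 11, 14, 11, 0)))  # weekday daytime -> 120
print(idle_shutdown_minutes(datetime(2011, 11, 12, 23, 0)))  # weekend night  -> 30
```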
HTH,
Mike
Yep, one of my favourite features in Deadline indeed. I’ve estimated we save 25–55% on electricity this way, which is huge! Especially with the render farm slowly growing in size.
Also, I’ve just enforced the Auto Time Outs after 90% of job completion, so let’s see if it works.
I noticed it usually (though not exclusively) happens on tasks that take very little time to complete; these were around 3–5 s each (it was a simple Nuke job). So there might be something in the network, or… well, anywhere, as you pointed out. That’s why I’d really like to see Deadline moving towards a SQL architecture. I’m not much of a fan of those tons of little files living on the actual file system.
Hi,
I think after some experimentation you will probably be able to raise this to 95% or so, but time will tell.
Yep, I see similar results when we have very fast tasks, circa <10 secs per task. I don’t believe this is Deadline, but rather other inefficiencies in the software being used and/or our network getting a hammering, combined with other heavy I/O happening on multiple file servers/SAN. In this situation, simply ‘chunking’ the frames into tasks of say 5 or 10 resolves these issues. Of course, what we don’t have yet is the ability for Deadline to be AI-aware and auto-chunk or auto-de-chunk when it notices super-fast frames, or frame ranges which are very heavy and hence need to be split up further for more effective processing.
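The chunk size we pick by hand is basically just “how many frames do I need per task so the per-task overhead stops dominating”. A hypothetical helper along these lines (illustrative numbers, not a Deadline feature):

```python
# Hypothetical helper for the manual "chunking" workaround: group very fast
# frames into larger tasks so each task hits a sensible target duration.
import math

def frames_per_task(avg_frame_seconds, target_task_seconds=120,
                    min_chunk=1, max_chunk=50):
    """Pick a chunk size so each task runs roughly target_task_seconds."""
    if avg_frame_seconds <= 0:
        return min_chunk
    chunk = math.ceil(target_task_seconds / avg_frame_seconds)
    return max(min_chunk, min(chunk, max_chunk))

print(frames_per_task(4))    # ~4 s Nuke frames -> chunks of 30 frames
print(frames_per_task(300))  # 5-minute frames  -> 1 frame per task
```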
Mike
The issue is that the slave is losing track of the task. Because the slave is responsible for monitoring the render time of its tasks to determine if a timeout occurs, a task that no longer has a slave associated with it cannot time out.
When a slave is rendering a task, it will check periodically if the task file it is working on “still exists”. If it doesn’t, then under normal circumstances that means that the task has been requeued or the job has been deleted, in which case the slave should move on.
To prevent false positives (i.e., due to a disconnection from the repository), the slave will only assume the task has been requeued if it can still access the task’s job folder, and will only assume it has been deleted if it can access the “jobs” folder in the repository and the task’s job folder no longer exists.
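In pseudocode, the order of checks looks roughly like this (a sketch of the logic just described, not Deadline’s source; the paths and function name are hypothetical):

```python
# Rough sketch of the decision logic described above.
import os

def classify_missing_task(repo_root, job_id, task_file):
    """Why did the task file disappear? Returns 'requeued', 'job_deleted',
    or 'repository_unreachable' (in which case the slave should keep rendering)."""
    jobs_folder = os.path.join(repo_root, "jobs")   # hypothetical repository layout
    job_folder = os.path.join(jobs_folder, job_id)

    if os.path.exists(task_file):
        return "still_assigned"          # nothing to do
    if os.path.isdir(job_folder):
        return "requeued"                # job folder reachable, task file gone
    if os.path.isdir(jobs_folder):
        return "job_deleted"             # jobs root reachable, job folder gone
    return "repository_unreachable"      # can't see the repo at all: don't give up
```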
However, it seems like there is a possibility for a task file to not be found when it is actually there. We think this is network related, but obviously still unacceptable. As Chris mentioned in another of your threads, we have a plan to deal with these types of issues.
Just to confirm the problem you’re seeing is what I think it is, please enable slave verbose logging if it isn’t already. Then restart your slave applications so that they recognize the change immediately. The next time this happens, go to the slave machine that lost the task and find the slave log from that session (in the slave, select Help -> Explore log folder). The slave will actually print out that it can’t find the task, and will dump the contents of the job’s task folder to show what it is seeing. Please post the log and we’ll take a look.
This is becoming a really, really big problem for us too. It is particularly bad for things like tile assembly jobs where the tiles are deleted afterwards, so if the task is resubmitted it just errors out because it cannot find the tiles. We have a series of queued-up jobs that are each dependent on the last. I have also noticed it is particularly bad with faster tasks.
I am not sure if we are affected more because we predominantly deal with rendering stills using tile-based rendering. I signed up for the beta thinking it might fix the problem, but it looks like the beta will expire in early December, so it is not a good solution for us and I did not go ahead with it. I seem to be doing more and more babysitting of renders, releasing pending tasks, etc., just to make sure jobs are completed.
Have people been able to fully fix this, or are there just workarounds for it? It seems like such an odd thing for a task not to report that it is complete.
The beta licenses expire at the end of December because Deadline 5.1 will be released mid-December. As long as you are on active subscription, you will be entitled to a permanent 5.1 license as soon as it is released. This task issue has been addressed in 5.1, so we highly recommend using this version.