This is pretty bad, as it won’t allow the entire job to be released, and thus the concurrent jobs won’t start.
I don’t really see any reason why Deadline should report that the tasks are still running.
I even had Pulse shutting down slaves after the predefined idle time (via power options), but a job was still reporting some of the machines as rendering!
Hi Lukas,
Hanging tasks/frames is just the nature of network processing. Any one of a million variables could be causing the ‘hang’.
Are you using “auto task timeout”? If not, it’s awesome for solving these little problems. You can wire this functionality up to be enabled by default for particular plugin types, then globally control the settings, such as a time multiplier of x3, and only have it kick in when the job is at 90% completion. We never get hung frames overnight anymore! That said, I’d urge you to always check the re-queue log reports for excessive re-queuing, as that means something else isn’t quite right with whatever plugin/job you may be processing.
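To make the idea concrete, here’s a rough, illustrative sketch of the heuristic I mean (this is not Deadline’s actual code or API; the function name and numbers are just example values):

```python
# Illustrative sketch only -- not Deadline's implementation.
# Once a job is ~90% complete, treat any task running longer than
# (multiplier x the average completed-task time) as hung.

from statistics import mean

def auto_task_timeout_minutes(completed_task_minutes, completion_ratio,
                              multiplier=3.0, min_completion=0.90):
    """Return a timeout in minutes, or None if the rule shouldn't apply yet."""
    if completion_ratio < min_completion or not completed_task_minutes:
        return None  # not enough of the job finished to trust the average
    return multiplier * mean(completed_task_minutes)

# Example: 9 of 10 tasks done, averaging ~4 minutes each -> ~12 minute timeout.
print(auto_task_timeout_minutes([3.5, 4.0, 4.5, 4.0, 3.8, 4.2, 4.1, 3.9, 4.0], 0.9))
```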
Finally, I don’t really like the pre-defined slave restart/shutdown schedule solution, as it introduces inefficiencies for network processing, but I totally understand why you are using it to solve the above-mentioned issues… I’m just wondering if there’s a better way to get the same result.
Mike
Thanks for the tips, I’ll check the auto task timeout options.
But, what do you mean by “pre-defined slave restart/shutdown schedule”?
I use Power Management to shut down idle slaves (after 3 hours) to save power. Nothing else. I don’t use it to “solve” any issues in Deadline, only to save power, as my render farm isn’t running at 100% all the time.
Cool.
Ah, sorry, my misunderstanding.
I thought you were forcibly restarting your farm at set times to ensure you don’t get any ‘stuck frames’, which of course could mean restarting a machine while it’s processing a job and wasting time.
Yeah, power management rocks. WOL has single-handedly saved us big bucks over the years. We like to have two schedules for WOL: weekday core business hours (2-hour shutdown policy) and out of hours, i.e. evenings/weekends (30-minute shutdown policy).
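If it helps, the policy boils down to something like this (a minimal sketch with assumed business hours and our example thresholds, nothing built into Deadline):

```python
# Rough illustration of a two-schedule idle-shutdown policy.
from datetime import datetime

def idle_shutdown_minutes(now: datetime) -> int:
    """Minutes a slave may sit idle before power management shuts it down."""
    weekday = now.weekday() < 5          # Mon-Fri
    business_hours = 9 <= now.hour < 18  # assumed core business hours
    if weekday and business_hours:
        return 120   # 2-hour policy during the working day
    return 30        # 30-minute policy evenings/weekends

print(idle_shutdown_minutes(datetime(2011, 11, 14, 11, 0)))  # weekday daytime -> 120
print(idle_shutdown_minutes(datetime(2011, 11, 12, 23, 0)))  # weekend night  -> 30
```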
HTH,
Mike
Yep, one of my favourite features in Deadline indeed. I’ve estimated we save 25–55% on electricity this way, which is huge! Especially with the render farm slowly growing in size.
Also, I’ve just enforced the Auto Time Outs after 90% of job completion, so let’s see if it works.
I noticed it usually (though not exclusively) happens on tasks that take very little time to complete; these were around 3–5 s each (it was a simple Nuke job). So there might be something in the network, or… well, anywhere, as you pointed out. That’s why I’d really like to see Deadline moving towards a SQL architecture. I’m not much of a fan of those tons of little files living on the actual file system.
Hi,
I think after some experimentation you will probably be able to raise this to 95% or so, but time will tell.
Yep, I see similar results when we have very fast tasks, circa <10 secs per task. I don’t believe this is Deadline, but rather other inefficiencies in the software being used and/or our network getting a hammering, combined with other heavy I/O happening on multiple file servers/SAN. In this situation, simply ‘chunking’ the frames into tasks of say 5 or 10 resolves these issues. Of course, what we don’t have yet is the ability for Deadline to be AI-aware and auto-chunk or auto-de-chunk when it notices super-fast frames, or frame ranges which are very heavy and hence need to be split up further for more effective processing.
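The chunk size we pick by hand is basically just “how many frames do I need per task so the per-task overhead stops dominating”. A hypothetical helper along these lines (illustrative numbers, not a Deadline feature):

```python
# Hypothetical helper for the manual "chunking" workaround: group very fast
# frames into larger tasks so each task hits a sensible target duration.
import math

def frames_per_task(avg_frame_seconds, target_task_seconds=120,
                    min_chunk=1, max_chunk=50):
    """Pick a chunk size so each task runs roughly target_task_seconds."""
    if avg_frame_seconds <= 0:
        return min_chunk
    chunk = math.ceil(target_task_seconds / avg_frame_seconds)
    return max(min_chunk, min(chunk, max_chunk))

print(frames_per_task(4))    # ~4 s Nuke frames -> chunks of 30 frames
print(frames_per_task(300))  # 5-minute frames  -> 1 frame per task
```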
Mike
The issue is that the slave is losing track of the task. Because the slave is responsible for monitoring the render time of its tasks to determine if a timeout occurs, a task that no longer has a slave associated with it cannot time out.
When a slave is rendering a task, it will check periodically if the task file it is working on “still exists”. If it doesn’t, then under normal circumstances that means that the task has been requeued or the job has been deleted, in which case the slave should move on.
To prevent false positives (i.e., due to a disconnection from the repository), the slave will only assume the task has been requeued if it can still access the task’s job folder, and will only assume it has been deleted if it can access the “jobs” folder in the repository and the task’s job folder no longer exists.
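In pseudocode, the order of checks looks roughly like this (a sketch of the logic just described, not Deadline’s source; the paths and function name are hypothetical):

```python
# Rough sketch of the decision logic described above.
import os

def classify_missing_task(repo_root, job_id, task_file):
    """Why did the task file disappear? Returns 'requeued', 'job_deleted',
    or 'repository_unreachable' (in which case the slave should keep rendering)."""
    jobs_folder = os.path.join(repo_root, "jobs")   # hypothetical repository layout
    job_folder = os.path.join(jobs_folder, job_id)

    if os.path.exists(task_file):
        return "still_assigned"          # nothing to do
    if os.path.isdir(job_folder):
        return "requeued"                # job folder reachable, task file gone
    if os.path.isdir(jobs_folder):
        return "job_deleted"             # jobs root reachable, job folder gone
    return "repository_unreachable"      # can't see the repo at all: don't give up
```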
However, it seems like there is a possibility for a task file to not be found when it is actually there. We think this is network related, but obviously still unacceptable. As Chris mentioned in another of your threads, we have a plan to deal with these types of issues.
Just to confirm the problem you’re seeing is what I think it is, please enable slave verbose logging if it isn’t already. Then restart your slave applications so that they recognize the change immediately. The next time this happens, go to the slave machine that lost the task and find the slave log from that session (in the slave, select Help -> Explore log folder). The slave will actually print out that it can’t find the task, and will dump the contents of the job’s task folder to show what it is seeing. Please post the log and we’ll take a look.
This is becoming a really, really big problem for us too. It is particularly bad for things like tile assembly jobs where the tiles are deleted afterwards, so if the task is resubmitted it just errors out because it cannot find the tiles. We have a series of queued-up jobs that are each dependent on the last. I have also noticed it is particularly bad with faster tasks.
I am not sure if we are affected more because we predominantly deal with rendering stills using tile-based rendering. I signed up for the beta thinking it might fix the problem, but it looks like the beta will expire in early December, so it is not a good solution for us and I did not go ahead with it. I seem to be doing more and more babysitting of renders, releasing pending tasks, etc., just to make sure jobs are completed.
Have people been able to fully fix this, or are there just workarounds for it? It seems like such an odd thing for a task not to report that it is complete.
The beta licenses expire at the end of December because Deadline 5.1 will be released mid-December. As long as you are on active subscription, you will be entitled to a permanent 5.1 license as soon as it is released. This task issue has been addressed in 5.1, so we highly recommend using this version.