Slaves Hanging


#1

I occasionally run into a situation where Fusion (4) tasks seem to hang, requiring me to requeue them to resolve it. They hang for hours but, once requeued, render out in minutes. I was curious whether anyone else is running into this problem and whether the new version addresses it.



David Miller


#2

I occasionally run into a situation where Fusion (4) tasks seem to hang,
requiring me to requeue them to resolve it. They hang for hours but,
once requeued, render out in minutes. I was curious whether anyone else is
running into this problem and whether the new version addresses it.

Is the hanging due to a crash? Can you VNC into the machine while it’s
“taking a rest”?

We had occasional crashes with DF, and setting a default slave timeout
took care of that. DF script is really touchy and tends to flip out
every now and then. Deadline can’t catch all the crashes, so a master
timeout is usually a good solution, especially with DF renders, where
you can’t be surprised by a multi-hour frame on occasion (like
with mental ray & motion blur) :wink:

cheers,
laszlo


#3

The machine itself is responding fine. I have thought about using the task timeout, but it seems a bit messy, especially when the same flow can contain tasks that take minutes and tasks that take hours. Is there anything I can do to help the development team catch the Fusion errors that cause this problem?



David Miller


#4

A task timeout is messy because it is an arbitrary number, but it may be possible to change the semantics of the slave so that if Fusion hasn’t consumed any CPU time (or its memory usage hasn’t changed) for some inordinate amount of time, the task is killed and the frame requeued. This of course will not work if the hang is an infinite loop (constantly chewing CPU time in a loop while doing nothing productive) rather than a deadlock (i.e. doing nothing until an event happens that never comes).
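To make the idea concrete, here is a rough sketch of the idle check in Python. It is not how Deadline works; the `Sample` type, the function names, and the 600-second default are all made up for illustration. It assumes something else on the slave (for example a small script using psutil) records the render process's cumulative CPU time and memory usage at intervals. This version only counts a period as idle when both CPU time and memory are unchanged, which is one possible reading of the rule.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    """One hypothetical reading of the render process."""
    wall_time: float  # seconds since monitoring began
    cpu_time: float   # cumulative CPU seconds used by the process
    memory: int       # memory usage in bytes


def idle_seconds(samples: Sequence[Sample], cpu_epsilon: float = 0.01) -> float:
    """Length of the trailing period with no CPU or memory change."""
    if not samples:
        return 0.0
    last = samples[-1]
    idle_start = last.wall_time
    # Walk backwards until a sample differs from the most recent one.
    for sample in reversed(samples[:-1]):
        cpu_moved = abs(last.cpu_time - sample.cpu_time) > cpu_epsilon
        if cpu_moved or sample.memory != last.memory:
            break
        idle_start = sample.wall_time
    return last.wall_time - idle_start


def should_requeue(samples: Sequence[Sample], idle_limit: float = 600.0) -> bool:
    """True if the process has shown no activity for at least idle_limit seconds."""
    return idle_seconds(samples) >= idle_limit


# Example: the process works for two minutes, then goes quiet.
history = [
    Sample(0, 0.0, 100),
    Sample(60, 30.0, 200),
    Sample(120, 55.0, 300),
    Sample(180, 55.0, 300),
    Sample(240, 55.0, 300),
    Sample(300, 55.0, 300),
]
```

As noted, a check like this would only catch the deadlock case. A process spinning in a loop keeps accumulating CPU time, so it never looks idle.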



You can tell what kind of ‘hang’ it is by opening Task Manager on the particular slave. Go to the Processes tab and add a few columns (under the View menu); I would suggest ‘CPU Time’ and ‘Memory Usage Delta’. Watch those numbers for a few minutes. If CPU Time doesn’t increase, you probably have a ‘lock’ based bug. If ‘Memory Usage Delta’ is not zero, something is happening (i.e. memory is being allocated and/or freed) and it could be an infinite loop.
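If you would rather jot the numbers down and let a script decide, the same rule of thumb can be written as a few lines of Python. This is only a sketch: `classify_hang` is a made-up name, and the inputs are assumed to be readings of the CPU Time column (in seconds) and the memory usage taken a minute or so apart.

```python
from typing import Sequence


def classify_hang(cpu_times: Sequence[float], memory_usages: Sequence[int]) -> str:
    """Apply the Task Manager rule of thumb to a series of readings."""
    if len(cpu_times) < 2 or len(cpu_times) != len(memory_usages):
        raise ValueError("need at least two paired readings")
    # CPU Time is cumulative, so any growth means the process is still running code.
    cpu_increased = cpu_times[-1] > cpu_times[0]
    # A nonzero memory delta between consecutive readings means allocation activity.
    memory_changed = any(
        later != earlier
        for earlier, later in zip(memory_usages, memory_usages[1:])
    )
    if cpu_increased or memory_changed:
        return "possible infinite loop"
    return "probable lock"
```

For example, `classify_hang([12.0, 12.0, 12.0], [500, 500, 500])` reports a probable lock, while readings whose CPU time keeps climbing report a possible infinite loop.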


#5

Sorry, I just got back to the board today. Next time I see the problem, I’ll take a look at those things and let you know.





David