hanging frames 5.1.0.46398

We upgraded last week to the latest build and are now having chronic frame hanging issues

here is a sample log file from a frame that is hanging
Thanks,

=======================================================
Log Message

0: Plugin already loaded: 3dsmax
0: Task timeout is disabled.
0: Job already loaded: 002_Tac_map retimed low res
0: Plugin rendering frame(s): 451
0: INFO: Render Tasks called
0: INFO: Lightning: Render frame 451
0: INFO: Lightning: Overriding save file option to 1
0: INFO: Lightning: Rendering frame L:\CONOPS_Visualization\2011_NWS_Re-Branding Efforts\Rebranding Video\02_Prod\002_Tac_Map\renders\TacMap_Retimed\002_Tacmap_retimed_0451.png
0: INFO: Lightning: Rendering 0 render elements
0: INFO: Lightning: Setting up render parameters
0: INFO: Lightning: Rendering camera node CAM_002_tacmap_Move_Extended
0: INFO: Lightning: Checking multipass info for frame 451
0: INFO: Lightning: Opening renderer
0: INFO: Lightning: Multipass disabled
0: INFO: Lightning: Rendering 1 passes
0: INFO: Lightning: Checking output paths
0: INFO: Lightning: Checking default actions
0: INFO: Lightning: TYPE_MISSING_EXTERNAL_FILES = IGNORE
0: INFO: Lightning: TYPE_MISSING_UVWS = IGNORE
0: INFO: Lightning: TYPE_MISSING_DLL_FILES = IGNORE
0: INFO: Lightning: TYPE_MISSING_XREF_FILES = IGNORE
0: INFO: Lightning: Calling renderer
0: INFO: Translating Scene…
0: INFO: Translating Object City_Main_00
0: INFO: Translating Scene…
0: INFO: Translating Object Airbus_A330-300_KLM_A_81
0: INFO: Translating Scene…
0: INFO: Translating Object Airbus_A330-300_KLM_A_32
0: INFO: Translating Scene…
0: INFO: Translating Object City_Main_10
0: INFO: Translating Scene…
0: INFO: Translating Object City_Main_97
0: INFO: Translating Scene…
0: INFO: Translating Object City_Main_71
0: INFO: Translating Scene…
0: INFO: Translating Object City_Main_03
0: INFO: Translating Scene…
0: INFO: Translating Object Airbus_A330-300_KLM_A_76
0: INFO: Translating Scene…
0: INFO: Translating Object City_Main_22
0: INFO: Translating Scene…
0: INFO: Translating Object Airbus_A330-300_KLM_A_74
0: INFO: Translating Scene…
0: INFO: Translating Object Airbus_A330-300_KLM_A_14
0: INFO: Translating Scene…
0: INFO: Translating Object City_Main_58
0: INFO: Translating Scene…
0: INFO: Translating Object City_Main_14
0: INFO: Translating Scene…
0: INFO: Translating Object City_Main_98
0: INFO: Translating Scene…
0: INFO: Translating Object City_Main_83
0: INFO: Translating Scene…
0: INFO: Translating Object Main_Terrain
0: INFO: Translating Scene…
0: INFO: Translating Object Airbus_A330-300_KLM_A_50
0: INFO: Translating Scene…
0: INFO: Translating Object City_Main_62
0: INFO: Translating Scene…
0: INFO: Translating Object City_Main_107
0: INFO: Translating Scene…
0: INFO: Translating Object City_Main_27
0: INFO: Translating Scene…
0: INFO: Translating Object Airbus_A330-300_KLM_A_80
0: INFO: Translating Scene…
0: INFO: Translating Object City_Main_101
0: INFO: Translating Scene…
0: INFO: Translating Object City_Main_104
0: INFO: Translating Scene…
0: INFO: Rendering Frame…
0: INFO: Rendering…
0: INFO: 0.1% Computing Final Gather Points
0: INFO: 0.2% Computing Final Gather Points
0: INFO: 0.3% Computing Final Gather Points
0: INFO: 0.4% Computing Final Gather Points
0: INFO: 0.5% Computing Final Gather Points
0: INFO: 0.6% Computing Final Gather Points
0: INFO: 0.7% Computing Final Gather Points
0: INFO: 0.8% Computing Final Gather Points
0: INFO: 0.9% Computing Final Gather Points
0: INFO: 1.0% Computing Final Gather Points
0: INFO: 1.1% Computing Final Gather Points
0: INFO: 1.1% Computing Final Gather Points
0: INFO: 1.2% Computing Final Gather Points
0: INFO: 1.3% Computing Final Gather Points
0: INFO: 1.4% Computing Final Gather Points
0: INFO: 1.5% Computing Final Gather Points
0: INFO: 1.6% Computing Final Gather Points
0: INFO: 1.7% Computing Final Gather Points
0: INFO: 1.8% Computing Final Gather Points
0: INFO: 1.9% Computing Final Gather Points
0: INFO: 2.0% Computing Final Gather Points
0: In the process of canceling current task: ignoring exception thrown by PluginLoader
0: Unloading plugin: 3dsmax

=======================================================
Log Details

Log Date/Time = Mar 12/12 12:13:13
Frames = 451-451

Slave Machine = Ttuc-3d36
Slave Version = v5.1.0.46398 R

Plugin Name = 3dsmax

Some things to check to help narrow down the problem:

  1. Is the frame hanging, or is it just taking a very long time to render? If you remote into the slave, is it slowly printing out these final gather messages? Is there any activity in the 3dsmax.exe process in Task Manager?
  2. Do you see this behavior with all Max jobs, or just specific ones? If you submit a simple test scene, does the problem still occur?
  3. Do all tasks for a job hang, or is it just the later tasks for the job?
  4. Do the same tasks hang reliably, or is it completely random?

Cheers,

  • Ryan
  1. Is the frame hanging, or is it just taking a very long time to render? If you remote into the slave, is it slowly printing out these final gather messages? Is there any activity in the 3dsmax.exe process in Task Manager?
    It seems to just be taking a very long time to render, around 25 times as long as the other frames which have identical content.

  2. Do you see this behavior with all Max jobs, or just specific ones? If you submit a simple test scene, does the problem still occur?
    It does not happen with every job, but seems to be happening more often than others. Sorry this isn’t more defined.

  3. Do all tasks for a job hang, or is it just the later tasks for the job?
    not all tasks hang, and it doesn’t seem to be more frequently happening at any given point in the job, it appears random.

  4. Do the same tasks hang reliably, or is it completely random?
    If we re queue the tasks they will render in the same amount of time as the rest of the job


more things we noticed…

both active jobs claim to be rendering on one slave node simultaneously. One is the job that “hanged” the other is a job submited after.

Based on your last post, it sounds like the task isn’t hanging, but is being dropped by the slave. What we need is a log from the slave from the session where it says it’s working on 2 tasks from different jobs. To find the local slave logs, go to a slave UI and select Help -> Explore Log Folder. Use the Date Modified property of the file, or the timestamps in the slave logs, to find the log from this session. Send it to us and we’ll take a look.

Thanks!

  • Ryan

This issue was happening yesterday (3-12-2012) and there are not log files from that entire day.
This issue was not just happening on this one slave, but this slave was the one I captured the images from.

Here is a zip file with all the txt documents in the log folder for the dates surrounding this issue. I apologize for all the data, but I just dont know what I am looking for.
3d17 logs.zip (2.48 MB)

Thanks for the logs! We found where the Task from frame 147 was lost, but it’s not obvious why it happened. We walked though out code with the debugger to see if we could reproduce, but we couldn’t, and we can’t see any reason why the task would get lost at that particular moment.

One thing we noticed was that the slave was trying to connect to Pulse, but is unable to. Is Pulse running on your farm? If it is, it may be possible that some slaves can connect to it and others can’t. I can’t tell if this would cause the problem or not, but if Pulse is serving some slaves, while others look for tasks themselves, maybe it’s possible that Pulse moves a task file when a slave that is manually searching isn’t expecting it to.

I think I may have inadvertently fixed this yesterday.

here is what I think may have happened

I run Pulse on our repository machine
pulse was killed when i installed the new repository version
I restarted the slaves on all our nodes to have them grab the new version of deadline.
I did not restart deadline launcher on the repository as I forgot that pulse was killed
all these issues happened in between that time when I upgraded deadline and restarted pulse

does this shed any light on the symptoms?

That’s interesting. It should work fine without Pulse running (and that’s how we were testing it here), but maybe the farm was in some sort of inconsistent state and running Pulse “fixed” things. I’m glad to hear that things have returned to normal, but definitely keep an eye out for this issue and let us know if it comes up again.

Cheers,

  • Ryan