Stalled Slave

My slaves keep stalling and I cant see to figure out why

Below is the report and I have put some comments in bold red

this isn’t the only slave doing this, all of the 64 bit nodes on our farm are experiencing this.

thanks for your help

STALLED SLAVE REPORT

Current House Cleaner Information
MachineName: Ttuc-rc45 <-- this machine is in a different pool, why is this being shown here?
Version: v3.1.0.36430 R

slaveMachineName: Ttuc-3d21 <-- this is the correct machine name for the slave that stalled
slaveInfo.UpdateDateTime: 2009-09-05 06:58:39
DateTime.Now: 2009-09-05 07:09:35
Time Difference: 10.941 m

Job Name: SDB_MMW_Shot
Task Names: 444
Task Ids: 444

searching for job with id “001_075_999_708a99c2”
Found possible job: SDB_MMW_Shot
searching for task with id “444”
Found possible task: :[444-444]
task.SlaveMachineName: Ttuc-3d21
slave machine names match, stop searching
Associated Job Found: SDB_MMW_Shot
Associated Task Found: :[444-444]
probableStalledTask.SlaveMachineName: Ttuc-3d21
task is still rendering, attempting to fix situation.
requeuing task
slaveInfo.SlaveStatus set to Offline
slaveInfo.UpdateDateTime set to now

slaveInfo updated.

A few questions:

  1. Are the Slave applications running normally on these machines despite being marked as Stalled in the Monitor, or are they hanging/crashing?
  2. Are the Slave applications having any problems connecting to the repository? Maybe check the slave log window for any error messages relating to problems accessing the repository paths.
  3. Which OS do you have the repository installed on?
  4. Is there a time difference between the problematic machines and the other machines in your farm (including the repository)?

This is the machine that detected that Ttuc-3d21 was stalled. Perhaps we have better log messages here to indicate this.

Cheers,

  • Ryan
  1. Are the Slave applications running normally on these machines despite being marked as Stalled in the Monitor, or are they hanging/crashing?
    they are crashing and the frames are not saved, then the machine sends a stalled slave email report

  2. Are the Slave applications having any problems connecting to the repository? Maybe check the slave log window for any error messages relating to problems accessing the repository paths.
    Here is a slave error log:
    ocal Settings/Application Data/Frantic Films/Deadline/slave/jobsData/MCWS Intro 08.max
    2009/09/08 07:51:55 WRN: MAXScript Callback script Exception: – Runtime error: send: error trying to send data: WSAError code 10054
    2009/09/08 07:51:55 INF: Starting network rendering
    2009/09/08 07:51:55 INF: Max is ready to begin render
    2009/09/08 07:51:56 INF: Server is not responding
    2009/09/08 07:52:29 DBG: Stop network
    2009/09/08 07:52:29 ERR: mental ray Fatal Error: DB 0.0 fatal 041052: accessing unknown tag 0x8bc
    ) (Deadline.Plugins.ScriptPlugin+FailRenderException) (Deadline.Plugins.RenderPluginException)
    at Deadline.Plugins.ScriptPlugin.RenderTasks(Int32 startFrame, Int32 endFrame, String& outMessage) (Deadline.Plugins.RenderPluginException)
    0: Unloading plugin: 3dsmax
    Scheduler Thread - Render Thread 0 threw an error:
    Scheduler Thread - Exception during render: An error occurred in RenderTasks(): RenderTask: Unexpected exception (Monitored managed process “3dsmaxProcess” has exited or been terminated.
    2009/09/08 06:04:58 INF: Loaded C:/Documents and Settings/vab4661/Local Settings/Application Data/Frantic Films/Deadline/slave/jobsData/MCWS Intro 08.max
    2009/09/08 06:05:02 INF: Job: C:/Documents and Settings/vab4661/Local Settings/Application Data/Frantic Films/Deadline/slave/jobsData/MCWS Intro 08.max
    2009/09/08 07:51:55 WRN: MAXScript Callback script Exception: – Runtime error: send: error trying to send data: WSAError code 10054
    2009/09/08 07:51:55 INF: Starting network rendering
    2009/09/08 07:51:55 INF: Max is ready to begin render
    2009/09/08 07:51:56 INF: Server is not responding
    2009/09/08 07:52:29 DBG: Stop network
    2009/09/08 07:52:29 ERR: mental ray Fatal Error: DB 0.0 fatal 041052: accessing unknown tag 0x8bc
    ) (Deadline.Plugins.ScriptPlugin+FailRenderException) (Deadline.Plugins.RenderPluginException)
    at Deadline.Plugins.ScriptPlugin.RenderTasks(Int32 startFrame, Int32 endFrame, String& outMessage)
    Error Type
    RenderPluginException
    Error Stack Trace
    at Deadline.Plugins.Plugin.RenderTask(Int32 startFrame, Int32 endFrame)
    at Deadline.Slaves.SlaveRenderThread.RenderCurrentTask()

  3. Which OS do you have the repository installed on?
    Windows Server 2003

  4. Is there a time difference between the problematic machines and the other machines in your farm (including the repository)?
    All the machines are exhibiting this behavior.

Thanks for the info. So to confirm, when you go to the slave after it has been marked Stalled, the slave application isn’t running anymore? Or is it just the renders themselves that are crashing? From the error message, it doesn’t look like the slave itself is crashing, so I just want to make sure that’s not the case. Although if the slave itself isn’t crashing, I’m not sure why they would appear as stalled…

Also, this error message seems to indicate a problem with mental ray:

Do you get these errors when rendering with scanline or another renderer? Also, are you able to open Max normally on your slaves and render the same scene file locally? Just trying to narrow down the problem.

Cheers,

  • Ryan

sorry about the delayed response, I thought the problem corrected itself but it is back.

The problem is not isolated to mental ray,
most of the time the slave stays running, but the frames dont finish and there dont seem to be errors.

Our Max licenses dont allow us to run the program on the slaves as they are installed as render only.
below are a log and error report from a job causing this behavior

Here is a sample log

Log Message
0: Task timeout is disabled.
0: Loaded job: SDBII - Inverted Main Shot - Dome No MB (001_088_999_0cff9a7d)
0: INFO: StartJob: initializing script plugin 3dsmax
0: INFO: About: 3dsmax Plugin for Deadline
0: INFO: Start Job called - starting up 3dsmax plugin
0: INFO: Rendering with 3dsmax version: 2010
0: INFO: Build of 3dsmax to force: 64bit
0: INFO: Rendering with executable: C:\Program Files\Autodesk\3ds Max 2010\3dsmax.exe
0: INFO: Fail on existing 3dsmax process: 0
0: INFO: Load 3dsmax timeout: 18000 seconds
0: INFO: Start job timeout: 18000 seconds
0: INFO: Progress update timeout: 8000 seconds
0: INFO: Progress update timout disabled: 0
0: INFO: Slave mode enabled: 1
0: INFO: Silent mode enabled: 0
0: INFO: Local rendering enabled: 1
0: INFO: Strict 3dsmax check enabled: 1
0: INFO: Verifying installation of 3dsmax 2010 at C:\Program Files\Autodesk\3ds Max 2010
0: INFO: Version of core.dll: 12.1.0.310
0: INFO: Ensuring 3dsmax install is valid
0: INFO: Running 3dsmaxcmd workaround
0: INFO: 3dsmaxcmd.exe returned: P9/29/2009 9:14:08 AM; Error opening scene file: “C:\Program Files\Autodesk\3ds Max 2010”
0: INFO: 3dsmax start up file: C:\Documents and Settings\vab4661\Local Settings\Application Data\Frantic Films\Deadline\slave\plugins\deadlineStartupMax2010.max
0: INFO: Using user profiles: 1
0: INFO: Using 3dsmax ini file: C:\Documents and Settings\vab4661\Local Settings\Application Data\Autodesk\3dsmax\2010 - 64bit\enu\3dsmax.ini
0: INFO: Network log file: C:\Documents and Settings\vab4661\Local Settings\Application Data\Autodesk\3dsmax\2010 - 64bit\enu\Network\Max.log
0: INFO: Plugin ini file: C:\Program Files\Autodesk\3ds Max 2010\plugin.ini
0: INFO: Not including user profile plugin ini because it does not exist: C:\Documents and Settings\vab4661\Local Settings\Application Data\Autodesk\3dsmax\2010 - 64bit\enu\Plugin.UserSettings.ini
0: INFO: Lightning connection plugin: C:\Documents and Settings\vab4661\Local Settings\Application Data\Frantic Films\Deadline\slave\plugins\lightning64Max2010.dlx
0: INFO: 3dsmax socket connection port: 1666
0: INFO: Creating startup script: C:\Documents and Settings\vab4661\Local Settings\Temp\lightning_startup.ms
0: INFO: Copying C:\Documents and Settings\vab4661\Local Settings\Application Data\Frantic Films\Deadline\slave\plugins\lightning64Max2010.dlx to C:\Documents and Settings\vab4661\Local Settings\Application Data\Frantic Films\Deadline\temp\lightning\lightning.dlx
0: INFO: Copying dl.ini to C:\Documents and Settings\vab4661\Local Settings\Application Data\Autodesk\3dsmax\2010 - 64bit\enu to workaround a problem with version 2010
0: INFO: If this fails, make sure that the necessary permissions are set on this folder to allow for this copy to take place
0: INFO: Starting monitored managed process 3dsmaxProcess
0: INFO: Ignoring popup windows with titles matching regex “.Brazil Console.
0: INFO: Ignoring popup windows with titles matching regex “.MSP Acceleration.
0: INFO: Ignoring popup windows with titles matching regex “.FumeFX:.
0: INFO: Ignoring popup windows with titles matching regex “.FumeFX Dynamics:.
0: INFO: Ignoring popup windows with titles matching regex “.Maxwell Translation Window.
0: INFO: Ignoring popup windows with titles matching regex “.New updates are available - Craft Director Tools.
0: INFO: Handling popup windows with titles matching regex “.VRay authorization.” by pressing “Cancel”
0: INFO: Handling popup windows with titles matching regex “.V-Ray warning.” by pressing “OK”
0: INFO: Handling popup windows with titles matching regex “.File Load: Units Mismatch.” by pressing “Adopt the File’s Unit Scale?;OK”
0: INFO: Handling popup windows with titles matching regex “.Frame Rate Change.” by pressing “OK”
0: INFO: Handling popup windows with titles matching regex “.MAXScript Auto-load Script Error.” by pressing “OK”
0: INFO: Handling popup windows with titles matching regex “.Render history settings.” by pressing “OK;No”
0: INFO: Handling popup windows with titles matching regex “.Warning - the software has encountered a problem.” by pressing “Don’t show me this error again;Continue”
0: INFO: Handling popup windows with titles matching regex “.Frantic Films FPS Watchdog.” by pressing “OK”
0: INFO: Handling popup windows with titles matching regex “.Missing Dlls.” by pressing “Cancel”
0: INFO: Handling popup windows with titles matching regex “Brazil r/s Rio Warning” by pressing “OK”
0: INFO: Handling popup windows with titles matching regex “3D Studio MAX” by pressing “OK”
0: INFO: Handling popup windows with titles matching regex “.New updates are available - Craft Director Tools.” by pressing “Cancel”
0: INFO: Handling popup windows with titles matching regex “.Pop-up Note.” by pressing “OK”
0: INFO: Handling popup windows with titles matching regex “.Tablet Version Mismatch.” by pressing “OK”
0: INFO: Handling popup windows with titles matching regex “.Tablet Driver.” by pressing “OK”
0: INFO: Handling popup windows with titles matching regex “.Gather error.” by pressing “OK”
0: INFO: Handling popup windows with titles matching regex “.Image I/O Error.” by pressing “Retry”
0: INFO: Handling popup windows with titles matching regex “.Important nPower Plug-in Messages.” by pressing “OK”
0: INFO: Handling popup windows with titles matching regex “.glu3D.” by pressing “OK”
0: INFO: Handling popup windows with titles matching regex “.glu3D Warning!.” by pressing “OK”
0: INFO: Handling popup windows with titles matching regex “.Bitmap Filter Error.” by pressing “OK”
0: INFO: Handling popup windows with titles matching regex “.Maxwell Plug-in Update Notification.” by pressing “Don’t notify me about this version automatically;Close”
0: INFO: Handling popup windows with titles matching regex “.RealFlow Plug-in Update Notification.” by pressing “Don’t notify me about this version automatically;Close”
0: INFO: Handling popup windows with titles matching regex “.Learning Movies.” by pressing “Show this dialog at startup;Close”
0: INFO: Handling popup windows with titles matching regex “.Obsolete File.” by pressing “Don’t display this message.;OK”
0: INFO: Stdout Handling Enabled: False
0: INFO: Popup Handling Enabled: True
0: INFO: Using Process Tree: True
0: INFO: Hiding DOS Window: False
0: INFO: Creating New Console: False
0: INFO: Render Executable: “C:\Program Files\Autodesk\3ds Max 2010\3dsmax.exe”
0: INFO: Render Argument: -p “dl.ini” -q -s “C:\Documents and Settings\vab4661\Local Settings\Application Data\Frantic Films\Deadline\slave\plugins\deadlineStartupMax2010.max”
0: INFO: Startup Directory: “C:\Program Files\Autodesk\3ds Max 2010”
0: INFO: Process Priority: High
0: INFO: Process is now running
0: INFO: Waiting for connection from 3dsmax
0: INFO: Connected to 3dsmax plugin version Lightning 3.1.0.36381 Jun 11 2009 11:14:37 R
0: INFO: Scene file to render: C:\Documents and Settings\vab4661\Local Settings\Application Data\Frantic Films\Deadline\slave\jobsData\NEW SDBII_FALLING DOWNWARD_INVERTED__SDBII ONLY__08.max
0: INFO: Camera: Cam - F15
0: INFO: Restarting 3dsmax after each frame: 0
0: INFO: Showing frame buffer: 1
0: INFO: Overriding render output: R:\01_PROJECTS_CURRENT\SDBII_2009\02_Prod SAL Attack and Immeadiate Attack (f-15)\31_F15_Follow_Bomb_INVERTED\Max\Renders\06_REVISIONS\06_NEW_SDBII\06_NEW_SDBII_.tga
0: INFO: Overriding save file option: True
0: INFO: Frame number base: 0
0: INFO: Remove padding from output filename: 0
0: INFO: Ignore missing external file errors: 1
0: INFO: Ignore missing UVW errors: 1
0: INFO: Ignore missing XREF errors: 1
0: INFO: Ignore missing DLL errors: 1
0: INFO: Disabling Multipass: 0
0: INFO: Loading 3dsmax scene file
0: INFO: [filename]
[format]
[size] 640x480
[aspect] 1

0: INFO: Executing script: C:\Documents and Settings\vab4661\Local Settings\Application Data\Frantic Films\Deadline\slave\plugins\customize.ms
0: INFO: Applying customizations to max file…
0: INFO: >General Renderer Settings:
0: INFO: +Render Width set to 720
0: INFO: +Render Height set to 480
0: INFO: +Pixel Aspect set to 1.2
0: INFO: +Atmospherics set to true
0: INFO: +Effects set to true
0: INFO: +Displacements set to true
0: INFO: +Area Lights/Shadows As Points set to false
0: INFO: +Force 2-Sided set to false
0: INFO: +Render Hidden Geometry set to false
0: INFO: +Video Color Check set to false
0: INFO: +Super Black set to false
0: INFO: +Render To Fields set to false
0: INFO: +Skip Existing Frames set to false
0: INFO: >Render Environment Settings:
0: INFO: +Background Color set to (color 0 0 0)
0: INFO: +Use Environment Map set to false
0: INFO: +Ambient Color set to (color 0 0 0)
0: INFO: +Global Light Tint Color set to (color 255 255 255)
0: INFO: +Global Light Level set to 1.0
0: INFO: >Default Scanline Renderer detected as current renderer.
0: INFO: +Antialias Filter set to Catmull_Rom
0: INFO: +Mapping set to true
0: INFO: +Shadows set to true
0: INFO: +Auto-Reflect/Reflect and Mirrors set to true
0: INFO: +Auto-Reflect Rendering Iterations set to 1
0: INFO: +Force Wireframe set to false
0: INFO: +Wireframe Thickness set to 1.0
0: INFO: +Antialiasing set to true
0: INFO: +Antialiasing Filter Size set to 4.0
0: INFO: +Filter Maps set to true
0: INFO: +Enable Global Sampler set to true
0: INFO: +Global Sampler Class set to Hammersley
0: INFO: +Global Sampler Supersample Maps set to true
0: INFO: +Global Sampler Quality set to 0.5
0: INFO: +Global Sampler Adaptive set to true
0: INFO: +Global Sampler Adaptive Threshold set to 0.1
0: INFO: +Apply Object Motion Blur set to true
0: INFO: +Object Motion Blur Duration set to 0.5
0: INFO: +Object Motion Blur Samples set to 10
0: INFO: +Object Motion Blur Subdivisions set to 10
0: INFO: +Apply Image Motion Blur set to false
0: INFO: +Image Motion Blur Duration set to 1.0
0: INFO: +Apply Image Motion Blur To Environment Map set to false
0: INFO: +Image Motion Blur Transparency set to false
0: INFO: +Enable SSE set to true
0: INFO: +Conserve Memory set to false
0: INFO: >Performing XREF workaround…
0: Plugin rendering frame(s): 580
0: INFO: Render Tasks called
0: INFO: Lightning: Render frame 580
0: INFO: Lightning: Overring save file option to 1
0: INFO: Lightning: Rendering frame R:\01_PROJECTS_CURRENT\SDBII_2009\02_Prod SAL Attack and Immeadiate Attack (f-15)\31_F15_Follow_Bomb_INVERTED\Max\Renders\06_REVISIONS\06_NEW_SDBII\06_NEW_SDBII_0580.tga
0: INFO: Lightning: Rendering 0 render elements
0: INFO: Lightning: Setting up render parameters
0: INFO: Lightning: Rendering camera node Cam - F15
0: INFO: Lightning: Checking multipass info for frame 580
0: INFO: Lightning: Opening renderer
0: INFO: Lightning: Multipass disabled
0: INFO: Lightning: Rendering 1 passes
0: INFO: Lightning: Checking output paths
0: INFO: Lightning: Checking default actions
0: INFO: Lightning: TYPE_MISSING_EXTERNAL_FILES = IGNORE
0: INFO: Lightning: TYPE_MISSING_UVWS = IGNORE
0: INFO: Lightning: TYPE_MISSING_DLL_FILES = IGNORE
0: INFO: Lightning: TYPE_MISSING_XREF_FILES = IGNORE
0: INFO: Lightning: Calling renderer
0: INFO: Loading Map Files
0: INFO: Preparing Objects
0: INFO: Computing Antialiasing Filter
0: INFO: Preparing Lights
0: INFO: Rendering Reflect/Refract Maps
0: INFO: Clipping Objects
0: INFO: Transforming Objects
0: INFO: Rendering Reflect/Refract Maps
0: INFO:
0: INFO: Clipping Objects
0: INFO: Transforming Objects
Repository time: 09/29/2009 09:33:06
Repository time: 09/29/2009 09:53:13
Repository time: 09/29/2009 10:22:17
Repository time: 09/29/2009 10:54:48
Info Thread - Cancelling task because task filename “\ttuc-rc2\deadlinerepository\jobs\001_088_999_0cff9a7d\Rendering\001_088_999_0cff9a7d_00420_580-580.Ttuc-3d19” could not be found, it was likely requeued
sending cancel task command to plugin
0: In the process of canceling current task: ignoring exception thrown by PluginLoader
0: Unloading plugin: 3dsmax

Log Details
Log Date/Time = Sep 29/09 11:10:34
Frames = 580-580

Slave Machine = Ttuc-3d19
Slave Version = v3.1.0.36430 R

Plugin Name = 3dsmax

Here is a sample error report from the same job

Error Message
STALLED SLAVE REPORT

Current House Cleaner Information
MachineName: Ttuc-rc52
Version: v3.1.0.36430 R

slaveMachineName: Ttuc-3d08
slaveInfo.UpdateDateTime: 2009-09-29 10:37:29
DateTime.Now: 2009-09-29 10:52:58
Time Difference: 15.486 m

Job Name: SDBII - Inverted Main Shot - Dome No MB
Task Names: 581
Task Ids: 421

searching for job with id “001_088_999_0cff9a7d”
Found possible job: SDBII - Inverted Main Shot - Dome No MB
searching for task with id “421”
Found possible task: :[581-581]
task.SlaveMachineName: Ttuc-3d08
slave machine names match, stop searching
Associated Job Found: SDBII - Inverted Main Shot - Dome No MB
Associated Task Found: :[581-581]
probableStalledTask.SlaveMachineName: Ttuc-3d08
task is still rendering, attempting to fix situation.
requeuing task
slaveInfo.SlaveStatus set to Offline
slaveInfo.UpdateDateTime set to now

slaveInfo updated.

Slave Log
Error Type
Error Stack Trace

Can you send us the slave log from the local machine that contains this rendered task? You can find the logs folder from the Slave by selecting Help -> Explore Log Folder.

By the way, are you aware of any problems your network may be having? If the slave is having problems connecting to the repository for any reason, that could explain the behavior you’re seeing. Maybe watch to see what the network load is like when these problems occur. If there are connectivity problems, the slave log you provide should indicate that.

Cheers,

  • Ryan

I believe that this is the correct log file

hopefully the attachment works
deadlineslave(Ttuc-3d08)-2009-09-28-0000.log (1.3 MB)

Thanks for the log. Nothing looks out of the ordinary, so it doesn’t appear that there are any network issues. We’re currently in the process of debugging a similar problem in the current 4.0 beta (final version due out around the end of November), so it could be that is related, although the slave doesn’t stall out in the beta version.

Another thing to check: is the date/time on all your client machines and the repository in sync?

Cheers,

  • Ryan

how close does the time have to be?

The times should be as accurate as possible. Deadline determines if a slave has stalled based on the last time it updated its slave status file. If the times are off, this can result in false stalled slave detection. Now that I think about it, what’s likely happening is that Deadline thinks a particular slave has stalled and requeues its task. That would explain why the slave then detects that the task has been requeued and moves on to something else (this is shown in the log you posted).

If you don’t want to use a built-in time sync mechanism (I think every OS has this), Deadline has an option to sync times to a particular machine using SNTP. You can configure the SNTP settings in the Repository Options in the Monitor.

Cheers,

  • Ryan

they seem to be synced, but I made the changes to the repository anyway just in case.

I will resubmit a job that was causing this behavior when i get a chance.

thanks for all the help thus far.

Has there been any more ideas as to why this is happening?

We have jobs that just wont render because of this behavior.

I guess setting the time sync option didn’t help. Maybe one thing to keep an eye on is the Last Status Update column in the slave list in the Monitor. If you ever see this get higher than 2 minutes for an idle or rendering slave, then we know that the slave is having trouble updating its slave info. For the jobs that this problem is affecting, are they pretty heavy jobs (ie: do they use 100% and push your RAM limitations)?

We’re still running tests with the 4.0 beta to try and resolve this problem (assuming it’s the same problem that you’re having). The beta program will run until the end of November, so if you’re interested in joining and helping us figure this one out, click on the link:
viewtopic.php?f=10&t=2778

Cheers,

  • Ryan

I have some preliminary results of my testing and I think I may have an answer.

I believe the stalling behavior was a result of me having set the process priority to high for our rendering tasks within the repository.

I need to do a few more renders to see if changing it back to the default will have slowed our renders, and I want to switch it back and verify that the behavior returns but it looks like this was the culprit.

I will post back here after I run the job a few more times to recreate the behavior and to verify that render times are unaffected.

thanks for your help and hopefully this one is solved.

Ah, that would make sense actually. If the rendering process is eating up 100% CPU with high priority, then the slave won’t be able to get the tiny amount of CPU it needs to update its status on a regular basis. That would explain the stalled slave reports.

Cheers,

  • Ryan