Ok, I’ve been struggling to pin this one down/get some concrete evidence of it, and I think I’ve got some useful error logs.
Basically, anytime we submit a Max job with Pulse running, the job will corrupt itself after a random length of time in which it renders fine. This happens with Brazil, VRay, and the default scanline renderer. If Pulse is turned off, the job renders fine.
The slaves are all running Vista Business x64 SP2, and the problem is occurring in Max 2011 x64, fully up to date. The Pulse machine is running Server 2008 x86 SP2. Before upgrading to 4.1, we never had this problem (i.e. the Deadline upgrade is the only thing that has changed).
Here is the interesting section of the Slave log for one such occurrence:
2010-08-24 12:52:18: Starting between task wait for available threads - seconds: 3
2010-08-24 12:52:21: Scheduler - Contacting Deadline Pulse running on “sfs-file”, port 5046.
2010-08-24 12:52:21: Scheduler - Requesting work from Deadline Pulse…
2010-08-24 12:52:25: Scheduler - Pulse returned work.
2010-08-24 12:52:25: 0: Got task: 999_060_999_07b58bfa_00021_21-21
2010-08-24 12:52:25: 0: Plugin already loaded: 3dsmax
2010-08-24 12:52:25: 0: Task timeout is disabled.
2010-08-24 12:52:25: 0: Job already loaded: bubble-bots_ - Diffuse Robots
2010-08-24 12:52:25: 0: Plugin rendering frame(s): 21
2010-08-24 12:52:26: 0: INFO: Render Tasks called
2010-08-24 12:52:26: 0: INFO: Lightning: Render frame 21
2010-08-24 12:52:26: 0: INFO: Lightning: Overriding save file option to 1
2010-08-24 12:52:26: 0: INFO: Lightning: Rendering frame W:\SF Bubble Bots\Renders\3D\Diffuse\r01_v00\bubble_bots_diffuse_r01_v00_0021.exr
2010-08-24 12:52:26: 0: INFO: Lightning: Rendering 0 render elements
2010-08-24 12:52:26: 0: INFO: Lightning: Checking output paths
2010-08-24 12:52:26: 0: INFO: Lightning: Checking default actions
2010-08-24 12:52:26: 0: INFO: Lightning: TYPE_MISSING_EXTERNAL_FILES = IGNORE
2010-08-24 12:52:26: 0: INFO: Lightning: TYPE_MISSING_UVWS = IGNORE
2010-08-24 12:52:26: 0: INFO: Lightning: TYPE_MISSING_DLL_FILES = FAIL
2010-08-24 12:52:26: 0: INFO: Lightning: TYPE_MISSING_XREF_FILES = IGNORE
2010-08-24 12:52:26: 0: INFO: Lightning: Calling renderer
2010-08-24 12:52:27: 0: INFO: Updating View
2010-08-24 12:52:27: 0: INFO: Updating World
2010-08-24 12:52:27: 0: INFO: Updating Geometry
2010-08-24 12:52:27: 0: INFO: Updating Materials
2010-08-24 12:52:27: 0: INFO: Updating Lights
2010-08-24 12:52:27: 0: INFO: Updating Environment and Effects
2010-08-24 12:52:27: 0: INFO: Updating World
2010-08-24 12:52:27: 0: INFO: Updating Render Database
2010-08-24 12:52:28: 0: INFO: Updating Ray Server
2010-08-24 12:52:28: 0: INFO: Initializing Render Clients
2010-08-24 12:52:28: 0: INFO: Broadcasting Frame Begin
2010-08-24 12:52:28: 0: INFO: Generating Render Caches
2010-08-24 12:53:23: 0: INFO: Rendering Frame
2010-08-24 12:56:22: 0: INFO: Broadcasting Frame End
2010-08-24 12:56:22: 0: INFO: Lightning: CallCurRendererRenderFrame returned code 1
2010-08-24 12:56:22: 0: INFO: Lightning: Render done
2010-08-24 12:56:22: 0: INFO: Lightning: writing output to temp file: C:\Users\RENDER~1\AppData\Local\Temp\bubble_bots_diffuse_r01_v00_0021.exr
2010-08-24 12:56:22: 0: INFO: Lightning: Saved image to W:\SF Bubble Bots\Renders\3D\Diffuse\r01_v00\bubble_bots_diffuse_r01_v00_0021.exr
2010-08-24 12:56:22: 0: INFO: Lightning: Checking render elements
2010-08-24 12:56:22: 0: Render time for frame: 3.941 m
2010-08-24 12:56:22: 0: Total time for task: 3.969 m
2010-08-24 12:56:23: Scheduler Thread - Exception occurred while trying to report task completed.
2010-08-24 12:56:23: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2010-08-24 12:56:23:
Exception Details
JobCorruptedException – There is an error in XML document (19, 4). (does not contain any frames.
Parameter name: framesList.Length
Actual value was 0.) (Job has been corrupted, it is recommended that this job be removed from the repository: \nas\nas\repository4\jobs\999_060_999_07b58bfa\999_060_999_07b58bfa.job)
JobCorruptedException.JobFileName: \nas\nas\repository4\jobs\999_060_999_07b58bfa\999_060_999_07b58bfa.job
Exception.Data: ( )
Exception.TargetSite: Deadline.Jobs.Job LoadJobFromFile(System.String)
Exception.Source: deadline
Exception.StackTrace:
at Deadline.Storage.JobStorage.LoadJobFromFile(String filename)
at Deadline.Storage.JobStorage.LoadJob(String jobId, Boolean archived)
at Deadline.Storage.JobStorage.RefreshJob(Job& job, Boolean forceRefresh)
at Deadline.Controllers.DeadlineController.RefreshJob(Job& job)
at Deadline.Slaves.SlaveSchedulerThread.CheckJobForCompletion(Job job)
at Deadline.Slaves.SlaveSchedulerThread.ReportTaskCompleted(Int32 taskIndex, TimeSpan renderTime, TimeSpan taskTime)
2010-08-24 12:56:23: <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2010-08-24 12:56:23: Scheduler Thread - Can not continue until this operation is completed successfully
2010-08-24 12:56:23: Scheduler Thread - Waiting 20 seconds before retrying…
2010-08-24 12:56:43: Scheduler Thread - Render Thread 0’s task filename “\nas\nas\repository4\jobs\999_060_999_07b58bfa\Rendering\999_060_999_07b58bfa_00021_21-21.Render-i7-04” could not be found, it was likely requeued
2010-08-24 12:56:43: Scheduler Thread - Contents of job’s Rendering folder:
2010-08-24 12:56:43: Scheduler Thread - \nas\nas\repository4\jobs\999_060_999_07b58bfa\Rendering\999_060_999_07b58bfa_00023_23-23.Render-06
2010-08-24 12:56:43: Scheduler Thread - \nas\nas\repository4\jobs\999_060_999_07b58bfa\Rendering\999_060_999_07b58bfa_00025_25-25.Render-08
2010-08-24 12:56:43: Scheduler Thread - \nas\nas\repository4\jobs\999_060_999_07b58bfa\Rendering\999_060_999_07b58bfa_00027_27-27.Render-10
2010-08-24 12:56:43: Scheduler Thread - \nas\nas\repository4\jobs\999_060_999_07b58bfa\Rendering\999_060_999_07b58bfa_00028_28-28.task
2010-08-24 12:56:43: Scheduler Thread - \nas\nas\repository4\jobs\999_060_999_07b58bfa\Rendering\999_060_999_07b58bfa_00029_29-29.Render-i7-01
2010-08-24 12:56:43: Scheduler Thread - \nas\nas\repository4\jobs\999_060_999_07b58bfa\Rendering\999_060_999_07b58bfa_00026_26-26.Render-09
2010-08-24 12:56:43: Scheduler Thread - \nas\nas\repository4\jobs\999_060_999_07b58bfa\Rendering\999_060_999_07b58bfa_00024_24-24.Render-i7-02
2010-08-24 12:56:43: Starting between task wait for available threads - seconds: 5
2010-08-24 12:56:48: Scheduler - Contacting Deadline Pulse running on “sfs-file”, port 5046.
2010-08-24 12:56:48: Scheduler - Requesting work from Deadline Pulse…
2010-08-24 12:56:49: Scheduler - No jobs found.
2010-08-24 12:56:49: Scheduler - Could not get a task from Deadline Pulse because: There is an error in XML document (19, 4). (does not contain any frames.
Parameter name: framesList.Length
Actual value was 0.) (Job has been corrupted, it is recommended that this job be removed from the repository: \nas\nas\repository4\jobs\999_060_999_07b58bfa\999_060_999_07b58bfa.job)
at Deadline.Storage.JobStorage.LoadJobFromFile(String filename)
at Deadline.Storage.JobStorage.LoadJob(String jobId, Boolean archived)
at Deadline.Storage.JobStorage.RefreshJob(Job& job, Boolean forceRefresh)
at Deadline.Scheduling.SchedulerUtils.DequeueTasksFromPulse(DeadlineController deadlineController, DeadlineNetworkSettings networkSettings, SlaveState& slaveState, Boolean& invalidMessage)
2010-08-24 12:56:49: Scheduler - Ignoring job with ID “999_100_999_52c15311”. reason:
2010-08-24 12:56:49: Scheduler - minor exception: There is an error in XML document (19, 4). (does not contain any frames.
Parameter name: framesList.Length
Actual value was 0.) (Job has been corrupted, it is recommended that this job be removed from the repository: \nas\nas\repository4\jobs\999_100_999_52c15311\999_100_999_52c15311.job) (Deadline.Jobs.JobCorruptedException)
2010-08-24 12:56:49: Scheduler - Job chooser found no jobs.
2010-08-24 12:56:49: Scheduler - Job chooser found no jobs.
2010-08-24 12:56:49: 0: Shutdown
2010-08-24 12:56:50: 0: Shutdown
2010-08-24 12:56:50: 0: Exited ThreadMain(), cleaning up…
2010-08-24 12:56:50: 0: INFO: End Job called - shutting down 3dsmax plugin
2010-08-24 12:56:50: 0: INFO: Disconnecting socket connection to 3dsmax
2010-08-24 12:56:50: 0: INFO: Waiting for 3dsmax to shut down
2010-08-24 12:56:50: 0: INFO: 3dsmax has shut down
2010-08-24 12:56:50: 0: Stopped job: bubble-bots_ - Diffuse Robots
2010-08-24 12:56:50: 0: Unloading plugin: 3dsmax
2010-08-24 12:56:51: starting between task wait - seconds: 0.05
2010-08-24 12:56:51: Scheduler - Contacting Deadline Pulse running on “sfs-file”, port 5046.
2010-08-24 12:56:51: Scheduler - Requesting work from Deadline Pulse…
2010-08-24 12:57:00: Scheduler - No jobs found.
The weird thing is that the Pulse log shows business as usual throughout this whole ordeal, except for some weird lines that come during the rendering of the frame above (as indicated by the timestamps):
…
2010-08-24 12:53:46: Clean Up Thread - Performing house cleaning
2010-08-24 12:53:46: Purging repository temp root files
2010-08-24 12:53:46: Cleaning up orphaned tasks
2010-08-24 12:53:46: Requeuing orphaned task ‘231’ for : the task has been in the rendering state for over 10 minutes, but the slave is no longer rendering this task
2010-08-24 12:53:46: Requeuing orphaned task ‘247’ for : the task has been in the rendering state for over 10 minutes, but the slave is no longer rendering this task
2010-08-24 12:53:46: Requeuing orphaned task ‘178’ for : the task has been in the rendering state for over 10 minutes, but the slave is no longer rendering this task
2010-08-24 12:53:46: Requeuing orphaned task ‘144’ for : the task has been in the rendering state for over 10 minutes, but the slave is no longer rendering this task
2010-08-24 12:53:46: Requeuing orphaned task ‘245’ for render-04: the task has been in the rendering state for over 10 minutes, but the slave is no longer rendering this task
2010-08-24 12:53:46: Requeuing orphaned task ‘176’ for : the task has been in the rendering state for over 10 minutes, but the slave is no longer rendering this task
2010-08-24 12:53:46: Requeuing orphaned task ‘141’ for : the task has been in the rendering state for over 10 minutes, but the slave is no longer rendering this task
2010-08-24 12:53:46: Requeuing orphaned task ‘150’ for : the task has been in the rendering state for over 10 minutes, but the slave is no longer rendering this task
2010-08-24 12:53:46: Requeuing orphaned task ‘182’ for : the task has been in the rendering state for over 10 minutes, but the slave is no longer rendering this task
2010-08-24 12:53:46: Requeuing orphaned task ‘184’ for : the task has been in the rendering state for over 10 minutes, but the slave is no longer rendering this task
2010-08-24 12:53:46: Requeuing orphaned task ‘244’ for : the task has been in the rendering state for over 10 minutes, but the slave is no longer rendering this task
2010-08-24 12:53:46: Requeuing orphaned task ‘147’ for : the task has been in the rendering state for over 10 minutes, but the slave is no longer rendering this task
2010-08-24 12:53:46: Purging obsolete slaves
2010-08-24 12:53:46: Scanning for dropped jobs…
2010-08-24 12:53:46: Done.
2010-08-24 12:53:46: Performing Job Repository Scan…
2010-08-24 12:53:47: Done.
…
Any thoughts?