Deadline 4.1 Job Corruption

Ok, I’ve been struggling to pin this one down/get some concrete evidence of it, and I think I’ve got some useful error logs.

Basically, anytime we submit a Max job with Pulse running, the job will corrupt itself after a random length of time in which it renders fine. This happens with Brazil, VRay, and the default scanline renderer. If Pulse is turned off, the job renders fine.

The slaves are all running Vista Business x64 SP2, and the problem is occurring in Max 2011 x64, fully up to date. The Pulse machine is running Server 2008 x86 SP2. Before upgrading to 4.1, we never had this problem (i.e. the Deadline upgrade is the only thing that has changed).

Here is the interesting section of the Slave log for one such occurrence:


2010-08-24 12:52:18: Starting between task wait for available threads - seconds: 3
2010-08-24 12:52:21: Scheduler - Contacting Deadline Pulse running on “sfs-file”, port 5046.
2010-08-24 12:52:21: Scheduler - Requesting work from Deadline Pulse…
2010-08-24 12:52:25: Scheduler - Pulse returned work.
2010-08-24 12:52:25: 0: Got task: 999_060_999_07b58bfa_00021_21-21
2010-08-24 12:52:25: 0: Plugin already loaded: 3dsmax
2010-08-24 12:52:25: 0: Task timeout is disabled.
2010-08-24 12:52:25: 0: Job already loaded: bubble-bots_ - Diffuse Robots
2010-08-24 12:52:25: 0: Plugin rendering frame(s): 21
2010-08-24 12:52:26: 0: INFO: Render Tasks called
2010-08-24 12:52:26: 0: INFO: Lightning: Render frame 21
2010-08-24 12:52:26: 0: INFO: Lightning: Overriding save file option to 1
2010-08-24 12:52:26: 0: INFO: Lightning: Rendering frame W:\SF Bubble Bots\Renders\3D\Diffuse\r01_v00\bubble_bots_diffuse_r01_v00_0021.exr
2010-08-24 12:52:26: 0: INFO: Lightning: Rendering 0 render elements
2010-08-24 12:52:26: 0: INFO: Lightning: Checking output paths
2010-08-24 12:52:26: 0: INFO: Lightning: Checking default actions
2010-08-24 12:52:26: 0: INFO: Lightning: TYPE_MISSING_EXTERNAL_FILES = IGNORE
2010-08-24 12:52:26: 0: INFO: Lightning: TYPE_MISSING_UVWS = IGNORE
2010-08-24 12:52:26: 0: INFO: Lightning: TYPE_MISSING_DLL_FILES = FAIL
2010-08-24 12:52:26: 0: INFO: Lightning: TYPE_MISSING_XREF_FILES = IGNORE
2010-08-24 12:52:26: 0: INFO: Lightning: Calling renderer
2010-08-24 12:52:27: 0: INFO: Updating View
2010-08-24 12:52:27: 0: INFO: Updating World
2010-08-24 12:52:27: 0: INFO: Updating Geometry
2010-08-24 12:52:27: 0: INFO: Updating Materials
2010-08-24 12:52:27: 0: INFO: Updating Lights
2010-08-24 12:52:27: 0: INFO: Updating Environment and Effects
2010-08-24 12:52:27: 0: INFO: Updating World
2010-08-24 12:52:27: 0: INFO: Updating Render Database
2010-08-24 12:52:28: 0: INFO: Updating Ray Server
2010-08-24 12:52:28: 0: INFO: Initializing Render Clients
2010-08-24 12:52:28: 0: INFO: Broadcasting Frame Begin
2010-08-24 12:52:28: 0: INFO: Generating Render Caches
2010-08-24 12:53:23: 0: INFO: Rendering Frame
2010-08-24 12:56:22: 0: INFO: Broadcasting Frame End
2010-08-24 12:56:22: 0: INFO: Lightning: CallCurRendererRenderFrame returned code 1
2010-08-24 12:56:22: 0: INFO: Lightning: Render done
2010-08-24 12:56:22: 0: INFO: Lightning: writing output to temp file: C:\Users\RENDER~1\AppData\Local\Temp\bubble_bots_diffuse_r01_v00_0021.exr
2010-08-24 12:56:22: 0: INFO: Lightning: Saved image to W:\SF Bubble Bots\Renders\3D\Diffuse\r01_v00\bubble_bots_diffuse_r01_v00_0021.exr
2010-08-24 12:56:22: 0: INFO: Lightning: Checking render elements
2010-08-24 12:56:22: 0: Render time for frame: 3.941 m
2010-08-24 12:56:22: 0: Total time for task: 3.969 m
2010-08-24 12:56:23: Scheduler Thread - Exception occurred while trying to report task completed.
2010-08-24 12:56:23: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2010-08-24 12:56:23:
Exception Details
JobCorruptedException – There is an error in XML document (19, 4). (does not contain any frames.
Parameter name: framesList.Length
Actual value was 0.) (Job has been corrupted, it is recommended that this job be removed from the repository: \nas\nas\repository4\jobs\999_060_999_07b58bfa\999_060_999_07b58bfa.job)
JobCorruptedException.JobFileName: \nas\nas\repository4\jobs\999_060_999_07b58bfa\999_060_999_07b58bfa.job
Exception.Data: ( )
Exception.TargetSite: Deadline.Jobs.Job LoadJobFromFile(System.String)
Exception.Source: deadline
Exception.StackTrace:
at Deadline.Storage.JobStorage.LoadJobFromFile(String filename)
at Deadline.Storage.JobStorage.LoadJob(String jobId, Boolean archived)
at Deadline.Storage.JobStorage.RefreshJob(Job& job, Boolean forceRefresh)
at Deadline.Controllers.DeadlineController.RefreshJob(Job& job)
at Deadline.Slaves.SlaveSchedulerThread.CheckJobForCompletion(Job job)
at Deadline.Slaves.SlaveSchedulerThread.ReportTaskCompleted(Int32 taskIndex, TimeSpan renderTime, TimeSpan taskTime)

2010-08-24 12:56:23: <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2010-08-24 12:56:23: Scheduler Thread - Can not continue until this operation is completed successfully
2010-08-24 12:56:23: Scheduler Thread - Waiting 20 seconds before retrying…
2010-08-24 12:56:43: Scheduler Thread - Render Thread 0’s task filename “\nas\nas\repository4\jobs\999_060_999_07b58bfa\Rendering\999_060_999_07b58bfa_00021_21-21.Render-i7-04” could not be found, it was likely requeued
2010-08-24 12:56:43: Scheduler Thread - Contents of job’s Rendering folder:
2010-08-24 12:56:43: Scheduler Thread - \nas\nas\repository4\jobs\999_060_999_07b58bfa\Rendering\999_060_999_07b58bfa_00023_23-23.Render-06
2010-08-24 12:56:43: Scheduler Thread - \nas\nas\repository4\jobs\999_060_999_07b58bfa\Rendering\999_060_999_07b58bfa_00025_25-25.Render-08
2010-08-24 12:56:43: Scheduler Thread - \nas\nas\repository4\jobs\999_060_999_07b58bfa\Rendering\999_060_999_07b58bfa_00027_27-27.Render-10
2010-08-24 12:56:43: Scheduler Thread - \nas\nas\repository4\jobs\999_060_999_07b58bfa\Rendering\999_060_999_07b58bfa_00028_28-28.task
2010-08-24 12:56:43: Scheduler Thread - \nas\nas\repository4\jobs\999_060_999_07b58bfa\Rendering\999_060_999_07b58bfa_00029_29-29.Render-i7-01
2010-08-24 12:56:43: Scheduler Thread - \nas\nas\repository4\jobs\999_060_999_07b58bfa\Rendering\999_060_999_07b58bfa_00026_26-26.Render-09
2010-08-24 12:56:43: Scheduler Thread - \nas\nas\repository4\jobs\999_060_999_07b58bfa\Rendering\999_060_999_07b58bfa_00024_24-24.Render-i7-02
2010-08-24 12:56:43: Starting between task wait for available threads - seconds: 5
2010-08-24 12:56:48: Scheduler - Contacting Deadline Pulse running on “sfs-file”, port 5046.
2010-08-24 12:56:48: Scheduler - Requesting work from Deadline Pulse…
2010-08-24 12:56:49: Scheduler - No jobs found.
2010-08-24 12:56:49: Scheduler - Could not get a task from Deadline Pulse because: There is an error in XML document (19, 4). (does not contain any frames.
Parameter name: framesList.Length
Actual value was 0.) (Job has been corrupted, it is recommended that this job be removed from the repository: \nas\nas\repository4\jobs\999_060_999_07b58bfa\999_060_999_07b58bfa.job)
at Deadline.Storage.JobStorage.LoadJobFromFile(String filename)
at Deadline.Storage.JobStorage.LoadJob(String jobId, Boolean archived)
at Deadline.Storage.JobStorage.RefreshJob(Job& job, Boolean forceRefresh)
at Deadline.Scheduling.SchedulerUtils.DequeueTasksFromPulse(DeadlineController deadlineController, DeadlineNetworkSettings networkSettings, SlaveState& slaveState, Boolean& invalidMessage)
2010-08-24 12:56:49: Scheduler - Ignoring job with ID “999_100_999_52c15311”. reason:
2010-08-24 12:56:49: Scheduler - minor exception: There is an error in XML document (19, 4). (does not contain any frames.
Parameter name: framesList.Length
Actual value was 0.) (Job has been corrupted, it is recommended that this job be removed from the repository: \nas\nas\repository4\jobs\999_100_999_52c15311\999_100_999_52c15311.job) (Deadline.Jobs.JobCorruptedException)
2010-08-24 12:56:49: Scheduler - Job chooser found no jobs.
2010-08-24 12:56:49: Scheduler - Job chooser found no jobs.
2010-08-24 12:56:49: 0: Shutdown
2010-08-24 12:56:50: 0: Shutdown
2010-08-24 12:56:50: 0: Exited ThreadMain(), cleaning up…
2010-08-24 12:56:50: 0: INFO: End Job called - shutting down 3dsmax plugin
2010-08-24 12:56:50: 0: INFO: Disconnecting socket connection to 3dsmax
2010-08-24 12:56:50: 0: INFO: Waiting for 3dsmax to shut down
2010-08-24 12:56:50: 0: INFO: 3dsmax has shut down
2010-08-24 12:56:50: 0: Stopped job: bubble-bots_ - Diffuse Robots
2010-08-24 12:56:50: 0: Unloading plugin: 3dsmax
2010-08-24 12:56:51: starting between task wait - seconds: 0.05
2010-08-24 12:56:51: Scheduler - Contacting Deadline Pulse running on “sfs-file”, port 5046.
2010-08-24 12:56:51: Scheduler - Requesting work from Deadline Pulse…
2010-08-24 12:57:00: Scheduler - No jobs found.

The weird thing is that the Pulse log shows business as usual throughout this whole ordeal, except for some weird lines that come during the rendering of the frame above (as indicated by the timestamps):



2010-08-24 12:53:46: Clean Up Thread - Performing house cleaning
2010-08-24 12:53:46: Purging repository temp root files
2010-08-24 12:53:46: Cleaning up orphaned tasks
2010-08-24 12:53:46: Requeuing orphaned task ‘231’ for : the task has been in the rendering state for over 10 minutes, but the slave is no longer rendering this task
2010-08-24 12:53:46: Requeuing orphaned task ‘247’ for : the task has been in the rendering state for over 10 minutes, but the slave is no longer rendering this task
2010-08-24 12:53:46: Requeuing orphaned task ‘178’ for : the task has been in the rendering state for over 10 minutes, but the slave is no longer rendering this task
2010-08-24 12:53:46: Requeuing orphaned task ‘144’ for : the task has been in the rendering state for over 10 minutes, but the slave is no longer rendering this task
2010-08-24 12:53:46: Requeuing orphaned task ‘245’ for render-04: the task has been in the rendering state for over 10 minutes, but the slave is no longer rendering this task
2010-08-24 12:53:46: Requeuing orphaned task ‘176’ for : the task has been in the rendering state for over 10 minutes, but the slave is no longer rendering this task
2010-08-24 12:53:46: Requeuing orphaned task ‘141’ for : the task has been in the rendering state for over 10 minutes, but the slave is no longer rendering this task
2010-08-24 12:53:46: Requeuing orphaned task ‘150’ for : the task has been in the rendering state for over 10 minutes, but the slave is no longer rendering this task
2010-08-24 12:53:46: Requeuing orphaned task ‘182’ for : the task has been in the rendering state for over 10 minutes, but the slave is no longer rendering this task
2010-08-24 12:53:46: Requeuing orphaned task ‘184’ for : the task has been in the rendering state for over 10 minutes, but the slave is no longer rendering this task
2010-08-24 12:53:46: Requeuing orphaned task ‘244’ for : the task has been in the rendering state for over 10 minutes, but the slave is no longer rendering this task
2010-08-24 12:53:46: Requeuing orphaned task ‘147’ for : the task has been in the rendering state for over 10 minutes, but the slave is no longer rendering this task
2010-08-24 12:53:46: Purging obsolete slaves
2010-08-24 12:53:46: Scanning for dropped jobs…
2010-08-24 12:53:46: Done.
2010-08-24 12:53:46: Performing Job Repository Scan…
2010-08-24 12:53:47: Done.

Any thoughts?

Hi,

This sounds really familiar to the Backburner XML issue, but I may be completely wrong about this, as it has nothing to do with Pulse doing strange things…

Although this issue only caused issues for Max2009 & 2010 (2011 was already fixed by PF as they shipped Deadline v4.1), hence the extra patch supplied by PF, as Autodesk backported their latest hotfix 2 versions:
viewtopic.php?f=57&t=4132
However, the fact that the error is occuring in Max2011 for you is very strange.
By any chance did you overwrite any of your “\deadline\plugins\3dsmax” files, in particular the *.xml files from an older copy of Deadline you may have somewhere, v4.0?
In particular, the file “deadlineStartupMax2011.xml” should be 12kb in size (double the size of the other *.xml files in the 3dsmax plugin directory)?
Also the “deadlineStartupMax2011.xml” file if opened in a text editor should contain an entry along the following:

300

If it doesn’t, then you have the wrong version of XML file in your plugins folder, which a re-install of v4.1 will resolve.

Mike

Thanks for the suggestion Mike, but unfortunately the XML appears to be correct (12kb, with the “300” tag).

Any other thoughts? It’s quite annoying having to remember to turn the farm on every time we submit something… :smiley:

Since the error log mentions which XML file was corrupted, could you post that XML file?

For example, if the error is

JobCorruptedException – There is an error in XML document (19, 4). (does not contain any frames.
Parameter name: framesList.Length
Actual value was 0.) (Job has been corrupted, it is recommended that this job be removed from the repository: \nas\nas\repository4\jobs\999_060_999_07b58bfa\999_060_999_07b58bfa.job)
JobCorruptedException.JobFileName: \nas\nas\repository4\jobs\999_060_999_07b58bfa\999_060_999_07b58bfa.job

then you could zip and attach the file
\nas\nas\repository4\jobs\999_060_999_07b58bfa\999_060_999_07b58bfa.job
so we can take a look at it…

This will not fix the bug, but might give Ryan ideas where to look.

Are all machines, including the pulse machine, upgraded to 4.1? If the pulse version is different than the slaves, that could explain why things get corrupted. We haven’t had this problem reported by anyone else using 4.1, nor have we seen it on any of our 3 internal farms (each running Pulse and between 100 and 350 render nodes).

What OS is your repository installed on? Also, how many render nodes and workstations to you have connecting to the repository. Finally, as Bobo suggested, a sample of one of these XML files might be helpful too.

Cheers,

  • Ryan

Yes, all machines are running 4.1.42706.

The repository is installed on the same machine that is running Pulse, so the OS is Server 2008 x86 SP2; the repository tree itself exists on a NAS. There are 4 workstations and 15 render nodes that are consistently connected to the repo. This is the same configuration we’ve been using since 4.0 without any issues.

Unfortunately we’ve got quite a bit of rendering to do, so I don’t have time to let a job corrupt at the moment, but I should be able to in the next couple of days, and then I’ll post some files.

I have a very vague memory of having this issue once when a rogue slave was on a different time/date/region setting.
Configuring the SNTP options in Deadline sorted it out…
Mike

Well, I’ve been hammering the farm for the last 2 or 3 days with Pulse running but I can’t get it to corrupt anything anymore, so unless that pops up again, things seem to be working.

Of course, as soon as I hit submit on this, everything will explode.