AWS Thinkbox Discussion Forums

Two slaves rendering the same task

Beta build: 5 and earlier
System: Fedora 12
Custom command plugin: sends a bash command to the mantra renderer

The job completes, but some random frames fail to load in any image viewing/compositing package, despite the output EXR files having the correct size. I noticed that the broken frames (tasks) have two logs each, while the correct ones have only one. No errors. The render time of the faulty ones shows as 0:00:00 in the Monitor.

I’m attaching the task log.
job_log.zip (4.51 KB)

It could be related to this problem:

I’ve just submitted another job from XSI using your plugin (NOT batch).

3 of the tasks completed successfully without errors in about 3 minutes, but 6 of them have now been rendering for about 50 minutes… Their logs:

=======================================================
Log Message
=======================================================
0: Task timeout is disabled.
0: Loaded job: shot0070_animPrv - grey_pass (003_050_000_6d832894)
0: INFO: StartJob: initializing script plugin XSI
0: INFO: Any stdout that matches the regular expression ".*ERROR :.*" will be handled as appropriate
0: INFO: Any stdout that matches the regular expression "(Rendering frame: [0-9]+) \(([0-9]+) % done\)" will be handled as appropriate
0: INFO: Any stdout that matches the regular expression "(Rendering frame: [0-9]+) \(([0-9]+\.[0-9]+) % done\)" will be handled as appropriate
0: INFO: Any stdout that matches the regular expression "'INFO : Rendering done!" will be handled as appropriate
0: INFO: Any stdout that matches the regular expression "'INFO : Couldn't find the pass name .*" will be handled as appropriate
0: INFO: Any stdout that matches the regular expression ".*progr:\s+([0-9]{1,3}.[0-9]+)%\s+rendered\son.*" will be handled as appropriate
0: INFO: Any stdout that matches the regular expression ".*progr:\s+rendering\sfinished.*" will be handled as appropriate
0: INFO: Any stdout that matches the regular expression ".*Render completed \(100% done\).*" will be handled as appropriate
0: INFO: Any stdout that matches the regular expression ".*Unknown flag \(ignored\) : -render.*" will be handled as appropriate
0: INFO: Any stdout that matches the regular expression ".*INFO : 4000 - Comp.*" will be handled as appropriate
0: INFO: Any stdout that matches the regular expression ".*INFO : 4000 - Done.*" will be handled as appropriate
0: INFO: Any stdout that matches the regular expression ".*\[arnold\].* +([0-9]+)% done.*" will be handled as appropriate
0: INFO: Any stdout that matches the regular expression ".*\[arnold\].* render done.*" will be handled as appropriate
0: INFO: About: XSI Plugin for Deadline
0: Plugin rendering frame(s): 21-40
0: INFO: Stdout Handling Enabled: True
0: INFO: Popup Handling Enabled: True
0: INFO: Using Process Tree: True
0: INFO: Hiding DOS Window: True
0: INFO: Creating New Console: False
0: INFO: Rendering with XSI version 9
0: INFO: Not enforcing a build of XSI
0: INFO: Render Executable: "/opt/package/softimage_2011.5/Softimage_2011_Subscription_Advantage_Pack/Application/bin/xsibatch.deadline"
0: INFO: CheckPathMapping: Swapped "Z:\PROD\dev\tp\sherlock\shot0070\software\xsi\Scenes\wojak\shot0070_anim_v05_r05_fin.scn" with "/PROD/dev\tp\sherlock\shot0070\software\xsi\Scenes\wojak\shot0070_anim_v05_r05_fin.scn"
0: INFO: Rendering using new batch mode introduced in XSI 6.01
0: INFO: wojak
0: INFO: CheckPathMapping: Swapped "Z:\PROD\dev\tp\sherlock\sherlock\render\prev\" with "/PROD/dev\tp\sherlock\sherlock\render\prev\"
0: INFO: Rendering to network drive
0: INFO: Render Argument: wojak -render "/PROD/dev/tp/sherlock/shot0070/software/xsi/Scenes/wojak/shot0070_anim_v05_r05_fin.scn" -frames 21,40,1 -pass "grey_pass" -output_dir "/PROD/dev/tp/sherlock/sherlock/render/prev" -skip False -mb True -verbose on -script "/home/deadline/Deadline/slave/Render01/plugins/AdditionalRenderOptions.vbs" -main SetAdditionalRenderOptions -args -passName "grey_pass" -width "960" -height "540" -sampleMin "1" -sampleMax "1" -sampleFilter "gauss" -sampleJitter "True" -outputPrefix "" -xMin "" -yMin "" -xMax "" -yMax ""
0: INFO: Startup Directory: "/opt/package/softimage_2011.5/Softimage_2011_Subscription_Advantage_Pack/Application/bin"
0: INFO: Process Priority: Normal
0: INFO: Process is now running
0: In the process of canceling current task: ignoring exception thrown by PluginLoader


=======================================================
Log Details
=======================================================
Log Date/Time = Nov 13/11  16:05:35
Frames = 21-40

Slave Machine = Render01
Slave Version = v5.1.0.45783 R

Plugin Name = XSI

nothing else…

Your XSI plugin is slightly modified to add a “user” parameter at the beginning of the render parameters, which is then picked up by a simple bash script.
So, for example, we have:

render parameters: “user -render scene.scn -frames 1-100”, etc.
The bash script takes the first parameter and runs the “xsibatch” executable via “sudo”, which allows the renderer to generate its output files per “user”:

#!/bin/bash
# First argument is the user name; the rest are passed through to xsibatch.
user=$1
echo "$user"

shift

sudo -i -u "$user" bash -c "/opt/package/softimage_2011.5/Softimage_2011_Subscription_Advantage_Pack/Application/bin/xsibatch $*"
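As an aside, expanding the forwarded arguments inside a bash -c command string word-splits any argument that contains spaces (for example a scene path). A quoting-safe sketch of the same forwarding idea, with /bin/echo standing in for the xsibatch binary so it can run anywhere:

```shell
#!/bin/bash
# Quoting-safe forwarding sketch: inside bash -c, "$0" is the program to run
# and "$@" are its arguments, so each argument keeps its own word boundaries.
# /bin/echo is a stand-in for the real xsibatch path; in the real wrapper
# this line would be prefixed with: sudo -i -u "$user"
forward() {
  bash -c '"$0" "$@"' /bin/echo "$@"
}

forward -render "scene with spaces.scn" -frames 21,40,1
```

With this form, an argument like "scene with spaces.scn" reaches the renderer as a single argument instead of three.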

So as you can see there is nothing advanced here, and as I’ve said, some of the frames do render completely. The log is similar to one of the logs from the faulty Mantra job.

That’s really strange. Do you have Pulse running? If so, has it been upgraded to beta 5 as well? We should definitely check out the slave and Pulse logs (if applicable) around that time to see what was going on. Just let us know if you are currently running Pulse or not, and then we’ll let you know what logs we need.

Thanks!

  • Ryan

No Pulse. Our new server with an SSD disk is not ready just yet, and we don’t want to run Pulse on a workstation, only on a server.

The mantra job example:

For a total of 180 frames there are 4 bad ones.

  • Different slaves each time.
  • Every bad task had two logs.
  • Two of the bad tasks have a 0:00:00 render/task time, and two have normal times, which makes them difficult to spot from the Monitor alone.
  • In 3 of the 4 cases, the start times of the two logs (the time at which each slave picked up the task) differed by about a second, e.g. 19:34:18 for the first and 19:34:19 for the second; in the remaining case the difference between the logs was about a minute and a half.

Thanks for confirming. Just one question:

Do you mean 2 different slaves in each of the 4 cases (ie: render01 and render05 for the first task, render03 and render04 for the second task, etc), or it was 2 different slaves, but the same 2 slaves in each case (ie: render03 and render04 for the first task, render03 and render04 for the second task, etc)? Just curious if a couple of slaves are potentially getting their wires crossed, since they all share the same local file system.

Yep, your first guess:

2 different slaves in each of the 4 cases (ie: render01 and render05 for the first task, render03 and render04 for the second task, etc)

Attaching the log from the job.

Forgot to mention:

the frames are: 7, 54, 93, 174

(I’m not 100% sure about 93, but I’m out of the studio now)
003_050_008_3375589e.zip (265 KB)

Thanks for confirming. I have a guess on what could be happening, and it’s based on behavior we’ve seen between the slaves and Pulse.

We think network latency can result in a case where one slave renames a task file (when moving it from the Queued to Rendering state), but another slave still sees the original Queued task. This results in the second slave dequeuing the same task, but then later discarding it because it doesn’t “see” that task anymore (since it’s actually in the Rendering state). The same thing could happen between Pulse and a slave. Pulse would rename the task, the slave would start working on it, but then discard it because it doesn’t see that task in the Rendering state yet. We resolved the Pulse/Slave issue by having the slave wait to confirm its task’s existence before starting rendering. We don’t have the luxury of doing this between two slaves though.
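The rename race described above can be sketched in a few lines of bash (the task-file names here are made up for illustration; the real Deadline repository layout is internal):

```shell
#!/bin/bash
# Sketch of the stale-listing race: Slave B lists the repository, Slave A
# then claims the task by renaming it from Queued to Rendering, and Slave B
# dequeues the same task based on its now-stale listing.
repo=$(mktemp -d)
touch "$repo/task_0007_QUEUED"

stale_listing=$(ls "$repo")     # Slave B takes its view of the queue here...

mv "$repo/task_0007_QUEUED" "$repo/task_0007_RENDERING"   # ...then Slave A claims the task.

# Slave B still sees the queued task in its stale listing, so it dequeues it:
echo "$stale_listing" | grep -q "task_0007_QUEUED" && echo "Slave B dequeues the task"

# ...but the Queued file no longer exists, so Slave B later discards the task,
# after both slaves may already have written to the same output frame:
[ -e "$repo/task_0007_QUEUED" ] || echo "Slave B discards the task"
```

Both slaves act correctly on the view they have; it is the window between listing and renaming that lets two of them pick up one task.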

It would be really interesting to see if the problem persists once you are using Pulse. As a test, maybe you could run Pulse on one of the slaves for a few days. You can tell that the slaves are connecting to Pulse because the heart icon in the status bar will light up. In the Pulse Settings in the Repository Options, just make sure that the Task Confirmation feature is enabled.

Cheers,

  • Ryan

Will do! Thanks!
