Inconsistent results between MayaBatch and MayaCmd Plugins

rrussell · September 20, 2013, 7:10pm

Hi Seth,

The change to MayaBatch.py is the change we made further up this thread (viewtopic.php?f=86&t=10236#p44333). Originally when you reported this issue, we weren’t passing “arnold” to the mayaBatchRenderProcedure call, then then we changed it as an attempt to fix the issue.

That’s really interesting that the Solid Angle guys can’t reproduce the problem either. Perhaps it’s something specific about your environment.

I have not been able to render with your test scene (viewtopic.php?f=86&t=10236&start=10#p44409). As mentioned in that response, if you could send me a new scene created from scratch that reproduces it (and doesn’t use that shader of yours), I’ll give that a try.

Cheers,

Ryan

anon77753879 · September 20, 2013, 8:40pm

This is the absolute simplest file I could make.
There is no lens shader and no camera rig. It is a fresh camera baked out to the original camera animation.

We’ve run it here on the farm a number of times and it still exhibits the same behavior of frames starting to lose parts of texturing as nodes that have picked up a previous task on the job renders another subsequent task.

-andrew

rrussell · September 20, 2013, 8:47pm

Hey Andrew,

It looks like you forgot to upload the scene file.

Maybe you could give this scene file to the Solid Angle guys as well (if you haven’t already) to see if they can reproduce on their end with it.

Thanks!

Ryan

anon77753879 · September 20, 2013, 8:50pm

Woops. Here you go.

won_tstRenderFarm01_fx_turnTable_base_v009.zip (125 KB)

anon77753879 · September 20, 2013, 9:07pm

Here’s a zip with logs and images of 2 frames rendered from the same Slave on the same job picked up as subsequent task:

won_tstRenderFarm01_fx_turnTable_base_v009.zip (1.07 MB)

MikeOwen · September 21, 2013, 5:20pm

Hi guys,
Just a thought, looking at the 2 example logs, the first frame rendered by “RNODE017” takes 5 seconds and the second frame takes about 3 seconds from point of Arnold actually rendering to saving the image back to the file server over the network. The correct frame #0003 EXR is about 1Mb in size. Just wondering if the number of slaves you have and the speed of the renders (ie: these 2 logs indicate that approx. the first 150 frames are all rendered and saved in 52 seconds), is something that prior to beta 4 hasn’t occurred in your environment and therefore your file-server, maybe in combination with the network connectivity is bottle-necking here. Maybe as a simple test, you could just check “ON” the “Enable Local Rendering” setting in the Deadline-Maya submitter UI? I found when we enabled this setting, it had a dramatic effect in reducing the load across both our network and onto our storage system, as it renders the frame locally and then once the render task is complete, it attempts a file-copy back to the storage. (As a result, we keep this setting enabled ALL the time) Alternatively, maybe you could ask one of your storage guru’s to monitor your backend storage network connectivity & file-system load the next time you send a test job which you know will have issues?
Hope this helps!
Mike

0: STDOUT: 00:00:05 309MB | rendering image at 2048 x 764, 3 AA samples, 2 GI samples, 1 GI bounces 0: STDOUT: 00:00:05 309MB | initializing 13 nodes ... 0: STDOUT: 00:00:05 309MB | node initialization done in 0:00.00 0: STDOUT: 00:00:05 312MB | creating root object list ... 0: STDOUT: 00:00:05 312MB | no objects 0: STDOUT: 00:00:05 312MB | updating 14 nodes ... 0: STDOUT: 00:00:05 312MB | node update done in 0:00.00 0: STDOUT: 00:00:05 312MB | [aov] registered driver: "defaultArnoldDriver@driver_exr.RGBA" (driver_exr) 0: STDOUT: 00:00:05 312MB | [aov] * "RGBA" of type RGBA filtered by "defaultArnoldFilter@gaussian_filter" (gaussian_filter) 0: STDOUT: 00:00:05 312MB | [aov] done preparing 1 AOV for 1 output to 1 driver (0 deep AOVs) 0: STDOUT: 00:00:05 312MB | starting 24 bucket workers of size 64x64 ... 0: STDOUT: 00:00:10 353MB | bucket workers done in 0:04.92 0: STDOUT: 00:00:10 353MB | render done 0: STDOUT: 00:00:10 353MB | [driver_exr] writing file `U:/andrew.honacker/ToThinkBox/maya/images/beauty/won_tstRenderFarm01_fx_turnTable_base_v009.0003.exr'

0: STDOUT: 00:00:05 331MB | rendering image at 2048 x 764, 3 AA samples, 2 GI samples, 1 GI bounces 0: STDOUT: 00:00:05 331MB | initializing 13 nodes ... 0: STDOUT: 00:00:05 331MB | node initialization done in 0:00.00 0: STDOUT: 00:00:05 333MB | creating root object list ... 0: STDOUT: 00:00:05 333MB | no objects 0: STDOUT: 00:00:05 333MB | updating 14 nodes ... 0: STDOUT: 00:00:05 333MB | node update done in 0:00.00 0: STDOUT: 00:00:05 333MB | [aov] registered driver: "defaultArnoldDriver@driver_exr.RGBA" (driver_exr) 0: STDOUT: 00:00:05 333MB | [aov] * "RGBA" of type RGBA filtered by "defaultArnoldFilter@gaussian_filter" (gaussian_filter) 0: STDOUT: 00:00:05 333MB | [aov] done preparing 1 AOV for 1 output to 1 driver (0 deep AOVs) 0: STDOUT: 00:00:05 333MB | starting 24 bucket workers of size 64x64 ... 0: STDOUT: 00:00:08 354MB | bucket workers done in 0:02.85 0: STDOUT: 00:00:08 354MB | render done 0: STDOUT: 00:00:08 354MB | [driver_exr] writing file `U:/andrew.honacker/ToThinkBox/maya/images/beauty/won_tstRenderFarm01_fx_turnTable_base_v009.0152.exr'

rrussell · September 23, 2013, 3:18pm

Thanks for the new test scene! I was able to render this just fine here. I rendered 35 consecutive frames with the MayaBatch plugin on one machine, and I couldn’t reproduce the problem you’re seeing (every frame came out fine).

Perhaps you guys can try Mike’s suggestion of enabling Local Rendering when submitting the job. Maybe network load is resulting in corrupted images when they are saved directly over the network.

Cheers,

Ryan

anon77753879 · September 23, 2013, 7:26pm

Just tried enabling Local Rendering with the same results.

It has something to do with caching othwerwise the first frame each slave renders would be bad as well but they never are. It’s only subsequent frames that the node picks up on a job.

anon77753879 · September 23, 2013, 7:56pm

We can render the test scene on a single machine with Maya Batch successfully as well.

The problem occurs when rendering it through Deadline.

Setting the job to Reload Plugin Between Tasks fixes the problem but then it’s essentially the same as running Maya Command.
Everything is flushed from memory so the benefits of using Maya Batch go away.

rrussell · September 23, 2013, 8:13pm

Sorry, I should have said that I rendered on a single machine using Deadline. Here is what I did:

I updated the scene file you posted so that the path to the hdr file was correct.
I opened the updated file in Maya and submitted it to Deadline with the Use MayaBatch option enabled. I also set the machine list so that it would point to a single slave.
The job was picked up by that single slave, and it rendered the first 35 frames with no problems. The slave only loaded Maya for the initial frame, and kept it in memory for the remaining 34.

The reason I tested like this was because based on the first quite above, all these frames except the first should have been bad. But in my test they all came out fine.

That was just meant to workaround the issues you were having on Friday with Render.exe hanging with the MayaCmd plugin. It wasn’t meant as a final solution. However, since I’m unable to reproduce the problem here, and since local rendering doesn’t make a difference, I’m pretty stumped at this point. The logs you posted for the good and bad frames are also more or less identical, with no indication of what could be causing these bad frames.

MikeOwen · September 23, 2013, 8:42pm

Hi,
Just a thought…
Is this env variable “MAYA_RENDER_DESC_PATH” set on the slaves as you have mtoa on a network drive, so I assume it’s not been installed locally on the slaves?

support.solidangle.com/display/ … enderFlags

Arnold manual has this 1 line:

“If this is not set correctly then you may encounter errors when trying to render via a Windows shell.”

rrussell · September 24, 2013, 7:00pm

This may or may not be unrelated to the MayaBatch problem you guys are seeing, but we recently discovered that there was a bug with the PYTHONPATH and PYTHONHOME environment variables that were being set for the render processes that the slave starts up. We’ll be fixing this in beta 6, which we hope to get out this week. If you can upgrade to beta 6 when it becomes available, let us know if the problem still persists.

Thanks!

Ryan

anon77753879 · September 25, 2013, 1:37am

Hey guys. So… the plot thickens. We have found the cause of the problems. BUT we don’t know the cause OF the cause of the problems.

We have drive mappings set up in the Deadline Repo options. Removing those fixes the flickering problem.

Obviously this should not cause any problems. In fact they are mapped to the same drives that the files themselves are reading from. The textures in the file nodes are pointing to absolute paths on the mapped drives as well.

So if anything this should just be mirroring the same information as the submitted file.

What happens though is that a slave will pick up a task and you can see in the log that it is mapping the drives. The slave will finish a task and then pick up another task on the SAME Job and won’t remap the drives. Now that shouldn’t matter because A) the drives are mapped on the machine itself and B) the slave had already run a task on the job and should remember the settings.

Also this only began happening with Beta 4 from what we can see. We literally have a job on the queue that was submitted a day before we upgraded that we can resubmit FROM inside Monitor that still renders successfully. BUT if we resubmit the job from Maya the frames flicker.

The flickering SEEMS to have something to do with the mipmapped .tx files that Arnold uses. Since they’re mipmapped only the necessary data from the file is loaded. But it seems that the slave thinks it has the all correct .tx file in memory and doesn’t go looking for it on the mapped drives. It should be smart enough to know it doesn’t have everything it needs but something is confusing it.

Again, something is confusing it that changed when we upgraded to Beta 4. Nothing else changed that day.

We’re continuing to investigate further. Jeremy is setting up a D6 Beta 2 test environment that we’re going to run some test jobs through and see if we get different results.

For now we’re removing the drive mapping from the Repo configuration hoping that timeout issues from Windows doesn’t begin to occur on longer frames.

If you have any other ideas please let us know. And if you could check and see if anything related to how Deadline maps drives in the repo or how that information is sent to MayaBatch over the last few betas that would be helpful.

-andrew

rrussell · September 25, 2013, 2:37pm

Hey Andrew,

I checked the drive mapping code, and it hasn’t changed at all during the 6.1 beta. The last time this code was touched was back in 6.0, and that was when we added the option to only map a drive if it is unmapped. It sounds like you have this option currently enabled for you, since they don’t get remapped between tasks. Maybe you could try disabling that option to see if it makes a difference? You’ll probably have to restart your slaves so that they recognize this change immediately.

I wonder if when you’re submitting from Maya, the scene is being modified before it is submitted, and that modification (whatever it is) is resulting in this behavior.

Also, out of curiosity, when that job that is still in the Monitor was originally submitted, was the scene file submitted with the job? If it is, I wonder if that could be the difference that’s resulting in this behavior, since in the last set of logs you posted, the scene file was being loaded over the network.

Cheers,

Ryan

anon77753879 · September 25, 2013, 9:07pm

Hey guys.

So we set up a separate database and repo for Deadline 6 Beta 2 and the problem does not occur in that environment.

Also we’ve confirmed that on our Beta 4 repo that if we turn “Only Map If Unmapped” to True for all of the mapped drives then the problem goes away so we don’t have to remove the path mapping.

Beta 2 works whether the setting is True or False so something IS different between the beta 2 and beta 4 versions here.

Whether it’s set to True or False shouldn’t matter because it’s just mapping the same drive paths that are already mapped.

In the meantime we’ll just keep the Only Map set to True for now.

MikeOwen · September 25, 2013, 9:10pm

Beta 6 has been released today. Could you possibly download it and test again?

anon77753879 · September 25, 2013, 10:19pm

We updated our Alternate repo from Beta 2 (which worked) to Beta 4 just to confirm that it wasn’t something unique to our main repo and sure enough it’s exhibiting the same issue.

We are now going to upgrade it to Beta 6 and I’ll report back.

anon77753879 · September 26, 2013, 12:57am

Hey guys.

So we updated our Alt Repo to Beta 6 and now we’re getting Maya Python errors and the frames just hang.

I’ve included 2 of the logs here. One that we’ve set the PYTHONPATH and one where we don’t.
beta6_errorLog_pythonPathNOTset.txt (89 KB)
beta6_errorLog_setPythonPath.txt (94.4 KB)

rrussell · September 26, 2013, 12:58pm

Can you post your MayaBatch’s preload script? I’m thinking some of the python related bugs we’ve fixed in beta 6 might be causing this script to behave differently now. We can take a look to see if anything stands out.

For the path mapping issue, I took another look at everything that’s changed since beta 2, and I think I figured it out. In early versions of the beta, the mapping was only performed at the start of a job, and wasn’t done again until the slave picked up a new job. Now, the mapping is being performed before every task. This must be causing the problem, but when you enable the option to only map if the drive is unmapped, it works around the issue because the drive isn’t being repeatedly mapped anymore.

This also explains why I couldn’t reproduce it locally, because I changed the hdr path to a local one for testing.

In beta 7, we’ll be addressing this issue by ensuring that the slave only maps drives at the start of new jobs.

Thanks for your patience and help with figuring out this issue. Now hopefully we can get your current python paths resolved quickly. As mentioned above, please post your preload script and we’ll have a look at it ASAP!

Cheers,

Ryan

anon77753879 · September 26, 2013, 5:01pm

Hey Ryan.

Here you go. I included the MayaBatch.py that we have in our version controlled repo just in case anything was changed in it inadvertently but I don’t think it was.
jobPreLoad.zip (16 KB)