AWS Thinkbox Discussion Forums

Houdini Mantra IFD rendering

Hi,

We’re having some issues rendering IFDs through Deadline and I’m trying to chase down the cause. It seems to happen when we’re using compression on the IFDs: if we export them with .sc or .gz compression, then inevitably some of the tasks fail. I can take the command Deadline is using (switching the IFD path back to the original server file), run it on the render machine, and it will process the task without fail, so it would seem to be a bug of some kind.

It’s not printing the error properly when it crashes, so I grabbed the text from the log just before the crash.

Any ideas on what it might be doing…?

Thanks

Nick
Job_Error_Houdini_sc.txt (6.49 KB)
Mantra_ifd_sc_error.txt (19.5 KB)

We saw this a while ago, and Justin R dug into it pretty deeply.

From the “cannot read from stdin” error, I’m expecting there is some geometry data being passed from object_merge1 via a standard output pipe from Houdini to the standard input pipe of Mantra. Now, that just won’t work in the context of Deadline, since Mantra runs in its own happy little world after Houdini’s done.
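
If it’s useful, one way to spot this is to look inside a failing IFD for a literal stdin geometry reference. Here’s a minimal sketch, assuming the IFD is plain text or .gz (Blosc .sc files aren’t handled) and that stdin-sourced geometry shows up as a literal “stdin” token in the file; both of those are assumptions to verify against your exports:

import gzip
import sys

def open_ifd(path):
    # .gz can be read with the gzip module; .sc (Blosc) is not handled here.
    if path.endswith(".gz"):
        return gzip.open(path, "rt", errors="replace")
    return open(path, "r", errors="replace")

def find_stdin_refs(path):
    # Collect any lines that mention "stdin" (assumed marker for piped geometry).
    hits = []
    with open_ifd(path) as f:
        for lineno, line in enumerate(f, 1):
            if "stdin" in line:
                hits.append((lineno, line.strip()))
    return hits

if __name__ == "__main__":
    for lineno, text in find_stdin_refs(sys.argv[1]):
        print("%d: %s" % (lineno, text))

If it prints nothing for a failing frame, the geometry is probably being referenced some other way.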

One thing Justin found back at the end of August was a “Save Binary Geometry” option on the scene’s ROP node. Turning it off seemed to work there, but it may have been the solution to some other data that was being passed through. What do you think?
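
For reference, toggling that from Houdini’s Python shell would look roughly like this. It’s only a sketch: the node path /out/mantra1 and the parameter name "binarygeometry" are assumptions, so check them against your actual Mantra ROP:

import hou

rop = hou.node("/out/mantra1")        # adjust to your Mantra ROP
parm = rop.parm("binarygeometry")     # assumed name of the "Save Binary Geometry" toggle
if parm is not None:
    parm.set(0)                       # 0 = off, write ASCII geometry into the IFD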

Hi Edwin,

Yeah, the binary data thing is something we’ve looked at; it’s turned off on the submission, as that definitely breaks it. I thought this had resolved it, but the job got part way through and then just started breaking again. It’s really strange, though, because it’s not all frames; most of them go through fine. One of the other errors we had pop up was to do with it thinking it had the wrong kind of compression. The fact that it works on the machine outside of Deadline seems to say the file on the server is OK, so I just wondered if there could be a problem when copying it over? Although I’m sure you check file integrity. :frowning:

Nick

Actually, we don’t check file integrity. Most of the time, things work out fairly well.

I looked at the other log (Mantra_ifd_sc_error.txt), and it actually looks like SideFX is using Blosc for that compression and forgot the compiler settings for the lz4hc algorithm. Is that error thrown on specific machines as opposed to specific frames? It might be an issue with one release version.

2016-11-30 09:08:04:  0: STDOUT: Blosc has not been compiled with decompression support for 'lz4hc' format. Please recompile for adding this support.

Info on Blosc:
blosc.org/
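
As a quick per-machine sanity check, the standalone python-blosc package can round-trip a buffer with the lz4hc codec. Note this exercises python-blosc’s own build of the library, not the copy embedded in Houdini/Mantra, so treat it as a rough indicator only (and it assumes python-blosc is installed):

import blosc

data = bytes(bytearray(range(256))) * 1024
packed = blosc.compress(data, typesize=1, cname="lz4hc")
unpacked = blosc.decompress(packed)
assert unpacked == data
print("lz4hc round-trip OK, compressed size:", len(packed))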

I’d e-mail the SideFX guys anyway.

Hi,

I’d agree that it seems like it could be something like that, but I can run that same command from the command line on that exact machine with no problems. I’ve attached the output…

Nick
Blosc_ifd_completed.txt (6.74 KB)

Now that is interesting! I wonder if it could be something to do with library loading… If I’m remembering things right, Windows normally searches the application’s own folder for libraries, then custom registered locations and the system32 folder, and only then the working directory and PATH. We should be setting the working directory to the one that contains the Mantra executable.
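
To illustrate the idea (this is not the actual plugin code), launching Mantra on an exported IFD with the working directory pinned to its bin folder would look something like the sketch below; the install path and IFD path are made up for the example, and "-f" just tells mantra to read the IFD from disk instead of stdin:

import os
import subprocess

mantra_exe = r"C:\Program Files\Side Effects Software\Houdini 15.0.313\bin\mantra.exe"  # example path
ifd_path = r"\\server\project\ifds\shot.0070.ifd.sc"                                    # example path

result = subprocess.run(
    [mantra_exe, "-f", ifd_path],        # -f: read the IFD from a file rather than stdin
    cwd=os.path.dirname(mantra_exe),     # working directory = mantra's own bin folder
)
print("mantra exited with", result.returncode)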

The middle ground for testing this is going to be to run it through the command line submitter. That’ll be in the Slave’s environment in every way, but won’t use the Mantra plugin code. Could you throw that in and see what happens? You can find the submitter under “submit”, then “command line submission” in the Monitor.

Hi Edwin,

Ran it through the command line and again it runs fine; I’ve attached the log file. Neither of these is copying the file locally, though.

Nick
cmd_line_log.txt (15.7 KB)

Well, we’re slowly filtering things out here.

The only other thing that stands out at this point is that the Mantra job was running with concurrent tasks (thanks for sending a Slave log snippet :smiley:). Could you try running the same job again with concurrent tasks off, and also the command line job with concurrent tasks set to two and a frame range of 1-5 (you can use "" to avoid output file collisions)?

I don’t see how library loading could be impacted by multiple applications running at once. They’d share program memory, I’d imagine, but that should be it, and it really should have zero impact on the actual program execution.

Hi Edwin,

I don’t think it’s the concurrent tasks; we have some jobs that render with concurrent frames and some that don’t, and it errors on both (I’ve attached an error log).

Nick
Non_Concurrent_Tasks.txt (7.26 KB)

So I am going to be 100% honest here: I am at a loss, and unfortunately Edwin, who is out of the country this week, is the person I would ask to look at this. The error that seems to be stopping this is “mantra: Error loading geometry /obj/particles_small/object_merge1 from stdin”, but I am not sure why it is happening.

Man, this is odd all over the place. Nick, can you send me a test scene? I probably should have asked for this at the beginning. I’ll be seeing if I can reproduce, then passing it to the integration team.

Dumb question, but are there by any chance multiple jobs writing the entire IFD sequence, and therefore you’re getting files being written/rewritten at random times?

Hi,

Antoine - The render job is dependent on the creation job, so it doesn’t start until after the files are completed.

Edwin - I’ll try and make you a file, but this is consistent behaviour across all our submissions.

Nick

Right, but did you check that the plugin that’s running the creation job (which I assume is running on every frame rather than once for the whole job) isn’t being told to generate IFDs for the whole sequence? E.g. frame 1 generates IFDs for frames 1-100, frame 2 generates IFDs for frames 1-100, and so on, and they step on each other.
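
One quick way to eyeball that from Houdini’s Python shell is to check the unexpanded disk-file path on the export ROP for a per-frame variable. This is only a sketch: the node path /out/mantra1 and the parameter name "soho_diskfile" are assumptions, so adjust them to your scene:

import hou

rop = hou.node("/out/mantra1")                           # adjust to your export ROP
raw_path = rop.parm("soho_diskfile").unexpandedString()  # e.g. "$HIP/ifds/shot.$F4.ifd.sc"

if "$F" not in raw_path:
    print("Warning: no frame variable in the IFD path; every task would write the same file:")
    print("  " + raw_path)
else:
    print("Per-frame IFD path looks OK: " + raw_path)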

One other avenue of investigation is whether the filesystem is buffering; e.g. even if it only buffers for a second, the render job might be getting to the IFD before it’s been fully flushed to disk. This can happen with asynchronous writes, for example.
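
If that does turn out to be the culprit, a common workaround is a pre-render guard that waits for the file size to stop changing before handing the IFD to the renderer. A rough sketch (the timings are arbitrary, and this is not Deadline plugin code):

import os
import time

def wait_until_stable(path, checks=3, interval=1.0, timeout=60.0):
    # Return True once the file exists and its size has been unchanged for `checks` polls.
    stop_at = time.time() + timeout
    last_size, stable = -1, 0
    while time.time() < stop_at:
        size = os.path.getsize(path) if os.path.exists(path) else -1
        stable = stable + 1 if (size == last_size and size >= 0) else 0
        if stable >= checks:
            return True
        last_size = size
        time.sleep(interval)
    return False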

Lastly, when you go and inspect an IFD that failed: if you requeue the frame, does it then run successfully?

Hi Antoine,

I haven’t checked that, but I’m pretty sure it was working fine; the original job finished some time before the render job. And I can run the IFD through Mantra on the command line using the command Deadline uses. It just doesn’t run through the plugin.

Thanks
Nick

Yeah, I’m seriously hoping I can reproduce this over here. The issue is that I’m not very skilled with Houdini, and Justin R, who’s spent the most time with it, is still away at college. :smiley:

Having something I can just submit through the farm gives me something I can tweak and play with.

OK, then that’s not likely the problem. Rereading your original post: can you successfully render a busted frame on a different machine? If you blacklist all the machines that failed a frame and rerender the whole thing, does it go through? If some other frames fail and you blacklist those machines too (and so on), is there a set of machines that succeeds?

Hi Edwin,

So I couldn’t send you the original file, as it would have been too large and is on an active project, and I was also struggling to make a file that would error, as it seems to need a certain level of complexity! I have now managed to create a file that errors consistently, though. It also seems to be the same frames that fail each time, even though different machines create the IFDs, and all machines fail on the Mantra render. I can complete the frames by running a command line job, so I know the IFDs are OK. I’ve uploaded the project file to WeTransfer (350 MB): https://we.tl/LXRP2wst6j. We’re using Houdini 15.0.313 and Deadline 8.0.11.2, all on Windows 7. Let me know if you need any more info…

The frames that fail for me are: 70,89-90,93-94,102,121,124,128,145-146,151,158,173,190,195,210

Thanks

Nick

Downloading now. Installing 15 on my machine to see if I can test locally via the Slave.

Okay, no luck over here, for a few reasons: it’s complaining about “/out/logo/merge1”, and there were a very large number of unknown attributes when the hip file loaded.

Maybe it’d be better if we set up a call since this is so hard to reproduce? It’s been fairly quiet over here while people are ramping back up after the break.
