
Modo issues running on multiple slave instances

Early last year we purchased a Dell blade setup that holds 4 blades, each with 2 processors. The idea was that we could run multiple slave instances on each blade, giving us some configuration flexibility and easing some of the financial pain with regard to licensing. In general this has worked out so far, except when it comes to Modo. We tend to get two kinds of errors from these machines alone when running Modo jobs.

  1. Jobs that fail within 1 minute of the task being started, with no real error message other than:
Error: Monitored managed process "modo0" has exited or been terminated.
  2. Jobs that error out after a task has finished rendering because Modo failed to save the image to disk (see the path-check sketch after this list):
Error: Modo render failed to save output. Please make sure all defined output paths are accessible
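
For the second error, one quick way to rule out permissions or drive-mapping problems is to test each output directory from the slave machine itself. This is just a minimal sketch I've been using, not anything from Deadline or Modo; the output paths below are placeholders for whatever the render passes in the scene actually write to:

```python
import os
import tempfile

# Placeholder output directories; substitute the paths from the scene's render passes.
output_dirs = [
    r"\\fileserver\renders\testjob\beauty",
    r"\\fileserver\renders\testjob\bacteria",
]

for path in output_dirs:
    if not os.path.isdir(path):
        print("MISSING: %s" % path)
        continue
    try:
        # Actually create a file rather than trusting os.access(),
        # which can be unreliable on network shares.
        with tempfile.TemporaryFile(dir=path):
            pass
        print("WRITABLE: %s" % path)
    except OSError as e:
        print("NOT WRITABLE: %s (%s)" % (path, e))
```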

I had originally thought this had something to do with scenes that contain streaming anim cache files, like emReader or MDD files, but even scenes with just animated geo in them generate these failures. They don’t fail all the time, and I can’t quite figure out a pattern for when and why they fail. The same job can run through with a minimal amount of errors one time and then generate 100 errors the next run. The one weird thing I noticed in the job logs for the test job I’m running is that the failed tasks seem to be using far more resources than the tasks that succeeded.

I’ve packaged up the test scene and the anim cache files along with some debug materials and attached it to this post. I’m willing to work closely with support to try and figure this out so please let me know if there is any other file or test I can run to help out.
ModoRenderTest.rar (83.9 MB)

Hey James,

Unfortunately, I don’t have anything concrete for you at this time. So I’ll just throw some thoughts out that perhaps you’ve already considered.

The first type of error is super frustrating, because that’s just Deadline reporting that Modo has unexpectedly quit.

The second type is Deadline doing a catch-all on the render animation failing. Strictly speaking, any time we see “render.animation…failed with…” we catch that error and bubble it up. The interesting part about that one is that it failed on the 4th render pass (Bacteria). Did the other three render passes save their output properly? I wonder if there’s anything different about that one compared to the other three. Are all four render pass output paths accessible by the slave?
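
For reference, that catch-all works by scanning Modo’s stdout against a regex and failing the task on a match (in the real plugin this is done with Deadline’s stdout handler callbacks). Here’s a plain-Python stand-in showing the behavior; the regex is illustrative, not the exact shipped pattern:

```python
import re

# Stand-in for the plugin's stdout catch-all: any line matching this
# pattern is treated as a task failure and reported verbatim.
FAIL_PATTERN = re.compile(r"render\.animation.*failed with")

def scan_stdout_line(line):
    """Return an error report if the line trips the catch-all, else None."""
    if FAIL_PATTERN.search(line):
        return "Task failed, bubbling up: %s" % line.strip()
    return None

print(scan_stdout_line(
    "Command '!render.animation {*} group:{Passes}' failed with -2147483648"))
```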

I see in your logs that you’ve got Modo’s Render Threads set to ‘automatic’ while running multiple slaves on one machine. Have you fiddled around with that setting when testing on this machine? I wonder if the instances could be fighting each other for resources; I’m not certain how they’ll interact with each other in that setup.
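
To spell out the concern: with Render Threads on ‘automatic’, each Modo instance sizes its thread pool to every core it can see, so two slaves on one blade can end up spawning roughly twice as many render threads as there are cores. Rough back-of-the-envelope math, assuming you know the slave count per machine:

```python
import os

slaves_per_machine = 2              # assumption: two slave instances per blade
cores = os.cpu_count() or 1         # logical cores visible to each process

# On 'automatic', each Modo instance assumes it owns the whole machine,
# so the box is oversubscribed by roughly this factor:
total_threads = slaves_per_machine * cores
print("~%d render threads competing for %d cores" % (total_threads, cores))

# A fixed per-slave thread count avoids the contention:
threads_per_slave = cores // slaves_per_machine
print("suggested Render Threads per slave: %d" % threads_per_slave)
```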

As for the specific error code “Command ‘!render.animation {*} group:{Passes}’ failed with -2147483648”, The Foundry might be able to provide more insight on what this actually means in this context. My google-fu doesn’t seem to turn up anything related. This might take coordination from both ends to get to the bottom of this.
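
One small clue you can extract yourself: -2147483648 is exactly INT32_MIN, i.e. 0x80000000 when reinterpreted as an unsigned 32-bit value, which smells more like a generic or uninitialized failure code than a specific one. The conversion is easy to verify:

```python
code = -2147483648

# Reinterpret the signed 32-bit value as unsigned to get the raw hex form.
print(hex(code & 0xFFFFFFFF))   # 0x80000000 -> INT32_MIN
print(code == -2**31)           # True
```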

Cheers

Hey Morgan, thanks for the response!

Yeah, I’ve been frustrated by that as well. :laughing:

This is something I haven’t looked into. It might be a separate issue that needs investigating and just got caught up in my investigation of the larger issue with multiple instances. I’ll take a look at it again tomorrow morning.

I had not considered this. I’ll play around with those settings tomorrow and run some more tests.

My first attempts at debugging this began with The Foundry, since I was sure the problem was on their end, but I didn’t get any responses to my queries. I do have some direct dev contacts that I could ping about this, so maybe I’ll give that a try.

I’ll update the thread when I learn more.

Cheers

I’m looking into this now and I have a question for anyone who knows. At the per-slave level I see that I can set the CPU Affinity for each slave, and at the job level I can set it to use a certain number of threads per job. Is this essentially the same thing, just with a different focus?

Hello,

CPU Affinity lets you pin a slave to specific CPU cores; the thread count only controls how many threads the renderer spawns, not where they run. CPU Affinity would be used in the case of multiple slaves, or when you want to control which cores are used.
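
To illustrate the distinction outside of Deadline: affinity pins a process to particular cores at the OS level, while a thread count just sizes a worker pool. A minimal sketch using the third-party psutil module (supported on Windows and Linux); the core numbers are placeholders for however you split the blade between slaves:

```python
import psutil

proc = psutil.Process()  # current process; a slave gets pinned the same way

# Thread count = how many workers the renderer spawns (a renderer setting,
# not shown here). Affinity = WHICH cores those workers may run on.
proc.cpu_affinity([0, 1, 2, 3])     # e.g. slave 1 gets cores 0-3 (placeholder split)
print(proc.cpu_affinity())          # -> [0, 1, 2, 3]
```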

Gotcha, thanks Dwight.

Unfortunately, this doesn’t seem to have cleared up the quick-fail issues. :frowning:
