
Maya processes hanging

We’re seeing a lot of Maya processes hang at this line when they start:

2015-09-10 23:27:23: 0: INFO: Waiting until maya is ready to go

The slave then sits at that line endlessly, and if we log into the machine we see a Maya process sitting on ~3 GB of RAM and using no CPU cycles at all. That line of output comes before any of the Maya initialization, so it seems like the slave is somehow never able to get in touch with the process. That’s just a total guess, though.

Is there anything you can think of that might cause this, or any way to debug the issue? Thanks!

Sounds like it could be caused by the same underlying issue as this: forums.thinkboxsoftware.com/vie … 11&t=13380

A couple of questions (based on my running hypothesis that once this happens to a slave process, it becomes a persistent condition):

  • If a slave hits that problem and you manually re-queue the task, will it repeat the same behavior on subsequent tasks?
  • Does restarting the slave process cause it to be able to run Maya tasks properly again?
  • Do you notice that it only happens to slaves that have been running for a while?

It seems like the first time or two that I requeue it, it will work for a while before the hang repeats. But eventually requeueing stops helping, and it hangs every time. This is just my impression, though; I haven’t been keeping a good log of when it happens (I just wrote an automated monitoring tool this morning).

I haven’t tried this. I’ll give it a shot next time I see one exhibiting the behavior, but I rebooted all of the hosts that were having the problem this morning, so it may be a bit.

It does seem like the problem manifests itself more as time goes by, but again, I haven’t actually looked at the trends closely enough to say for sure.

I’ll set our Maya plugin to disable asynchronous output now and see if that helps things. Thank you!

I haven’t actually tested the AsynchronousStdout change yet to see if it helps, because I’m a little afraid of how the slave may block if there are long periods of time between output from the child process. I’d be interested to hear whether you notice anything along those lines.

So the AsynchronousStdout property is actually a bit of a misnomer – regardless of whether it is True or False, the StdOut handling will be asynchronous (it shouldn’t block the Slave in any way). The only thing that changes is how it’s implemented.

When it’s True (the default), we use the Process.BeginOutputReadLine() function. This triggers the callback when a newline character is reached (which is why StdOut with super-long lines can be a problem).

When it’s False, we instead use the Stream.BeginRead() function. Using this function, we (asynchronously) read in 1KB of data from the StdOut stream at a time, regardless of newlines.
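To make the difference concrete, here is a rough Python analogue of the two strategies (Deadline’s actual implementation is .NET, and the command at the bottom is just a placeholder):

```python
import subprocess

def read_by_line(cmd):
    """Roughly analogous to Process.BeginOutputReadLine(): nothing is
    delivered until a newline arrives, so one enormous line stalls
    every callback behind it."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    for line in proc.stdout:            # blocks until b"\n" (or EOF)
        print("LINE:", line.rstrip())
    proc.wait()

def read_by_chunk(cmd, size=1024):
    """Roughly analogous to Stream.BeginRead() with a 1KB buffer:
    returns as soon as *any* data is available, up to `size` bytes,
    newline or not."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    while True:
        chunk = proc.stdout.read1(size)
        if not chunk:                   # empty read means the stream closed
            break
        print("CHUNK:", chunk)
    proc.wait()

# read_by_chunk(["Render", "-r", "arnold", "scene.mb"])   # placeholder command
```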

So I wouldn’t be too worried about this switch causing the Slave process to block. We really should rename this property to something more appropriate, though…

That makes sense. In good news, since changing this option in both our Vray and MayaBatch plugins we haven’t seen any hanging processes at all! Thanks for your help - I think we can probably call this one done!

Thanks for the update and clarification, Jon. I do have one more question, though.

Since there’s no timeout parameter to the BeginRead method, doesn’t that mean that if a process generates 997 bytes of output, no stdout callbacks will fire for any of it until the read fills the full 1KB?

Good to hear you guys haven’t been experiencing this problem since switching over :)

And Nathan, I’m not totally sure on that, to be honest – the 1KB is the max read buffer size, and the function can be called when there’s less than 1KB there. It definitely gets called when the stream is closed, regardless of how much data is left. Other than that, I’d have to test to see whether it gets called every time a Write happens on the Stream, whether it waits till the buffer is full, or whether there’s an internal timeout that flushes it after a while. The MSDN docs are unfortunately mum on how often it gets triggered (and this would be the Mono implementation anyway, which is likely slightly different from how MS did it).

Either way though, the situation you describe could happen regardless of how Deadline is handling output, depending on how the output is buffered on the writing end (child process, e.g., in Maya, Nuke, etc.).
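For example, a child with default block buffering holds its own output back when stdout is a pipe, no matter how the parent reads it. A minimal sketch, using a Python child purely for illustration:

```python
import subprocess, sys, time

# The child prints a short message every second but never flushes.
# Because its stdout is a pipe (not a terminal), it gets block-buffered,
# so nothing reaches the pipe until the buffer fills or the child exits.
child_code = (
    "import time\n"
    "for i in range(5):\n"
    "    print('tick', i)\n"
    "    time.sleep(1)\n"
)

proc = subprocess.Popen([sys.executable, "-c", child_code],
                        stdout=subprocess.PIPE)
start = time.time()
for line in proc.stdout:
    # All five lines arrive together, roughly 5 seconds in, when the
    # child exits and its buffer is flushed.
    print("%.1fs: %s" % (time.time() - start, line.rstrip()))
proc.wait()
```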

Cheers,
Jon

Hey Jon,

Just to close the loop on this, I ran some tests with processes that generate small amounts of infrequent output (e.g. Arnold renders), and there doesn’t seem to be any kind of buffering delay in the output handling callbacks, so I’m going to turn this loose for all our jobs. The language in the .NET docs is definitely confusing/misleading.
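For reference, this is roughly the shape of the test (a Python stand-in for the sparse-output render process):

```python
import subprocess, sys, time

# Child: emits one short, explicitly flushed line every 2 seconds,
# standing in for a render process that prints sparse progress.
child_code = (
    "import sys, time\n"
    "for i in range(5):\n"
    "    sys.stdout.write('frame %d done\\n' % i)\n"
    "    sys.stdout.flush()\n"
    "    time.sleep(2)\n"
)

proc = subprocess.Popen([sys.executable, "-c", child_code],
                        stdout=subprocess.PIPE)
start = time.time()
while True:
    chunk = proc.stdout.read1(1024)   # 1KB max, returns on any data
    if not chunk:
        break
    # If reads only fired once 1KB had accumulated, these timestamps
    # would all bunch up at the end; instead each line arrives ~2s apart.
    print("%.1fs: %r" % (time.time() - start, chunk))
proc.wait()
```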

It would be nice to know A) if there are any potential downsides to this method over BeginOutputReadLine (maybe it’s just a question of old vs. new async APIs), and B) why the default method is encountering this issue.

Thanks.

Hey Nathan,

As far as I know, there aren’t any downsides to this method – looking at the commit log, I suspect it was just a “new” vs “old” thing; the BeginOutputReadLine method is a lot cleaner from a code perspective. I will confirm with Ryan tomorrow though, to make sure this wasn’t done to actually fix a bug that was present in the ‘old’ way of handling output.

As for why the default method is encountering this problem, I’ve kinda touched on it a bit already. I believe it’s because that function waits for a newline character to appear before triggering the output processing – if some software/plugin that the Slave is running writes a giant amount of text on a single line, you can run into the stdout buffer getting full before a newline char comes along. I suspect this is actually a bug in Mono’s implementation of this function, since I don’t think we’ve ever seen this happen on Windows (it should probably call the handler even without a newline if the buffer is full).
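A toy Python illustration of that newline dependence (Python’s readline keeps draining the pipe into its own buffer, so this only shows the delayed delivery, not the suspected Mono-side hang):

```python
import subprocess, sys, time

# Child: writes 200KB of output with no newline, waits 10 seconds,
# then finally terminates the line.
child_code = (
    "import sys, time\n"
    "sys.stdout.write('x' * 200000)\n"
    "sys.stdout.flush()\n"
    "time.sleep(10)\n"
    "sys.stdout.write('\\n')\n"
    "sys.stdout.flush()\n"
)

proc = subprocess.Popen([sys.executable, "-c", child_code],
                        stdout=subprocess.PIPE)
start = time.time()
line = proc.stdout.readline()   # line-oriented read: nothing is handed
                                # to us until the '\n', ~10 seconds later
print("%.1fs: got %d bytes" % (time.time() - start, len(line)))
proc.wait()
```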

If Ryan confirms that this change in output handling was just a refactor (and not a fix for an obscure bug), I think we should actually switch the AsynchronousStdout default to False, and maybe rename it while we’re at it :)

I’m not entirely convinced that would explain it, for a couple of reasons:

  1. The vast majority of the time, when I see this issue, none of the output from the task’s process is even in the slave log. Instead, the last line is just “INFO: Process is now running”, and when I manually flush the child process’ output buffer, I get all of the output from the process, indicating that no output has actually been read and processed.
  2. Our render code doesn’t print out any massive lines of output (certainly not early in the process), and all of our jobs are run through the same stack.

That may very well be the case, but I wanted to add the above notes, since they seem to contradict the “long output lines” idea.

What you’re describing might be a different issue, then, one that wouldn’t necessarily be solved by this. It’d definitely be good to try either way, just in case it does fix it – I just confirmed with Ryan that he made the change purely as an attempt to clean up our code, not due to any actual problems with our old method of handling output.

Out of curiosity, how are you manually flushing the stdout buffer? And for what applications/plugins have you been experiencing this? If you’re flushing the buffer on the side of the app that’s writing to the stdout stream, then that explains why Deadline isn’t seeing the output (it hasn’t actually been written to the stream yet, so there’s nothing for Deadline to read).

If that is what’s happening, this blog post goes into the ‘why’ a bit. If it is Python that you’re using, you should just be able to add the ‘-u’ flag, which will force it into unbuffered mode (I think there’s an env variable you can set as well). If it’s Maya or something else, we’ll have to look into whether or not they have a similar flag that would achieve the same result.
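For the Python case, either of these works (the environment variable is PYTHONUNBUFFERED); shown with plain subprocess here rather than Deadline’s plugin API, and the script name is just a placeholder:

```python
import os, subprocess, sys

script = "my_render_script.py"   # hypothetical job script

# Option 1: the -u flag forces unbuffered stdout/stderr for this run.
proc = subprocess.Popen([sys.executable, "-u", script],
                        stdout=subprocess.PIPE)

# Option 2: PYTHONUNBUFFERED=1 in the child's environment does the same
# thing, which is handy when you can't change the command line itself.
# env = dict(os.environ, PYTHONUNBUFFERED="1")
# proc = subprocess.Popen([sys.executable, script],
#                         stdout=subprocess.PIPE, env=env)

for line in proc.stdout:   # output now arrives as the child produces it
    print(line.rstrip())
proc.wait()
```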

I think Edwin was handling your initial ticket with this issue, so I’m not totally familiar with the details – you’ll have to forgive me if I’m repeating questions he’s already asked.

OK, good to know. I’ve temporarily reverted to the default behavior, and I’m waiting for the issue to reveal itself again so I can poke it some more. However, in the time period where I was setting AsynchronousStdout to False, I didn’t see any occurrences of the issue. Not conclusive evidence at all yet, but I’m continuing to monitor everything.

I covered this in my original thread, but I was flushing it by cat-ing the pipe FD connected to the task process’ stdout handle (gotta love the /proc filesystem), which the slave process should have been reading from the whole time. Once I do this, the child process continues running (since the write() call it was blocked on can actually complete), but the slave still ignores its output. In other words, the issue is at the slave level.
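For anyone who hasn’t seen the original thread, the drain looks roughly like this (the PID is hypothetical; opening the child’s stdout entry under /proc attaches a second reader to the same pipe):

```python
import os

pid = 12345   # hypothetical PID of the stuck task process

# /proc/<pid>/fd/1 points at the pipe the child writes its stdout to.
# Opening it gives us another handle on that pipe, so we can drain
# whatever the slave never read (same idea as `cat /proc/<pid>/fd/1`).
fd = os.open("/proc/%d/fd/1" % pid, os.O_RDONLY)
try:
    while True:
        data = os.read(fd, 4096)   # blocks when drained, just like cat
        if not data:               # EOF once the child closes its stdout
            break
        os.write(1, data)          # echo the drained output to our stdout
finally:
    os.close(fd)
```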

We don’t use any of Deadline’s built-in job plugins; instead, we run all of our jobs through a single custom plugin. I’ve caught slaves ignoring output from all sorts of underlying internal job types (Maya, Arnold, Nuke, arbitrary Python callables, etc.). However, unless the task process produces enough output to fill the output pipe buffer, the task is still able to complete properly; its job report is just basically empty (because the slave never bothered reading any of its output). That’s why MtoA renders are usually the ones that get stuck: they produce a lot of output.

Got it; I actually didn’t know you could do that. But now that you mention it, I remember you saying as much when Edwin and I were out there, too. You’re right that this would definitely be an issue on our side, in that case.

If you do catch it doing that again, could you get us a copy of the cat output? I’d be curious to see what it looks like, and it should hopefully be trivial to reproduce if we have the actual output that caused it.

I don’t think the output is going to help you out here… it’s pretty basic stuff (no really long lines, special characters, etc.). The problem tends to arise after a slave has been running for several days (I don’t have any concrete metrics on what the window is), and then it just starts happening to all jobs.

One thing I have noticed with the current slave I have in this state is that, if I connect to the slave’s log and watch as it picks up a task, it will actually read at least some of the output properly (though it seems to give up as soon as the process stops producing output for a short interval). However, if I let the slave pick up a task and then connect to the log once it’s running, it doesn’t start reading output (the last message in the log is “Process is now running”).
