AWS Thinkbox Discussion Forums

Slaves losing contact with their process

I’ve been seeing this on our hosts that are running the 7.2 slave (7.2.0.13 on Ubuntu 14). They’ll occasionally get into a weird state where they apparently lose contact with the process they’ve spawned. In some cases the process continues to run to completion, but at some point the slave stops getting its output and never notices when it finishes. Instead, it just stays in a running state until manually kicked. Here’s example output from the log of a task that got into that state this morning. Vray updates the log every 1/5 of a second, regardless of whether anything has happened.

[2015/Aug/17|01:01:19] Rendering image...:  0.39% [ 0h 15m 20.6s] [64h 41m 55.0s est]
[2015/Aug/17|01:01:19] Rendering image...:  0.39% [ 0h 15m 20.8s] [64h 42m 45.5s est]
[2015/Aug/17|01:01:19] Rendering image...:  0.39% [ 0h 15m 21.0s] [64h 43m 36.4s est]
[2015/Aug/17|01:01:19] Rendering image...:  0.39% [ 0h 15m 21.2s] [64h 44m 27.0s est]
[2015/Aug/17|01:01:20] Rendering image...:  0.39% [ 0h 15m 21.4s] [64h 45m 17.6s est]
[2015/Aug/17|01:01:20] Rendering image...:  0.39% [ 0h 15m 21.6s] [64h 46m  8.2s est]
[2015/Aug/17|01:01:20] Rendering image...:  0.39% [ 0h 15m 21.8s] [64h 46m 58.8s est]
2015-08-17 01:08:54:  Skipping thermal shutdown check because it is not required at this time
2015-08-17 01:21:05:  Skipping thermal shutdown check because it is not required at this time
2015-08-17 01:33:21:  Skipping thermal shutdown check because it is not required at this time
2015-08-17 01:45:42:  Skipping thermal shutdown check because it is not required at this time
2015-08-17 01:58:08:  Skipping thermal shutdown check because it is not required at this time
2015-08-17 02:22:50:  Skipping thermal shutdown check because it is not required at this time
2015-08-17 02:48:07:  Skipping thermal shutdown check because it is not required at this time
2015-08-17 03:38:05:  Skipping thermal shutdown check because it is not required at this time
2015-08-17 04:02:49:  Skipping thermal shutdown check because it is not required at this time
2015-08-17 04:15:13:  Skipping thermal shutdown check because it is not required at this time
2015-08-17 04:27:37:  Skipping thermal shutdown check because it is not required at this time
2015-08-17 04:52:30:  Skipping thermal shutdown check because it is not required at this time
2015-08-17 05:29:51:  Skipping thermal shutdown check because it is not required at this time
2015-08-17 06:19:42:  Skipping thermal shutdown check because it is not required at this time
2015-08-17 06:32:14:  Skipping thermal shutdown check because it is not required at this time
2015-08-17 06:57:32:  Skipping thermal shutdown check because it is not required at this time
2015-08-17 07:22:30:  Skipping thermal shutdown check because it is not required at this time
2015-08-17 07:47:37:  Skipping thermal shutdown check because it is not required at this time
2015-08-17 08:24:36:  Skipping thermal shutdown check because it is not required at this time
2015-08-17 08:37:08:  Skipping thermal shutdown check because it is not required at this time
2015-08-17 09:02:10:  Skipping thermal shutdown check because it is not required at this time

You can see that the slave stops getting output from the vray process, but still thinks it’s alive, since it isn’t performing any of its idle cleanup tasks. I confirmed that vray was still rendering on the machine. In this case, the deadlineslave process was using about 60% of a CPU (vs. the usual 2%), so maybe it was stuck in an infinite loop or something?

I can’t really reproduce it, since it’s random, but I see it a few times a day, and it only happens on the few slaves we have testing 7.2. Sorry I don’t have much to go on. Let me know if there’s anything else that would help you track the problem down, and I’m happy to try to debug the next time it happens.

Does this happen with multiple different renderers? Or is it always with Vray? Are you guys using Vray standalone or vray for maya/max/etc?

Also, would it be possible to get the full log for the job you posted here? I know some of our clients have had issues with StdOut buffers getting filled up when using verbose renderers, and subsequently getting ‘stuck’; if that’s what’s happening here, it’d be good to see the plugin initialization output (which would be higher up in the log).

In the meantime I’ll take a longer look at the output-handling code to see if I can find what it might be getting stuck on.

I’ve seen it both on standalone vray jobs and on maya jobs. I don’t remember what job this was, but I’ll try to dig it up again later today.

Here are two examples - one where the vray process started and then got lost, and one where it seems like the vray process was never actually able to output anything.
log2.txt (2.08 KB)
log1.txt (463 KB)

We’re still seeing this on 7.2.0.14. We frequently see maya processes that get to this line:

2015-09-10 23:27:23: 0: INFO: Waiting until maya is ready to go

and then hang there forever. Is there any way to automatically requeue a frame based on its last line of output and how long ago it appeared (i.e. if that was the last line of output and it’s been more than 15 minutes since then)? Every morning I come in to find a few frames hung, and today I was unlucky: it hit a couple of simulations that long downstream processes depend on.
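
In the meantime, the stopgap I’m considering is a small watchdog script that runs outside of Deadline and flags tasks whose log file hasn’t been written to in a while. Here’s a rough sketch; the log directory is a placeholder for our environment, and the actual requeue step (presumably a deadlinecommand call) is left as a comment since I haven’t pinned down the exact invocation:

    # Rough watchdog sketch (not a built-in Deadline feature): flag task logs
    # that have gone quiet for too long. LOG_DIR is a placeholder for wherever
    # the slave writes its per-task logs on this host.
    import os
    import time

    LOG_DIR = "/var/log/deadline/slave"   # placeholder path
    STALE_AFTER = 15 * 60                 # seconds without new output

    def find_stale_logs(log_dir, stale_after):
        """Return paths of log files that haven't been modified recently."""
        now = time.time()
        stale = []
        for name in os.listdir(log_dir):
            path = os.path.join(log_dir, name)
            if os.path.isfile(path) and now - os.path.getmtime(path) > stale_after:
                stale.append(path)
        return stale

    if __name__ == "__main__":
        for path in find_stale_logs(LOG_DIR, STALE_AFTER):
            print("possibly hung task (no output for 15+ minutes): %s" % path)
            # The follow-up would be to map the log back to its job/task and
            # requeue it with deadlinecommand; the exact arguments depend on
            # the Deadline version, so that step is intentionally omitted.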

This hit us really badly over the weekend, and we lost about half of our farm to hung processes. I’ve had to downgrade us to 7.1 for now. Please let me know if anyone has ideas on what might be causing the maya process to hang when launching (or how to even begin to debug it) and I’ll run a test with a couple of our hosts. Thank you!

Yeah, sorry for letting this sit for so long :(

After looking at your logs and the plugin code, I think this might be related to very long output lines going through standard output. In log1.txt I noticed a line that’s 227k (!) characters long. I suspect that because our default code parses plugin output on a per-line basis, lines of that size fill up the output buffers before a line ending is ever reached.
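
To illustrate what I mean, here’s a standalone Python sketch (not our actual handler code): a reader that uses readline() on the child’s stdout blocks until a newline shows up, while chunked reads keep draining the pipe even though the child never ends its line:

    # Standalone sketch: a child that emits one enormous "line" with no newline,
    # and two ways of reading it.
    import subprocess
    import sys

    # Hypothetical child process: writes progress continuously, never a newline.
    child_code = (
        "import sys, time\n"
        "for _ in range(500):\n"
        "    sys.stdout.write('x' * 512)\n"
        "    sys.stdout.flush()\n"
        "    time.sleep(0.01)\n"
    )

    proc = subprocess.Popen([sys.executable, "-c", child_code], stdout=subprocess.PIPE)

    # Per-line reading: readline() won't return until a newline (or EOF) arrives,
    # so a monitor built this way sees nothing for the entire run.
    #   line = proc.stdout.readline()

    # Chunked reading: read1() returns whatever bytes are currently available,
    # so output keeps flowing even though the line never ends.
    while True:
        chunk = proc.stdout.read1(4096)
        if not chunk:
            break
        print("got %d bytes of child output" % len(chunk))

    proc.wait()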

There is a potential workaround if this is the issue: you’ll have to make a modification to your affected plugins (Maya Batch, Maya Cmd, Vray Standalone, etc.). More specifically, you’ll need to edit each ‘InitializeProcess’ function in those plugin files to include the following line:

self.AsynchronousStdout = False

As an example, the MayaBatch plugin (located at ‘plugins/MayaBatch/MayaBatch.py’ in the Deadline repository) has a couple of these. Here’s the first one, in the MayaBatchPlugin class (around line 114), with the above line added:

    ## Called by Deadline to initialize the process.
    def InitializeProcess( self ):
        # Set the plugin specific settings.
        self.SingleFramesOnly = False
        self.PluginType = PluginType.Advanced
        self.AsynchronousStdout = False # <--- Add this line

There’s also another one in the MayaBatchProcess class (around line 1310), which represents the Maya process itself:

    def InitializeProcess( self ):
        self.ProcessPriority = ProcessPriorityClass.BelowNormal
        self.UseProcessTree = True
        self.PopupHandling = True
        self.StdoutHandling = True
        self.AsynchronousStdout = False # <--- Add this line

        #....

This change makes the plugins process StdOut on a more regular basis, rather than strictly per line, and will hopefully solve this problem for you guys.

Let me know how it goes!

Cheers,
Jon

That makes sense for the vray job - I’ll give that a shot. I think the maya bug is a different issue altogether, so I’ll move that to a different thread. Thanks!
