I am regularly seeing instances where the deadlineslave process will completely stop reading from the pipe it has connected to the stdout handle of the child process it has spawned to execute a task. The result of this is that once the pipe buffer has been filled, the child process will block indefinitely trying to write to it.
If I manually empty the buffer (using cat /proc/SLAVE_PID/fd/PIPE_FD), the child process can continue running, although all of the data that was in the pipe is obviously lost from the task log. Until the child process exits, this pattern can repeat if the child ends up writing a lot more data to its stdout, since the slave is still ignoring the pipe. Once the child exits, the slave continues and seems to pick up the next task normally, but the same slave can hit the problem again some time later.
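For reference, the only non-obvious part of that workaround is figuring out which of the slave’s fds is the stdout pipe. A rough sketch of how to do it (the PID and fd number below are placeholders; the drain loop is just the Python equivalent of the cat above):

    import os

    slave_pid = 12345  # placeholder: PID of the stuck deadlineslave process
    fd_dir = "/proc/%d/fd" % slave_pid

    # List the slave's open file descriptors and show which ones are pipes.
    for fd in sorted(os.listdir(fd_dir), key=int):
        target = os.readlink(os.path.join(fd_dir, fd))
        if target.startswith("pipe:"):
            print("fd %s -> %s" % (fd, target))

    # Drain one of them so the blocked child can resume writing. Like cat,
    # this blocks while the pipe is empty, and everything read here is lost
    # from the task log.
    pipe_fd = 42  # placeholder: the fd number identified above
    with open(os.path.join(fd_dir, str(pipe_fd)), "rb") as pipe:
        while pipe.read(65536):
            pass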
So, we took a look in the Deadline code, and there are a few modes in which Deadline can capture standard output. The problem today is that the mode is hard-coded. Ryan’s looking into how we can make that a toggleable option in the render script so you can test it.
I’ll try building a Fedora 19 VM today and see if I can reproduce what you’re seeing. You’ve brought it up before, so we definitely need a way to reproduce and deal with this problem.
Alright, that option came in during the 7.1 beta, so you’re good.
The property to switch the modes is “AsynchronousStdout”, and it defaults to True. We can also disable stdout redirection altogether with “StdoutRedirection = False”, but that’s going to disable everything, including progress reporting.
So! Next step is to set that property in the ManagedProcess. For Maya, I think you can add this to InitializeProcess():
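Something along these lines; the first few lines are just typical settings the Maya plugin’s ManagedProcess already has in InitializeProcess(), and the only new line is the AsynchronousStdout assignment (assuming the property is exposed on the process object the same way as the others):

    def InitializeProcess( self ):
        # Existing settings in the Maya plugin's ManagedProcess subclass.
        self.UseProcessTree = True
        self.StdoutHandling = True
        self.PopupHandling = True

        # New: switch stdout capture to the synchronous mode for testing.
        self.AsynchronousStdout = False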
OK, I can give that a try. We’re using a custom “Simple” plugin though, so does that attribute still apply? And what potential problems could be introduced by setting it to False? I don’t want to end up doing something like blocking the slave’s main event loop if there is no output generated for a long period of time.
I think the difference between the two is that one of them does a blocking request. It might make sense to make a copy of the plugin and run a parallel job as a test if you have the cycles to spare. I don’t want you folks missing anything either.
OK, and presumably that’s a blocking read with a timeout, so long periods of inactivity won’t completely hose the slave? I’d just like to be sure before I go too deep.
I will try to do a sandboxed test, but I’ll have to wait until a slave starts to exhibit the problem, and then submit a job to that slave with the updated plugin.