Slave causes child process to hang

Got another slave issue that has been showing up on various slaves for a while now (this is using 6.2.0 on Linux). Here’s the situation on one slave that’s currently stuck like this:

  • Slave has been running for a little over 6 days (i.e. not long).
  • From the user’s perspective, it finishes a task, dequeues another task, starts the child process, and then everything just stops.
  • The process it spawns is doing some bootstrapping, and then starting a Maya render process (this is a custom plugin).
  • Memory usage is not high.

The process tree looks roughly like this:

slave
 \_ bash
     \_ python
         \_ maya

  • When inspecting the tree of processes using strace, it appears that the slave is failing to properly flush or otherwise handle output from its child process(es).
  • The Maya process is blocked on a write syscall (trying to write some of its normal process output to stdout during startup).
  • The slave and launcher processes are both sitting on futex(xxx, FUTEX_WAIT_PRIVATE, ...) syscalls.
  • Most tellingly: The output shown in the slave log (when connecting remotely or just tail-ing its local log file) is not up to date with the latest output from the child processes.
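
For anyone wanting to see the general mechanism outside of the slave: a child blocked in write() while its parent waits elsewhere is what a full stdout pipe looks like. The following is a minimal sketch, not the slave's actual code. It assumes a hypothetical child (CHILD_CODE) that writes about 1 MiB to stdout, which is more than a typical Linux pipe buffer holds, and the helper names are made up for illustration.

```python
import subprocess
import sys
import threading

# Hypothetical stand-in for the render process: writes ~1 MiB to stdout, then exits.
CHILD_CODE = "import sys; sys.stdout.write('x' * (1 << 20)); sys.stdout.flush()"


def run_without_draining(timeout=2.0):
    """Start the child with a stdout pipe but never read from it.

    Returns True if the child is still running after `timeout` seconds,
    i.e. it is stuck in write() because the pipe is full.
    """
    proc = subprocess.Popen([sys.executable, "-c", CHILD_CODE],
                            stdout=subprocess.PIPE)
    try:
        proc.wait(timeout=timeout)
        return False
    except subprocess.TimeoutExpired:
        return True
    finally:
        # Clean up whether or not the child was stuck.
        proc.kill()
        proc.stdout.close()
        proc.wait()


def run_with_draining(timeout=10.0):
    """Same child, but a reader thread keeps emptying the pipe."""
    proc = subprocess.Popen([sys.executable, "-c", CHILD_CODE],
                            stdout=subprocess.PIPE)
    chunks = []

    def drain():
        # Read until EOF so the child never finds the pipe full for long.
        for chunk in iter(lambda: proc.stdout.read(65536), b""):
            chunks.append(chunk)

    reader = threading.Thread(target=drain, daemon=True)
    reader.start()
    proc.wait(timeout=timeout)
    reader.join(timeout=timeout)
    proc.stdout.close()
    return proc.returncode, sum(len(c) for c in chunks)
```

Calling run_without_draining() should report the child as stuck, while run_with_draining() should return exit code 0 and the full byte count. Whether the slave's reader is actually stalling in this way is an assumption; the sketch only shows that the symptoms match.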

If I cancel one of these stuck tasks (or it times out after being stuck for the duration of the job’s task timeout), the slave sometimes picks up its next task without incident, even if that ends up being the same task. Other times, the hanging behavior will repeat itself, wasting another N hours.

Hey sir! Sorry for the delay here, I meant to respond yesterday morning.

Thanks for being so thorough here. I guess the next step is working through each part of the chain. I’ve seen weird problems with standard output going missing for Python running through Bash.
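
One possible way output can go missing in a chain like this (an assumption, not a diagnosis of this case) is Python's own stdout buffering: when stdout is a pipe rather than a terminal, Python block-buffers it, so anything still in the buffer is lost if the process exits abruptly. A minimal sketch, using a hypothetical child that calls os._exit() to skip the normal flush:

```python
import os
import subprocess
import sys

# Hypothetical child: prints one line, then exits abruptly via os._exit(),
# which skips the normal flush of Python's stdout buffer.
CHILD_CODE = "import os; print('hello'); os._exit(0)"

# Make sure an inherited PYTHONUNBUFFERED setting does not mask the effect.
ENV = {k: v for k, v in os.environ.items() if k != "PYTHONUNBUFFERED"}


def capture(extra_args=()):
    """Run the child with stdout attached to a pipe and return what arrives."""
    cmd = [sys.executable, *extra_args, "-c", CHILD_CODE]
    return subprocess.run(cmd, stdout=subprocess.PIPE, env=ENV).stdout


buffered = capture()          # stdout is a pipe, so Python block-buffers it
unbuffered = capture(["-u"])  # -u forces unbuffered stdout
```

Here buffered should come back empty while unbuffered contains the line, so running the Python step with -u (or flushing explicitly) is a cheap thing to rule out.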

Would you be willing to post an example of this chain for me so I can try running it on our farm? Basically just the Bash and Python scripts.

I’m bad for missing forum stuff, so if you ever need to ping me, you can send things over to edwinamsler@thinkboxsoftware.com.

Hey Edwin, no worries about the delay. I’m going to follow up with you via email.