Got another slave issue that has been showing up on various slaves for a while now (this is using 6.2.0 on Linux). Here’s the situation on one slave that’s currently stuck like this:
- Slave has been running for a little over 6 days (i.e. not long).
- From the user’s perspective, it finishes a task, dequeues another task, starts the child process, and then everything just stops.
- The process it spawns is doing some bootstrapping, and then starting a Maya render process (this is a custom plugin).
- Memory usage is not high.
The process tree looks roughly like this:
    slave
     \_ bash
         \_ python
             \_ maya
- When inspecting the tree of processes using strace, it appears that the slave is failing to properly flush or otherwise handle output from its child process(es).
- The Maya process is blocked on a write syscall (trying to write some of its normal process output to stdout during startup).
- The slave and launcher processes are both sitting on futex(xxx, FUTEX_WAIT_PRIVATE, ...) syscalls.
- Most tellingly: The output shown in the slave log (when connecting remotely or just tail-ing its local log file) is not up to date with the latest output from the child processes.
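For anyone trying to reproduce the symptom outside of the slave, here is a minimal Python sketch of the general mechanism I suspect: a child whose stdout is a pipe blocks in write once the pipe buffer fills and the parent is not reading. This is not the slave's actual code, and I don't know how it handles child output internally. The child command and the function names (child_blocks_when_unread, run_with_drain) are made up for illustration, and the 1 MiB payload is just a size assumed to be larger than the pipe buffer.

```python
import subprocess
import sys
import threading

# Hypothetical stand-in for a chatty render process: writes N bytes to stdout, then exits.
CHILD = "import sys; sys.stdout.write('x' * int(sys.argv[1])); sys.stdout.flush()"


def child_blocks_when_unread(nbytes, timeout=1.0):
    """Spawn the child with stdout on a pipe and never read from it.

    Returns True if the child is still alive after `timeout` seconds, which is
    what happens when it is stuck in write() on a full pipe.
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD, str(nbytes)],
        stdout=subprocess.PIPE,
    )
    try:
        proc.wait(timeout=timeout)
        return False
    except subprocess.TimeoutExpired:
        return True
    finally:
        # Clean up whether or not the child got stuck.
        proc.kill()
        proc.stdout.close()
        proc.wait()


def run_with_drain(nbytes):
    """Same child, but a reader thread keeps emptying the pipe.

    Returns the number of bytes received from the child.
    """
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD, str(nbytes)],
        stdout=subprocess.PIPE,
    )
    chunks = []

    def drain():
        # Read until EOF so the child never stalls on a full pipe.
        for chunk in iter(lambda: proc.stdout.read(65536), b""):
            chunks.append(chunk)

    reader = threading.Thread(target=drain, daemon=True)
    reader.start()
    proc.wait()
    reader.join()
    proc.stdout.close()
    return sum(len(c) for c in chunks)
```

If the reader in the second function ever stopped (for example, because it was itself waiting on a lock), the child would end up in the same blocked write as in the first function, which would be consistent with the strace output above.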
If I cancel one of these stuck tasks (or it times out after being stuck for the duration of the job’s task timeout), the slave sometimes picks up its next task without incident, even if that ends up being the same task. Other times, the hanging behavior will repeat itself, wasting another N hours.