7.1.2.1, with Pulse running on Linux
The short version
Pulse got stuck in a bad state for almost 2 weeks and never recovered, and that seems to have caused scheduling to fail on machines running multiple slaves.
The long version
A few days ago, someone reported to me that Quicktime jobs weren’t picking up in Melbourne. These jobs currently run in Nuke on OSX, so I immediately suspected App Nap, but after checking, I found it was already disabled. I poked around some more and couldn’t find anything incriminating on the systems, and things like restarting the slaves and using the “Search For Jobs” command weren’t doing anything.
This morning, I noticed that one of the Nuke limit stubs was listed as being in use by one of these Quicktime slaves, even though none of them were actually picking up any jobs. Other jobs that required Nuke limits (e.g. actual Nuke renders) had been running fine, but those all run on machines with a single slave. I know Pulse is supposed to handle the clean-up of busted limits if it’s running, so I checked on Pulse and found that it was repeatedly vomiting out this loop of messages:
Attempting a hard kill of parent process with id 21 because it failed to exit.
Could not kill parent process with id 21 because 'kill' returning non-zero exit code: -1
Attempting a hard kill of parent process with id 18 because it failed to exit.
Could not kill parent process with id 18 because 'kill' returning non-zero exit code: -1
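For anyone wanting to check their own Pulse logs for the same loop, here is a rough Python sketch that counts the failed-kill messages per process id. It only assumes the message wording quoted above; the function name and the sample list are made up for illustration, and in practice you would feed it the lines of a real log file.

```python
import re
from collections import Counter

# Matches the "Could not kill" line exactly as it is worded in the log above.
KILL_FAILED = re.compile(r"Could not kill parent process with id (\d+)")

def count_failed_kills(log_lines):
    """Count 'Could not kill parent process' messages per process id."""
    counts = Counter()
    for line in log_lines:
        match = KILL_FAILED.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts

# Sample input copied from the messages in this post.
sample = [
    "Attempting a hard kill of parent process with id 21 because it failed to exit.",
    "Could not kill parent process with id 21 because 'kill' returning non-zero exit code: -1",
    "Attempting a hard kill of parent process with id 18 because it failed to exit.",
    "Could not kill parent process with id 18 because 'kill' returning non-zero exit code: -1",
]

for proc_id, count in sorted(count_failed_kills(sample).items()):
    print(f"id {proc_id}: {count} failed kill(s)")
```

An id whose count keeps climbing between runs would suggest Pulse is stuck retrying the same kill rather than making progress.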
I checked the Pulse machine and found that there were 3 ‘deadlinecommand’ processes that had been stuck for anywhere from 2 to 3 weeks. The messages in the Pulse log started on August 13th, and coincidentally, that was also the last date on any of the Pulse log files on the host machine, even though the process had continued running, and continued spitting out these messages.
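In case it helps anyone spot the same thing sooner, this is a minimal sketch of how long-running ‘deadlinecommand’ processes could be flagged on the Pulse host. It assumes a Linux `ps` that supports the `etimes` (elapsed seconds) column, and it matches on the full command line because the process may be listed under a wrapper or runtime rather than its own name. The function names, the one-day threshold, and the sample paths are all hypothetical.

```python
import subprocess

def parse_ps_output(ps_text, name, min_age_seconds):
    """Return (pid, elapsed_seconds) for matching processes older than the threshold."""
    stuck = []
    for line in ps_text.strip().splitlines():
        # Expected columns: pid, elapsed seconds, full command line.
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue
        pid, elapsed, command = parts
        if not (pid.isdigit() and elapsed.isdigit()):
            continue
        if name in command and int(elapsed) >= min_age_seconds:
            stuck.append((int(pid), int(elapsed)))
    return stuck

def find_stuck_processes(name="deadlinecommand", min_age_seconds=24 * 3600):
    """Query ps for all processes and filter by name and age (threshold is an assumption)."""
    result = subprocess.run(
        ["ps", "-eo", "pid=,etimes=,args="],
        capture_output=True, text=True, check=True,
    )
    return parse_ps_output(result.stdout, name, min_age_seconds)

if __name__ == "__main__":
    for pid, elapsed in find_stuck_processes():
        print(f"pid {pid} has been running for {elapsed / 86400:.1f} days")
```

Run from cron on the Pulse machine, something like this could raise a flag long before limit stubs start getting orphaned.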
After I stopped the Pulse process, the orphaned limit stub was returned to the pool, and the queued Quicktime jobs were immediately picked up.
I would love to hear some theories on this one.