AWS Thinkbox Discussion Forums

Pulse stuck trying to kill nonexistent processes, blocking slaves

Deadline 7.1.2.1, with Pulse running on Linux

The short version
Pulse got stuck in a bad state for almost 2 weeks and never recovered, and that seems to have caused scheduling to fail on machines running multiple slaves.

The long version
A few days ago, someone reported to me that Quicktime jobs weren’t being picked up in Melbourne. These jobs currently run in Nuke on OSX, so I immediately suspected App Nap, but after checking, I found it was already disabled. I poked around some more and couldn’t find anything incriminating on the systems, and neither restarting the slaves nor using the “Search For Jobs” command had any effect.

This morning, I noticed that one of the Nuke limit stubs was listed as being in use by one of these Quicktime slaves, even though none of them were actually picking up jobs. Other jobs that required Nuke limits (e.g. actual Nuke renders) had been running fine, but those all run on machines with a single slave. I know Pulse is supposed to handle the clean-up of busted limits if it’s running, so I checked on Pulse, and I found that it was repeatedly vomiting out this loop of messages:

Attempting a hard kill of parent process with id 21 because it failed to exit.
Could not kill parent process with id 21 because 'kill' returning non-zero exit code: -1
Attempting a hard kill of parent process with id 18 because it failed to exit.
Could not kill parent process with id 18 because 'kill' returning non-zero exit code: -1

I checked the Pulse machine and found that there were 3 ‘deadlinecommand’ processes that had been stuck for anywhere from 2 to 3 weeks. The messages in the Pulse log started on August 13th, and coincidentally, that was also the last date on any of the Pulse log files on the host machine, even though the process had continued running, and continued spitting out these messages.
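For anyone wanting to run the same check on their own Pulse host, here is a rough Python sketch of how long-running processes could be listed. It is only an illustration: the function names are made up, it assumes a Linux `ps` from procps that supports the `etimes` (elapsed seconds) column, and it assumes the stuck processes show up under the command name `deadlinecommand`, which may not hold if they run under a wrapper or runtime.

```python
import subprocess


def parse_ps(output, name, min_age_seconds):
    """Return [(pid, age_seconds)] for processes called `name` that are
    at least `min_age_seconds` old.

    `output` is text with one process per line: "<pid> <etimes> <comm>".
    """
    stale = []
    for line in output.splitlines():
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        pid, etimes, comm = parts
        if comm.strip() == name and int(etimes) >= min_age_seconds:
            stale.append((int(pid), int(etimes)))
    return stale


def find_stale(name="deadlinecommand", min_age_seconds=24 * 3600):
    """Ask ps for every process and keep the old ones named `name`."""
    # Separate -o options with empty headers suppress the header line.
    out = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "etimes=", "-o", "comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_ps(out, name, min_age_seconds)
```

Calling `find_stale(min_age_seconds=14 * 24 * 3600)` would list anything matching that name that has been alive for two weeks or more.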

After I stopped the Pulse process, the orphaned limit stub was returned to the pool, and the queued Quicktime jobs were immediately picked up.

I would love to hear some theories on this one.

I am not sure what would be stopping Pulse in your example, but in 7.2 Pulse will make 3 attempts at the kill and then move on, so it should no longer get stuck like this.

Here’s my first theory:

From the Linux man page on kill, the kill() function shall fail if:
EINVAL : The value of the sig argument is an invalid or unsupported signal number.
EPERM : The process does not have permission to send the signal to any receiving process.
ESRCH : No process or process group can be found corresponding to that specified by pid.

Without the errno value, I strongly suspect it was the last condition. The process may have exited just before the kill command was executed, so kill failed with ESRCH and returned -1. Pulse then retried, hit the same error, retried again, and so on ad infinitum. As Dwight mentioned, this is addressed in beta build 7.2.0.15, where Pulse makes only three attempts before moving on.
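To make the theory concrete, here is a small Python sketch of a hard kill that tells the error conditions apart and stops after a fixed number of attempts. This is not Deadline's code. The Pulse log suggests it runs the `kill` command and only sees its exit code, whereas this sketch calls `os.kill` directly so the errno is visible. The function name, return values, and the injectable `kill` parameter are all invented for illustration.

```python
import os
import signal
import time


def hard_kill(pid, max_attempts=3, delay=0.1, kill=os.kill):
    """Send SIGKILL to `pid` at most `max_attempts` times.

    Returns one of:
      "gone"    - no such process (ESRCH), so there is nothing left to kill
      "denied"  - not permitted to signal it (EPERM), so retrying won't help
      "gave_up" - every attempt delivered the signal, but the process never
                  disappeared (ESRCH was never seen)
    """
    for _ in range(max_attempts):
        try:
            kill(pid, signal.SIGKILL)
        except ProcessLookupError:  # errno == ESRCH
            return "gone"
        except PermissionError:     # errno == EPERM
            return "denied"
        time.sleep(delay)  # give the process a moment to go away
    return "gave_up"
```

The point is that ESRCH is treated as success rather than as a reason to retry, and the loop is bounded, so a process that has already exited cannot keep the caller busy forever.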

Here’s my second theory: Swamp gas from a weather balloon was trapped in a thermal pocket and reflected the light from Venus. – MIB

Thanks guys. I’ll cross my fingers that your theory is correct, and won’t worry about this until we can get on to 7.2.
