We run a couple of OSX machines in each studio with 2-6 slaves, strictly for doing Quicktime encoding. These slaves get stuck 4-6 times a day in an endless loop:
2015-01-20 23:50:22: Scheduler Thread - Cancelling task...
2015-01-20 23:50:22: Scheduler Thread - Task "0_1-1" could not be found because job with ID xxxxxxxx does not exist
2015-01-20 23:50:46: Scheduler Thread - Cancelling task...
2015-01-20 23:50:46: Scheduler Thread - Task "0_1-1" could not be found because job with ID xxxxxxxx does not exist
This will go on forever, until someone catches it and tries to restart the slave. Then the fun really starts.
Attempting to restart the slave(s) via Remote Control will fail with a timeout error, and the slave process never exits; it must be killed manually.
The tail of the slave log makes it look like the slave’s info thread thinks it has shut down, but it hasn’t. The process continues running, and its status stays as “Rendering…” in the monitor.
The launcher is totally unresponsive to Remote Control commands, and can only be stopped about 50% of the time with “DeadlineLauncher -s”. Otherwise, it’s a manual kill and restart.
I have a feeling that the first suggestion I’m going to get here is “Please upgrade to Deadline 7 and see if you still have these problems,” but has any of this actually been fixed?
I don’t think upgrading to 7 would necessarily help in this specific circumstance; I don’t think we touched much of the code in this section going from 6->7. We’ve seen this issue crop up before, but we generally have a tough time replicating it on our end.
Part of the problem is that there are a couple of ways a Slave could get stuck in that loop; it’s more a symptom of “weird” stuff happening elsewhere than a problem in itself. I talked to Edwin about it, and he says that some clients have seen it happen when OSX temporarily loses network connectivity, or has trouble resolving DNS requests. I’m not totally sold that that would be the sole cause of this, though. Do you know how the tasks got requeued in the first place? Was it user-requeued, or because of Stalled Slave detection? If it was stalled slave detection, there might be some credence to this.
Either way though, we could definitely be handling this situation better when it does happen. I’ll log a ticket to put in some kind of (configurable) timeout on the cancelling process to just kill the thread. We should obviously also make sure that the Slave itself stays responsive when it does get in this state… Might be a bit tougher if we can’t replicate this, though. Does it only stop responding to commands and stuff after you’ve already tried to shut it down? I’m wondering if the Slave had just closed its communication thread already based on a prior attempt, which would at least explain the timeouts/unresponsiveness.
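To give a rough idea of the kind of timeout I’m picturing, it would be something like this (just a Python sketch on my end; the names here are placeholders, not actual Slave internals):

import threading

CANCEL_TIMEOUT_SECONDS = 30   # would come from a configurable setting

def cancel_task_with_timeout(cancel_task, on_timeout):
    # Run the normal cancel logic on its own thread so we can stop waiting on it.
    worker = threading.Thread(target=cancel_task, name="CancelTask")
    worker.daemon = True      # a hung cancel shouldn't keep the process alive
    worker.start()
    worker.join(CANCEL_TIMEOUT_SECONDS)
    if worker.is_alive():
        # The cancel never finished; give up on it and let the caller
        # force-kill the render process / pick up a different task.
        on_timeout()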
Actually, one more question from me: Was the render process (ie, quicktime in this case) still kicking around when the Slave was in this loop? If this happens again, and if the render process is still there, can you try to manually kill it to see if that “wakes up” the Slave?
I don’t know… the fact that the slave behaves this way if “weird stuff” happens still seems like a bit of a problem. I think expecting Deadline to be operating in a vacuum of perfect network connectivity and lack of “weird stuff” from a development standpoint is asking a lot (especially given Deadline’s decentralized architecture), but I understand what you’re getting at. We don’t have network connectivity issues per se, but we do use NFS over UDP for our mounts, and OSX’s NFS support is so broken that I wouldn’t put it past the OS for problems with a mount to break general network connectivity in some other swell way. OSX is a necessary evil, heavy on the evil.
From what I’ve seen, these tend to involve jobs that are getting deleted, not tasks that are getting re-queued. One situation I’ve definitely noticed that may actually cause it consistently is when a job is deleted and then a more or less “equivalent” job is submitted almost immediately afterward (same tasks, title, etc.), so that might be worth looking at. The slave is still updating its status in the database though, so it never gets detected as stalled.
After sending a remote shutdown command, the end of the log looks like this:
2015-01-23 xx:xx:xx: Scheduler Thread - Task "0_1-1" could not be found because job with ID xxxxxxxx does not exist
2015-01-23 xx:xx:xx: Scheduler Thread - Cancelling task...
2015-01-23 xx:xx:xx: Scheduler Thread - Task "0_1-1" could not be found because job with ID xxxxxxxx does not exist
2015-01-23 xx:xx:xx: Scheduler Thread - Cancelling task...
2015-01-23 xx:xx:xx: Info Thread - requesting slave info thread quit.
2015-01-23 xx:xx:xx: Scheduler Thread - Task "0_1-1" could not be found because job with ID xxxxxxxx does not exist
2015-01-23 xx:xx:xx: Scheduler Thread - Cancelling task...
2015-01-23 xx:xx:xx: Info Thread - shutdown complete
Notice there is no “Scheduler Thread - shutdown complete” message, so my guess is the deadlocked scheduler thread is what’s keeping the process alive.
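To illustrate what I mean, here’s a trivial standalone Python example (obviously not Deadline code) of how a single stuck non-daemon thread keeps a process alive even after the main thread has finished:

import threading, time

def scheduler_loop():
    while True:        # stand-in for a scheduler thread stuck in its cancel loop
        time.sleep(1)

t = threading.Thread(target=scheduler_loop, name="SchedulerThread")
t.start()              # non-daemon by default

print("main thread finished")
# This prints, but the process never exits: the interpreter waits for all
# non-daemon threads, so the one stuck thread keeps the whole process alive.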
Gotcha, I might have another go at reproducing this on our macs while deleting Jobs instead of just requeuing them.
Good to know. I was discussing this with Ryan and he mentioned this loop can also happen if the Slave has trouble killing its child processes for whatever reason.
I think we should be able to get a fix for this into 7.1. Ryan was a bit concerned about leaving “orphaned” render processes behind to slowly pile up, but we should be able to track those and try to kill them separately, allowing the Slave to move on to a different Job.
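Conceptually, the orphan tracking would be something along these lines (again, just a rough Python sketch with placeholder names, not the actual Slave implementation):

import os, signal

orphaned_pids = set()   # render processes the Slave gave up on killing inline

def abandon_render_process(pid):
    # Called when killing the child times out; the Slave moves on immediately.
    orphaned_pids.add(pid)

def reap_orphans():
    # Retried periodically in the background so orphans don't pile up forever.
    for pid in list(orphaned_pids):
        try:
            os.kill(pid, signal.SIGKILL)
        except ProcessLookupError:
            orphaned_pids.discard(pid)   # process is actually gone now
        except PermissionError:
            pass                         # leave it for the next pass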
We can’t fix Apple’s evilness, but we should certainly be accounting for it.
One more thing to note: If I send the remote shutdown command to one of these slaves, the process never dies (as described previously), but the slave will eventually be marked as stalled. I obviously have to deal with this manually right now since stalled slave restarting broke in 6.2.1.50, but if it was working as advertised, I could eventually wind up with a large number of stuck processes.
Noted. I suspect this is again because of a partial shutdown: we have a dedicated thread for updating the status in the DB, and if that thread is gone, the Slave will eventually be detected as stalled (which I suppose it is).
I’ll have a look at our forced-shutdown code, because I’m pretty sure we do have stuff that’s supposed to forcibly terminate all threads after 30 seconds… Might not be working properly with Remote Commands.
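The intent of that forced-shutdown code is roughly this (a simplified Python sketch, placeholder names only):

import os, threading

SHUTDOWN_GRACE_SECONDS = 30

def shutdown(graceful_shutdown):
    # If the graceful path hangs (e.g. a deadlocked scheduler thread),
    # the watchdog hard-exits the process after the grace period.
    watchdog = threading.Timer(SHUTDOWN_GRACE_SECONDS, lambda: os._exit(1))
    watchdog.daemon = True
    watchdog.start()

    graceful_shutdown()   # normal path: stop threads, update the DB, etc.
    watchdog.cancel()     # finished in time, so the watchdog never fires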
Also, just to mention it, the stuck launcher issue is still pretty undefined, but I’ll keep an eye out for it and try to gather info on when it happens and get some more concrete symptoms. My initial instinct is that it just happens after the launcher has been running for a day or so, but I’ll try to confirm as much as I can.
Just wanted to modify this slightly. In cases where the slave is stuck trying to shut down, it seems there is actually a child process entry remaining, but it is a zombie.
While the job is running, the process tree looks like this:
deadlineslave
`-- bash
    `-- python
        `-- python
            `-- Nuke
                `-- NukeQuicktimeHelper
When these tasks get stuck, from what I can tell, the cause seems to be a synchronization issue between Nuke and NukeQuicktimeHelper, which talk to each other using IPC (don’t ask…).
When I issue a remote command to stop or restart the slave, the Nuke process is killed, along with both of the python processes, but the bash and NukeQuicktimeHelper process entries remain as zombies.
The bash zombie remains parented to deadlineslave, but NukeQuicktimeHelper is detached and re-parented to launchd.
Once I kill the slave process, the bash zombie process entry is removed.
So, regarding your and Ryan’s concerns about leaving orphaned render processes behind, I think you may just need to add specific handling to the slave’s pre-shutdown checks so the slave can exit if all of its remaining child processes are zombies.
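Here’s a quick psutil-based sketch of the kind of check I mean; I have no idea what the slave actually uses internally, and these names are just mine:

import psutil

def all_children_are_zombies():
    # True when every remaining child process is a zombie, i.e. there is
    # nothing left that's actually running and worth waiting for.
    for child in psutil.Process().children(recursive=True):
        try:
            if child.status() != psutil.STATUS_ZOMBIE:
                return False        # a live child is still doing something
        except psutil.NoSuchProcess:
            continue                # child exited and was reaped; ignore it
    return True

If that check passed during the pre-shutdown wait, the slave could go ahead and exit instead of hanging around forever.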
Just wanted to mention that this is definitely still an issue in 7.0, and is not limited to OSX machines. Also, in the case of stuck Linux slaves, there are no remaining child processes under the slave. The launcher sends the “shutdown” command, the slave acknowledges it and stops updating itself in the DB (i.e. eventually it gets marked as “stalled”), but the slave process never exits.