The Deadline launcher service doesn’t always start or restart when prompted. I’ve seen several different failure cases that can keep installs from completing properly, and overall, one or another of the issues below has affected roughly 20-40% of our farm when attempting to install the new client.
In some cases, it looks like the launcher has received a remote command to shut down, but it’s just sitting there, either spinning on one core or doing nothing. When it’s spinning on one core, it looks like the main thread is trying to signal one of the other threads (I see a ton of tgkill calls when strace-ing it). In either case, the process seems to be stuck in an infinite loop. Recovering from this requires a SIGKILL and then a manual start of the launcher service.
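For anyone who wants to spot the spinning launchers across a farm without strace-ing each one, here is a rough Python sketch. It is not anything from Deadline itself. It assumes Linux with /proc mounted, and the function names, sampling interval, and threshold are made up for illustration. It only catches the pegged-core case; a launcher that is hung but idle will not show up.

```python
import os
import time


def cpu_seconds(pid):
    """Total user + system CPU time used by a process so far (Linux /proc only)."""
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    # The command name (field 2) is wrapped in parentheses and may contain
    # spaces, so split after the last ')' before indexing the other fields.
    fields = stat.rsplit(")", 1)[1].split()
    # utime and stime are fields 14 and 15 in proc(5); fields[0] is field 3.
    utime, stime = int(fields[11]), int(fields[12])
    return (utime + stime) / os.sysconf("SC_CLK_TCK")


def looks_like_spinning(pid, interval_s=2.0, threshold=0.9):
    """True if the process used at least `threshold` of one core over the interval.

    interval_s and threshold are illustrative defaults, not tuned values.
    """
    before = cpu_seconds(pid)
    time.sleep(interval_s)
    used = cpu_seconds(pid) - before
    return used >= threshold * interval_s
```

You would pass it the launcher’s PID, however you normally look that up on your machines, and treat a True result as a hint to go look at the process rather than as proof that it is hung.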
In the other case, the service script is blocked on the deadlinelauncher -shutdownall call, because the launcher is (again) just sitting there doing nothing. Killing the stuck launcher allows the service script to continue, and the service seems to come up fine after that.
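One way to keep a wrapper from blocking forever on that call is to put a time limit on it. This is a minimal Python sketch of the idea, not what the shipped service script does. The function name and the timeout value are my own assumptions; the only Deadline-specific piece is the deadlinelauncher -shutdownall command mentioned above.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return its exit code, or None if it had to be killed.

    subprocess.run() kills the child itself (SIGKILL on POSIX) and reaps it
    when the timeout expires, then raises TimeoutExpired.
    """
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return None


# Hypothetical usage; the 30-second limit is an arbitrary choice:
# result = run_with_timeout(["deadlinelauncher", "-shutdownall"], 30)
# if result is None:
#     ...  # shutdown request hung; fall back to killing the stuck launcher
```

Note that this only kills the hung shutdown request, not the launcher it was talking to. A None result means you still need the SIGKILL-then-start recovery on the stuck launcher process.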
FYI, the deadlinelauncher init.d script has had a major overhaul, which is included in 8.1.5.2. The changelog says:
Re-worked the Linux ‘init.d’ script for the Launcher Service so that it is more consistent at shutting down and starting the Launcher, regardless of whether or not the Launcher has been restarted externally or crashed.
I’ll look at the changes to the init script in 8.1 and try to give it some controlled testing, but I don’t think I want to commit to a production-level rollout until it’s a few releases in. If the script differences prove to be significant and actually handle these failure cases, I’ll probably see if I can back-port them (if necessary) and patch our 8.0 installations. I was really hoping Deadline’s stability and reliability wouldn’t take a step back in 8.0 though, especially after this many releases.
OK, after looking the new script over, I don’t think it’s going to fix any of the issues with deadlinelauncher, deadlineslave, and deadlinecommand hanging and/or failing to exit at various times. Unless more work has been done on the reliability of the actual binaries, I would not consider this resolved in 8.1.