AWS Thinkbox Discussion Forums

Major launcher/slave problems

I’m seeing some major (read: show-stopping) problems with launchers and slaves, which I’ll try to describe below.

I came in this morning to find that two slaves I started yesterday (one of which is running on my machine) had stalled. Neither had picked up any work (there isn’t any), and they were the only existing slaves at the time. Looking at the processes on the machines showed that the launcher was still running, so I figured I would try starting the slaves up again remotely. However, every attempt to do so resulted in a timeout error.

I killed and started the launcher again in the exact same manner it had been run before (forked, no GUI, as the same user), and as defined in the machine’s deadline.ini, it started a slave. However, even though the launcher forked properly, the slave didn’t, so when I exited that shell, it took the slave down. So I tried the remote start again (since the launcher was still running) and got another timeout error.
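For reference, the restart was roughly the following (a sketch of what I ran; the pkill pattern is just shorthand for how I killed the process):

pkill -f deadlinelauncher.exe    # kill the running launcher
deadlinelauncher -nogui &        # restart it forked, no GUI; the slave-at-startup setting in deadline.ini then spawns a slave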

I then tried killing and restarting the launcher (again, forked), but this time with the -noslave flag… and it still started a slave. Again, since this slave wasn’t forked properly, it died as soon as I closed my shell, but the launcher kept running. Again, attempting to remotely launch the slave resulted in a timeout.

Finally, I restarted the launcher one more time, again with the -noslave flag, and this time it didn’t start a slave. I was able to remotely start a slave through the monitor… but after that point, the launcher timed out on every remote operation, and eventually the slave just… went away. Every launcher operation now times out.

This happens on all 3 machines I’ve tried it on so far, all running Fedora 19 and Deadline 6.2.0.32. Our machines have no network restrictions, as they do not have any external access.

Help?

P.S. We haven’t even started using Deadline for anything yet.

Also, when I try to run a remote command, the monitor hangs for 10-20 seconds before actually executing it.

When you get the timeouts, can you check the launcher log on the machine you were trying to connect to, to see whether it makes any mention of receiving the command? You can find the logs in /var/log/Thinkbox/Deadline6.
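For example, something like this should let you follow the newest log while you re-send the command (assuming the log files in that directory end in .log):

ls -t /var/log/Thinkbox/Deadline6/*.log | head -n 1                # newest log file
tail -f "$(ls -t /var/log/Thinkbox/Deadline6/*.log | head -n 1)"   # follow it while re-sending the remote command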

That’s strange that it takes 10-20 seconds before the command gets executed. It seems like the Monitor is having issues connecting to the Launcher on the remote machine. I wonder if it’s taking longer than normal to resolve the remote machine’s name to an IP. In the Slave Settings in the Repository Options, can you try enabling the Use Slave’s IP address for Remote Control option and restart the Monitor to see if that helps at all?

Cheers,
Ryan

There’s nothing in the logs (I should’ve mentioned I’ve been looking through them).

I’ve kept trying to narrow this down, and basically, if I start a forked launcher process, as soon as I disconnect from that machine (if I’m SSH-ed), or close the shell it was launched from (if it was started locally), the launcher goes completely comatose. As long as I don’t do either of those things, the launcher responds to remote commands as expected.
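A quick way to confirm the “comatose” state would be something like this (the host name and port 17060 are taken from the launcher log further down):

pgrep -af deadlinelauncher.exe    # the launcher process is still alive
nc -zv ws-027 17060               # ...but its remote-command port no longer answers (the connection just times out)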

Also, if I start a forked launcher, then start a slave using the remote command, and then disconnect/close the shell, the slave will eventually die for no reason (again, with nothing in the logs).

So right now, the only way to start a launcher is through the service interface, although I can’t stop it without killing it. Not a big deal for now, I guess. However, I’ve noticed that when I start the launcher this way, it sometimes decides to start 2 slaves… and I still don’t know why the launcher seems to ignore the -noslave flag once in a while.

I guess this issue isn’t consistent… sometimes it hangs, sometimes it doesn’t.

Here’s an example of what I’m talking about. After killing a stuck launcher and then running service deadline6launcherservice start, this is the list of deadline-related processes that were started:

qmaster   5263  9.6  0.8 839604 33452 ?        Sl   14:01   0:00 mono --runtime=v4.0 /usr/local/Thinkbox/Deadline6/bin/deadlinelauncher.exe -nogui
qmaster   5349 12.5  1.1 919696 42108 ?        Sl   14:01   0:00 mono --runtime=v4.0 /usr/local/Thinkbox/Deadline6/bin/deadlineslave.exe -nogui
qmaster   5350 14.8  1.4 933256 54516 ?        Sl   14:01   0:01 mono --runtime=v4.0 /usr/local/Thinkbox/Deadline6/bin/deadlineslave.exe -nogui -name 01
qmaster   5354 12.7  1.1 920096 42008 ?        Sl   14:01   0:00 mono --runtime=v4.0 /usr/local/Thinkbox/Deadline6/bin/deadlineslave.exe -nogui -name 02
qmaster   5373 12.5  1.1 919604 43364 ?        Sl   14:01   0:00 mono --runtime=v4.0 /usr/local/Thinkbox/Deadline6/bin/deadlineslave.exe -nogui
qmaster   5376 12.8  1.1 920108 42172 ?        Sl   14:01   0:00 mono --runtime=v4.0 /usr/local/Thinkbox/Deadline6/bin/deadlineslave.exe -nogui -name 01
qmaster   5385 12.8  1.1 920076 43272 ?        Sl   14:01   0:00 mono --runtime=v4.0 /usr/local/Thinkbox/Deadline6/bin/deadlineslave.exe -nogui -name 02

On this machine, I had previously added two new slaves (-01 and -02) and then deleted the default slave (). However, as you can see, starting the launcher doesn’t just result in duplicate slave processes for the slaves I want to start, but it also starts two additional slave processes that I don’t want.

If I then do the following:

  • Stop the slaves remotely (which only stops half of them… the launcher seems to be unaware of the duplicates)
  • Re-delete the slave remotely (via the Monitor)
  • Kill (SIGTERM) the duplicate slaves and the launcher (which, as a reminder, is not stuck at this point… I just don’t have another way to stop it)
  • Re-run the service start command

…it only starts the proper slaves (-01 and -02):

qmaster   6211  0.2  0.8 839664 31736 ?        Sl   14:08   0:01 mono --runtime=v4.0 /usr/local/Thinkbox/Deadline6/bin/deadlinelauncher.exe -nogui
qmaster   6225  0.3  1.6 940780 62756 ?        Sl   14:08   0:01 mono --runtime=v4.0 /usr/local/Thinkbox/Deadline6/bin/deadlineslave.exe -nogui -name 01
qmaster   6230  0.3  1.5 932968 57272 ?        Sl   14:08   0:01 mono --runtime=v4.0 /usr/local/Thinkbox/Deadline6/bin/deadlineslave.exe -nogui -name 02
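For completeness, the kill/restart half of that cleanup as shell commands (the Monitor steps obviously can’t be scripted the same way):

pkill -TERM -f deadlineslave.exe             # SIGTERM whatever slave processes are left (the duplicates)
pkill -TERM -f deadlinelauncher.exe          # SIGTERM the launcher, since there's no other way to stop it
service deadline6launcherservice start       # start the launcher again via the service script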

So this whole “stuck launcher” business seems to cause bad behavior on the next run of the launcher as well.

I think you’re running into three issues here. The first is the issue with not being able to stop the launcher service without killing it. This should already be fixed in our internal build of version 7, as we’ve addressed a few issues in our init script.

The second and third are somewhat related, and were recently brought to our attention (both are already fixed in our internal build of version 7). The second issue is that our slave initialization mutex lock wasn’t working on Linux or OS X. We use this mutex to ensure that only one slave initializes at a time, which lets a starting slave determine whether another slave with the same name is already running. Because the lock wasn’t working, two slaves with the same name could start up at the same time; since they initialize simultaneously, neither can tell that the other is running.
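To illustrate the idea (this is not Deadline’s actual implementation, just the general pattern of a cross-process initialization lock on Linux, expressed with flock):

(
  flock -n 9 || { echo "another slave is already initializing"; exit 1; }
  # ... slave initialization, including the "is another slave with my name running?" check ...
) 9>/tmp/deadline-slave-init.lock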

The third is related to the Restart Slave If It Stalls feature in the Launcher. The multi-slave problem above is likely causing slaves to end up in the stalled state. When the launcher starts up with the Restart Slave If It Stalls option enabled, it will immediately try to relaunch any slave that is currently marked as stalled, which is why a slave still gets launched even when you pass the -noslave argument. We’ve addressed this by having the Launcher wait 5 minutes before it starts trying to relaunch stalled slaves; that way, if a slave is already being launched at startup, it has time to clean up after itself first.

For now, you should be able to mitigate this slave/launcher behaviour by disabling the Restart Slave If It Stalls option for your Launchers. Then when you are able to upgrade to version 7 (beta should be starting in July), these issues should all be fixed.

Sorry for the pains this is causing!

Cheers,
Ryan

Thanks for the information, Ryan. However, I think there are still two outstanding issues: the launcher process hanging when its spawning TTY goes away (even when forked), and the slave dying.

Also, is the current policy on patch releases (none) going to stay the same through the 7.x series and beyond? Issues like this are pretty problematic, and waiting 6 months for the next major beta (7.1, 7.2, etc.) with its own set of new bugs isn’t very attractive…

Right! Can you post the logs from the slaves that died, and from the launcher when it hangs? When the slaves died, were they launched from the launcher while it was running as a service?

We can definitely re-evaluate. This isn’t a hard-set policy, so I’ll bring this up internally. We’ve recently overhauled our internal development pipeline, and it’s set up so that doing patch releases will be much easier in the future.

Cheers,
Ryan

> Also, is the current policy on patch releases (none) going to stay the same through the 7.x series and beyond?

This doesn’t align with our mission to provide remarkable support, so I see this changing. As Ryan hinted at, we’ve invested and are continuing to invest in our dev pipeline. Exciting things are happening there which will affect products like Deadline in a positive way.

cb

Here’s the launcher log:

2014-06-27 10:45:01: BEGIN - ws-027.luma-pictures.com\qmaster
2014-06-27 10:45:01: Deadline Launcher 6.2 [v6.2.0.32 R (2563d5bc8)]
2014-06-27 10:45:01: Launcher Thread - Launcher thread initializing...
2014-06-27 10:45:01: Updating Repository options
2014-06-27 10:45:01: - Remote Administration: enabled
2014-06-27 10:45:01: - Automatic Updates: enabled
2014-06-27 10:45:01: Checking for stalled slaves on this machine
2014-06-27 10:45:01: Launcher Thread - Remote administration is enabled
2014-06-27 10:45:01: Launcher Thread - Launcher thread listening on port 17060
2014-06-27 10:45:06: Updating Repository options
2014-06-27 10:45:06: - Remote Administration: enabled
2014-06-27 10:45:06: - Automatic Updates: enabled
2014-06-27 10:45:06: Checking for stalled slaves on this machine
2014-06-27 10:45:16: ::ffff:100.100.104.82 has connected
2014-06-27 10:45:16: Launcher Thread - Received command: LaunchSlave ws-027
2014-06-27 10:45:16: Local version file: /usr/local/Thinkbox/Deadline6/bin/Version
2014-06-27 10:45:16: Network version file: /Volumes/luma/_globalAssets/production/deadline/repo-la-6/bin/Linux/Version
2014-06-27 10:45:16: Comparing version files...
2014-06-27 10:45:16: Version files match
2014-06-27 10:45:16: Launching Slave: ws-027
2014-06-27 10:45:16: Launcher Thread - Responded with: Success|

And the slave log (with verbose logging enabled):

2014-06-27 11:04:55: BEGIN - ws-027.luma-pictures.com\qmaster
2014-06-27 11:04:55: Deadline Slave 6.2 [v6.2.0.32 R (2563d5bc8)]
2014-06-27 11:04:55: The current PYTHONPATH environment variable will be modified to avoid potential Python conflicts
2014-06-27 11:04:55: Scanning for auto configuration
2014-06-27 11:04:58: Auto Configuration: No auto configuration could be detected, using local configuration
2014-06-27 11:04:58: Connecting to repository
2014-06-27 11:04:58: Info Thread - Created.
2014-06-27 11:04:58: Trying to connect using license server '@sv-lic13.luma.la'...
2014-06-27 11:04:58: License obtained.
2014-06-27 11:04:58: Purging old logs
2014-06-27 11:04:58: Scheduler Thread - Slave initialization complete.
2014-06-27 11:04:58: Scheduler Thread - Performing house cleaning...
2014-06-27 11:04:58: Performing Pending Job Scan...
2014-06-27 11:04:58: Pending Job Scan - Loading pending and active jobs
2014-06-27 11:04:58: Pending Job Scan - Loaded 0 pending and active jobs in 3.570 ms
2014-06-27 11:04:58: Pending Job Scan - Done.
2014-06-27 11:04:58: Performing Stalled Slave Scan...
2014-06-27 11:04:58: Stalled Slave Scan - Loading slave states
2014-06-27 11:04:58: Stalled Slave Scan - Loaded 4 slave states in 1.355 ms
2014-06-27 11:04:58: Stalled Slave Scan - Scanning slave states
2014-06-27 11:04:58: Stalled Slave Scan - Cleaned up 0 stalled slaves in 124.000 μs
2014-06-27 11:04:58: Stalled Slave Scan - Done.
2014-06-27 11:04:58: Scheduler - Performing Job scan on Primary Pools with scheduling order Pool, Priority, First-in First-out
2014-06-27 11:04:58: Scheduler - Job chooser found no jobs.
2014-06-27 11:04:58: Scheduler - Performing Job scan on Secondary Pools with scheduling order Pool, Priority, First-in First-out
2014-06-27 11:04:58: Scheduler - Job chooser found no jobs.
2014-06-27 11:04:59: Scheduler Thread - Seconds before next job scan: 17

As you can see, there isn’t much to go on.

Finally, here are the steps I’m using to reproduce:

  • SSH into a machine, making sure no launcher or slave(s) are running on it.
  • Start a forked launcher using: deadlinelauncher -nogui -noslave &. Confirm that no slave was started; if one was (due to the stalled slave restart issue you mentioned), shut both the launcher and slave down and restart the launcher so it behaves as expected.
  • While still SSH-ed in, start a slave on the target machine remotely via the Monitor. The slave’s output gets piped to my SSH TTY.
  • Close the SSH connection. The launcher process is now completely unresponsive to remote commands (including running deadlinelauncher -shutdown on the same machine).
  • Wait a little while (10-20 seconds?). Eventually, the slave process just dies (monitored via a second SSH TTY).
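Or, as a single shell transcript (ws-027 is the machine from the logs above; the pgrep/watch lines are just how I’m monitoring the processes):

ssh ws-027                                   # 1. connect; make sure nothing Deadline-related is running
pgrep -af 'deadline(launcher|slave)'         #    should print nothing
deadlinelauncher -nogui -noslave &           # 2. start a forked launcher with no slave
# 3. start a slave on ws-027 remotely via the Monitor; its output shows up on this TTY
exit                                         # 4. close the SSH session -> the launcher stops responding
# 5. from a second SSH session, watch the slave die on its own after ~10-20 seconds:
watch -n 2 "pgrep -af deadlineslave"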

Thanks guys, I appreciate your willingness to consider a change here.

Thanks for the steps to reproduce! We’ve logged this as an issue, and we’ll look into it. I’m kind of surprised we’ve never run into this before…

Cheers,
Ryan

More to add to this.

When a launcher is automatically started after the client is installed, it’s also starting 2 slave processes with the same name, so it’s impossible to stop the slaves without manually killing the extra processes on each machine (since the launcher isn’t aware of them).

These machines should be auto-configured to not restart stalled slaves (to try and mitigate the damage from the aforementioned process communication issues), but upon checking the slave logs for both slave processes (one of which shouldn’t even exist), I’m seeing this across the board:

2014-07-07 09:56:53: Scanning for auto configuration
2014-07-07 09:56:53: Unable to pull configuration from Pulse:
2014-07-07 09:56:53: Address already in use
2014-07-07 09:56:53: Auto Configuration: No auto configuration could be detected, using local configuration
2014-07-07 09:56:53: Connecting to repository
Again, this is in the log for both slave processes.
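Something like this should show which process actually has the address in use, by listing the sockets each slave process is holding (assumes lsof is installed):

for pid in $(pgrep -f deadlineslave.exe); do
    echo "== PID $pid =="
    lsof -nP -i -a -p "$pid"    # network sockets held by this slave process
done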

After restarting the slaves, the auto-configuration still fails, but I’ll start a new thread for that.

I’m going to just jump in here at random.

Running applications with that final ampersand doesn’t protect them from being killed when you disconnect from the shell. Here’s a nice explanation and workaround:
superuser.com/a/152727/60845
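In short, the workaround from that link looks something like this (detach the launcher from the TTY so the hangup on disconnect doesn’t reach it):

nohup deadlinelauncher -nogui -noslave > /dev/null 2>&1 &
disown
# or start it in its own session entirely:
setsid deadlinelauncher -nogui -noslave > /dev/null 2>&1 < /dev/null &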

For starting and stopping the launcher, though, you should be using the service command.

Try running this instead:
sudo service deadline6launcherservice start

That way it’s safe from being killed when your SSH session disconnects.

Yes, as I mentioned previously, the service method is the only safe way to start the launcher. However, the service script is broken in that it doesn’t allow the launcher to be stopped symmetrically (separate issue). Once this is fixed, it might also be nice if the service script had an option to also shut down any running slaves before taking the launcher down.

I did find the nohup solution as well, but thanks for the reply. None of this would really be a problem if the launcher wasn’t spawning an extra AWOL slave on (initial) launch and requiring the user to clean up the processes to get things to work as expected.
