
Launcher stops logging, doing anything

I came in this morning to find about 10 stalled slaves on one of our farms. According to the “Last Status Update” timestamp, some of them have been offline for over 9 hours, in spite of having their “RestartStalledSlave” config entry set to true.

I checked out a few of these machines and found a worrisome pattern.

  • The launcher process is still running. Since these are render nodes, it has been running for over a month.
  • The launcher logs stopped 6 days ago. Today is 11/4 (in Melbourne), but the last log on any of the machines is from 10/29.
  • The launcher process is shown as using over 1GB of virtual memory.

Running a remote LaunchSlave command still works to start the slaves again, but this is really not great behavior to see.

This is definitely not ideal behavior. We have never seen a case before where the launcher stops writing logs, and can only assume that whatever is causing the launcher to use 1 GB of memory is also affecting its ability to write logs. When you restart the launcher, does logging start working properly again?

Yes, it does.

However, it gets more interesting: When I restarted the launcher service (sudo service deadline6launcherservice restart), it created a 0000.log file for the current date that only contains the record of the shutdown process, and then started fresh in a 0001.log file for the new process.

The 0000.log file just contains these lines:

2014-11-04 04:44:49: BEGIN - rd-003.luma.mel\qmaster
2014-11-04 04:44:49: ::ffff:10.20.102.3 has connected
2014-11-04 04:44:49: Launcher Thread - Received command: StopLauncher
2014-11-04 04:44:49: Shutting down

I also double-checked the timestamps in and on the last log file before today (from 10/29) to make sure that really was the last time it was modified.
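For anyone wanting to run the same check, it really is just comparing the last timestamp inside the file against the file's mtime; a minimal sketch (the log directory and filename here are assumptions for a default Linux install, so adjust for your setup):

import datetime
import os

# Assumed default Linux log location and filename for Deadline 6 -- adjust to taste.
log_path = "/var/log/Thinkbox/Deadline6/deadlinelauncher-2014-10-29-0000.log"

# Timestamp of the last line written inside the log
# (lines start with "YYYY-MM-DD HH:MM:SS:").
with open(log_path) as f:
    last_line = f.readlines()[-1]
last_entry = datetime.datetime.strptime(last_line[:19], "%Y-%m-%d %H:%M:%S")

# Timestamp on the file itself.
mtime = datetime.datetime.fromtimestamp(os.path.getmtime(log_path))

print("last entry in log :", last_entry)
print("file last modified:", mtime)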

I wonder if the thread that flushes text to the log died or was deadlocked. We do one final synchronous flush when the launcher closes, so that would explain why that last little bit showed up in the log even though nothing was getting printed out prior to it.
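To illustrate what I mean, here's a rough Python sketch of the general pattern (not the launcher's actual code): log calls hand their text to a background thread that does the file writes, and shutdown does one last synchronous write. If that background thread dies or deadlocks, messages quietly pile up and never hit disk, while the shutdown lines still get written directly.

import queue
import threading

class AsyncLogWriter:
    """Illustrative sketch only -- not the launcher's actual implementation."""

    def __init__(self, path):
        self._file = open(path, "a")
        self._queue = queue.Queue()
        self._shutting_down = False
        self._thread = threading.Thread(target=self._flush_loop, daemon=True)
        self._thread.start()

    def write(self, message):
        if self._shutting_down:
            # Final synchronous path used while closing, which is why the
            # StopLauncher/shutdown lines still made it into the log.
            self._file.write(message + "\n")
            self._file.flush()
        else:
            # Normal path: hand the line off to the background flush thread.
            # If that thread has died or deadlocked, the message just sits
            # in the queue and never reaches the file.
            self._queue.put(message)

    def _flush_loop(self):
        while not self._shutting_down:
            try:
                message = self._queue.get(timeout=1.0)
            except queue.Empty:
                continue
            self._file.write(message + "\n")
            self._file.flush()

    def close(self):
        self._shutting_down = True
        self._thread.join(timeout=5.0)
        self._file.flush()
        self._file.close()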

Do you have the log from the last day the launcher was properly writing to its log? I’m curious to see what it was doing before it got messed up.

Thanks!
Ryan

Just normal launcher-y stuff. Here’s the tail:

2014-10-29 22:01:20: Checking for stalled slaves on this machine
2014-10-29 22:06:20: Updating repository options
2014-10-29 22:06:20: - Remote Administration: enabled
2014-10-29 22:06:20: - Automatic Updates: enabled
2014-10-29 22:06:20: Checking for stalled slaves on this machine
2014-10-29 22:11:20: Updating repository options
2014-10-29 22:11:20: - Remote Administration: enabled
2014-10-29 22:11:20: - Automatic Updates: enabled
2014-10-29 22:11:20: Checking for stalled slaves on this machine
2014-10-29 22:16:20: Updating repository options

Thanks for the logs. We’ve made attempts to reproduce this with Deadline 6.2 and 7.0 on Linux and Windows, but so far we haven’t had any luck. We’re going to monitor our Launchers over the next week or two to see how the memory usage changes.

Cheers,
Ryan

OK. Remember, these launchers were running for over a month, on machines that may have run out of available physical memory at certain points.

I’ve got two other machines exhibiting this same behavior whose launchers have only been running for around 8 days, so the running time may not be as much of a factor as I first thought.

Could you grab the logs from every day for these launchers? We can go through them and see if anything stands out. Also, are these launchers using over a GB of memory too?
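If it helps, something along these lines will bundle all the launcher logs on a node into a single zip (the log directory and filename pattern are assumptions for a default Linux install, so adjust them to match your setup):

import glob
import os
import socket
import zipfile

# Assumed default Linux log location and launcher log filename pattern.
log_dir = "/var/log/Thinkbox/Deadline6"

archive = "%s_logs.zip" % socket.gethostname()
with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
    for path in sorted(glob.glob(os.path.join(log_dir, "deadlinelauncher*.log"))):
        zf.write(path, os.path.basename(path))
print("wrote %s" % archive)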

Another twist: The system reported that the deadlinelauncher process had been running since 10/29, but there were no other logs after 10/29 until I told it to shut down.

Here are what I believe are the logs for the launcher process’ last run, including one instance where someone told the slave to cancel a task. I restarted it without checking the memory usage.
rd-041_logs.zip (8.49 KB)

Thanks!

Since this keeps happening, I wanted to follow up to clarify the memory usage aspect a little bit. In my original post, I mentioned that the launcher was using over 1GB of virtual memory. I don’t know that that figure is particularly unusual, since a fresh launcher seems to show a virtual size of about 750 MB.

The resident set size (a.k.a. private memory) is actually slightly more interesting. The RSS of a fresh launcher is around 31 MB, but on the node with the dead launcher that I’m looking at right now, the RSS is only 8.4 MB.
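For what it’s worth, I’m pulling those numbers straight out of /proc, roughly like this (finding the launcher PID is left to pgrep or similar):

import sys

def vm_stats(pid):
    # Read VmSize (virtual) and VmRSS (resident) in kB from /proc/<pid>/status.
    stats = {}
    with open("/proc/%s/status" % pid) as f:
        for line in f:
            if line.startswith(("VmSize:", "VmRSS:")):
                key, value = line.split(":", 1)
                stats[key] = int(value.strip().split()[0])
    return stats["VmSize"], stats["VmRSS"]

if __name__ == "__main__":
    # e.g. python vm_stats.py $(pgrep -f deadlinelauncher)
    vsize_kb, rss_kb = vm_stats(sys.argv[1])
    print("virtual: %.1f MB, resident: %.1f MB" % (vsize_kb / 1024.0, rss_kb / 1024.0))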

Yeah, that matches up with what we’re seeing here in terms of normal memory usage. We’ve had the launcher running for almost a week on a couple of test Linux machines and the memory has remained consistent. We’re testing with Deadline 7 right now though…

Don’t get me wrong… the launcher still has some leaking issues. I’ve got one here right now that’s been running since 10/31, and is using over 2GB of private memory.

It might very well be something that’s already been addressed in Deadline 7, which would explain why we can’t reproduce it there.
