AWS Thinkbox Discussion Forums

Source of launcher "Scheduling" error?

We’re on 8.0.12.4, and today I came across one node (so far) where the slave was running as expected, but the launcher was MIA. Checking the logs, the last entry in the launcher log (2 weeks ago) was:

Launcher Scheduling - Error querying idle time: The process must exit before getting the requested information.

We don’t use power management or slave scheduling, and the slave in question does not have any idle detection overrides, so I’m confused about what that message means. Can anyone shed any light here? If the launcher is prone to unprovoked suicide, I’d like to know why.

Thanks.

😐 Well, that’s odd.

As near as I can tell, it has something to do with the external apps we run for idle time tracking, at least on Mac OS X (I can’t remember offhand how we grab it on Linux).
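
For context, one common way to grab idle time on OS X is to shell out to ioreg and parse the HIDIdleTime value (reported in nanoseconds). I’m not saying that’s literally what the Launcher does, but the general pattern looks something like this rough C# sketch (the ioreg parsing and the nanosecond conversion are just illustrative, not pulled from the Launcher source):

    using System;
    using System.Diagnostics;
    using System.Text.RegularExpressions;

    // Rough sketch only -- not actual Launcher code.
    // Queries idle time on OS X by shelling out to ioreg and parsing HIDIdleTime.
    class IdleTimeSketch
    {
        static void Main()
        {
            var psi = new ProcessStartInfo("ioreg", "-c IOHIDSystem")
            {
                RedirectStandardOutput = true,
                UseShellExecute = false
            };

            using (var proc = Process.Start(psi))
            {
                string output = proc.StandardOutput.ReadToEnd();
                proc.WaitForExit(); // wait before touching proc.ExitCode

                // HIDIdleTime is reported in nanoseconds.
                var match = Regex.Match(output, "\"HIDIdleTime\"\\s*=\\s*(\\d+)");
                if (proc.ExitCode == 0 && match.Success)
                {
                    double idleSeconds = ulong.Parse(match.Groups[1].Value) / 1e9;
                    Console.WriteLine("Idle for {0:F1} seconds", idleSeconds);
                }
            }
        }
    }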

Internally, that error seems to be thrown by Process.ExitCode, and the Internet is full of examples of it happening, but I’m not sure that error on its own could really take down the Launcher.
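
For the curious, the exception itself is easy to reproduce: System.Diagnostics.Process throws it if you read ExitCode while the child process is still running, and the guard is simply to wait (or check HasExited) first. A quick hypothetical snippet, not Launcher code:

    using System;
    using System.Diagnostics;

    // Minimal repro of "The process must exit before getting the requested information."
    class ExitCodeRepro
    {
        static void Main()
        {
            // "sleep" exists on OS X and Linux; any long-running command would do.
            using (var proc = Process.Start("sleep", "30"))
            {
                try
                {
                    Console.WriteLine(proc.ExitCode); // throws: the process is still running
                }
                catch (InvalidOperationException ex)
                {
                    Console.WriteLine("Caught: " + ex.Message);
                }

                // The safe pattern: wait (or at least check HasExited) before reading ExitCode.
                if (proc.WaitForExit(60000) && proc.HasExited)
                    Console.WriteLine("Exit code: " + proc.ExitCode);
            }
        }
    }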

Do you have any other examples of Launchers that aren’t running anymore? It’d be good to have more data points to go on.

Nope, only one occurrence so far. We have an alarm set up to trigger if the launcher dies though, so if it happens again I’ll be able to tell pretty quickly.

Perfect! To be open here, I have seen cases where the Launcher stopped logging for reasons I haven’t been able to pinpoint. My assumption is that a thread died for some reason (so far I’ve only seen this on Mac, if I recall correctly) and none of the other apps noticed. It might be more valuable to check the last modified times of the logs, or whether the logs are being created at all.
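
If you want a quick way to sweep a box for stale logs, something like the following would do it. The /var/log/Thinkbox/Deadline8 path and the deadlinelauncher* file pattern are just what I’d expect from a default Linux install, so adjust them if your setup (or platform) differs:

    using System;
    using System.IO;
    using System.Linq;

    // Quick check: when was a launcher log last written on this box?
    class LauncherLogCheck
    {
        static void Main(string[] args)
        {
            // Default Linux log location -- pass a different directory as the first
            // argument for Mac/Windows or a non-default install.
            string logDir = args.Length > 0 ? args[0] : "/var/log/Thinkbox/Deadline8";

            var newest = Directory.GetFiles(logDir, "deadlinelauncher*")
                                  .Select(f => new FileInfo(f))
                                  .OrderByDescending(f => f.LastWriteTimeUtc)
                                  .FirstOrDefault();

            if (newest == null)
            {
                Console.WriteLine("No launcher logs found in " + logDir);
                return;
            }

            TimeSpan age = DateTime.UtcNow - newest.LastWriteTimeUtc;
            Console.WriteLine("{0} last written {1:F1} hours ago", newest.Name, age.TotalHours);
        }
    }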

If you ever see any get stuck, we’ll have to figure out how to get a memory dump there.

Well, the launcher process was definitely gone… this wasn’t a case of the process just ceasing to log any messages.

OK, this “disappearing launcher” issue has now hit most of our farm over the last week. It appears that the initial error message I posted is not an indicator of this issue, as none of the other machines I’ve looked at have any launcher logs from around when they seem to have died.

In all cases, the slave is still running fine.

This is on CentOS 7.3 now, by the way.

Did ABRT record anything? Info here:
wiki.centos.org/TipsAndTricks/ABRT

It’s been more than a year since I last went deep into debugging Mono crashes, but that would be a good place to start, assuming it’s installed and working.

We do not have ABRT installed on our CentOS image.

Is there anything else that would log crashes? We bumped Mono up a version for 9.0 because of SSL/TLS support, and given how well Xamarin (and their parent Microsoft) has been managing Mono lately, I’d really like to know how the newer Mono handles things like this. The Mono we’re shipping now is 4.6.2, so we could always try transplanting that version onto a few machines to see if it improves things. The same trick of installing it and updating the “mono” symlink in the “bin” folder still works.

Nothing in the Deadline logs, syslog, or dmesg.

The slaves that get hit by this seem to have been running for between 1 and 2 weeks when the launcher disappears, which is easy to confirm since these machines were all reimaged within that window. So this seems like something that should be reproducible in a fairly normal test environment.
