AWS Thinkbox Discussion Forums

Source of launcher "Scheduling" error?

We’re on 8.0.12.4, and today I came across one node (so far) where the slave was running as expected, but the launcher was MIA. Checking the logs, the last entry in the launcher log (2 weeks ago) was:

Launcher Scheduling - Error querying idle time: The process must exit before getting the requested information.

We don’t use power management or slave scheduling, and the slave in question does not have any idle detection overrides, so I’m confused about what that message means. Can anyone shed any light here? If the launcher is prone to unprovoked suicide, I’d like to know why.

Thanks.

😐 Well, that’s odd.

As near as I can tell, it has something to do with the external apps we run for idle time tracking, at least on Mac OS X (I can’t remember offhand how we grab it on Linux).
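
For context, one common way to grab idle time on OS X is to shell out to ioreg and parse the HIDIdleTime value (reported in nanoseconds). I’m not saying that’s literally what the Launcher does, but the general pattern looks something like this rough C# sketch (the ioreg parsing and the nanosecond conversion are just illustrative, not pulled from the Launcher source):

    using System;
    using System.Diagnostics;
    using System.Text.RegularExpressions;

    // Rough sketch only -- not actual Launcher code.
    // Queries idle time on OS X by shelling out to ioreg and parsing HIDIdleTime.
    class IdleTimeSketch
    {
        static void Main()
        {
            var psi = new ProcessStartInfo("ioreg", "-c IOHIDSystem")
            {
                RedirectStandardOutput = true,
                UseShellExecute = false
            };

            using (var proc = Process.Start(psi))
            {
                string output = proc.StandardOutput.ReadToEnd();
                proc.WaitForExit(); // wait before touching proc.ExitCode

                // HIDIdleTime is reported in nanoseconds.
                var match = Regex.Match(output, "\"HIDIdleTime\"\\s*=\\s*(\\d+)");
                if (proc.ExitCode == 0 && match.Success)
                {
                    double idleSeconds = ulong.Parse(match.Groups[1].Value) / 1e9;
                    Console.WriteLine("Idle for {0:F1} seconds", idleSeconds);
                }
            }
        }
    }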

Internally, that error seems to be thrown by Process.ExitCode, and the Internet is full of examples of it happening, but I’m not sure that error on its own could really take down the Launcher.
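
For the curious, the exception itself is easy to reproduce: System.Diagnostics.Process throws it if you read ExitCode while the child process is still running, and the guard is simply to wait (or check HasExited) first. A quick hypothetical snippet, not Launcher code:

    using System;
    using System.Diagnostics;

    // Minimal repro of "The process must exit before getting the requested information."
    class ExitCodeRepro
    {
        static void Main()
        {
            // "sleep" exists on OS X and Linux; any long-running command would do.
            using (var proc = Process.Start("sleep", "30"))
            {
                try
                {
                    Console.WriteLine(proc.ExitCode); // throws: the process is still running
                }
                catch (InvalidOperationException ex)
                {
                    Console.WriteLine("Caught: " + ex.Message);
                }

                // The safe pattern: wait (or at least check HasExited) before reading ExitCode.
                if (proc.WaitForExit(60000) && proc.HasExited)
                    Console.WriteLine("Exit code: " + proc.ExitCode);
            }
        }
    }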

Do you have any other examples of Launchers that aren’t running anymore? It’d be good to have more data points to go on.

Nope, only one occurrence so far. We have an alarm set up to trigger if the launcher dies though, so if it happens again I’ll be able to tell pretty quickly.

Perfect! To be open here, I have seen cases where the Launcher stopped logging for reasons I haven’t been able to pinpoint. My assumption is that a thread died for some reason (so far I’ve only seen this on Mac, if I recall correctly) and none of the other apps noticed. It might be more valuable to check the last modified times of the logs, or whether the logs are being created at all.
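
If you want a quick way to sweep a box for stale logs, something like the following would do it. The /var/log/Thinkbox/Deadline8 path and the deadlinelauncher* file pattern are just what I’d expect from a default Linux install, so adjust them if your setup (or platform) differs:

    using System;
    using System.IO;
    using System.Linq;

    // Quick check: when was a launcher log last written on this box?
    class LauncherLogCheck
    {
        static void Main(string[] args)
        {
            // Default Linux log location -- pass a different directory as the first
            // argument for Mac/Windows or a non-default install.
            string logDir = args.Length > 0 ? args[0] : "/var/log/Thinkbox/Deadline8";

            var newest = Directory.GetFiles(logDir, "deadlinelauncher*")
                                  .Select(f => new FileInfo(f))
                                  .OrderByDescending(f => f.LastWriteTimeUtc)
                                  .FirstOrDefault();

            if (newest == null)
            {
                Console.WriteLine("No launcher logs found in " + logDir);
                return;
            }

            TimeSpan age = DateTime.UtcNow - newest.LastWriteTimeUtc;
            Console.WriteLine("{0} last written {1:F1} hours ago", newest.Name, age.TotalHours);
        }
    }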

If you ever see any get stuck, we’ll have to figure out how to get a memory dump there.

Well, the launcher process was definitely gone… this wasn’t a case of the process just ceasing to log any messages.

OK, this “disappearing launcher” issue has now hit most of our farm over the last week. It appears that the initial error message I posted is not an indicator of this issue, as none of the other machines I’ve looked at have any launcher logs from around when they seem to have died.

In all cases, the slave is still running fine.

This is on CentOS 7.3 now, by the way.

Did ABRT record anything? Info here:
wiki.centos.org/TipsAndTricks/ABRT

It’s been more than a year since I last went deep into debugging Mono crashes, but that would be a good place to start, assuming it’s installed and working.

We do not have ABRT installed on our CentOS image.

Is there anything else that would log crashes? We bumped Mono up a version for 9.0 because of SSL/TLS support, and given how well Xamarin (and their parent Microsoft) has been managing Mono lately, I’d really like to know how the newer Mono handles things like this. The Mono we’re shipping now is 4.6.2, so we could always try transplanting that version onto a few machines to see if it improves things. The same trick of installing it and updating the “mono” symlink in the “bin” folder still works.

Nothing in the Deadline logs, syslog, or dmesg.

The slaves that get hit by this seem to have been running for between 1 and 2 weeks when the launcher disappears, which is easy to confirm since these machines were all reimaged within that window. So this seems like something that should be reproducible in a fairly normal test environment.
