8.0.10.4: OSX slave house cleaning issues

Hey guys,

I’m in the process of deploying 8.0 in order to upgrade us from 7.2. As part of that process, I installed the client on one of our OSX conversion nodes last week and left it over the weekend. The node is running Darwin kernel 14.5.0 (OSX Yosemite, I think), and is configured to run 4 slaves. The repository settings are configured to allow slaves to perform the various house cleaning tasks if Pulse is not running, and there was no Pulse running at that point.

When I came in today, I noticed that all four of the slaves’ status fields said “House Cleaning” at all times, and when I checked their logs, I saw a lot of messages like this:

2016-10-24 08:22:57: Error running pending job scan process: Process timed out after 1800 seconds without an update. (FranticX.Processes.ManagedProcessAbort)
2016-10-24 08:53:09: Error running housecleaning process: Process timed out after 1800 seconds without an update. (FranticX.Processes.ManagedProcessAbort)
2016-10-24 09:23:26: Error running repository repair process: Process timed out after 1800 seconds without an update. (FranticX.Processes.ManagedProcessAbort)
2016-10-24 09:53:36: Error running housecleaning process: Process timed out after 1800 seconds without an update. (FranticX.Processes.ManagedProcessAbort)

I then tried to stop the slaves remotely, and they did shut down, but I still checked the logs on a hunch and found this:

2016-10-24 10:08:49: Info Thread - requesting slave info thread quit.
2016-10-24 10:08:58: Info Thread - shutdown complete
2016-10-24 10:09:20: Info Thread: Stopped
2016-10-24 10:09:20: Scheduler Thread: ShuttingDown / Waiting
2016-10-24 10:09:20: Render Threads:
2016-10-24 10:09:20: Forcing shutdown.
2016-10-24 10:09:20: Exception Details
2016-10-24 10:09:20: Exception -- One or more threads failed to quit successfully.
2016-10-24 10:09:20: Exception.HResult: -2146233088
2016-10-24 10:09:20: Exception.Data: ( )
2016-10-24 10:09:20: Exception.StackTrace:
2016-10-24 10:09:20:   (null)
2016-10-24 10:09:20: Slave - Final cleanup
2016-10-24 10:09:20: Could not query process information because: (System.Threading.ThreadAbortException)
2016-10-24 10:09:20: WARNING: an error occurred while trying to kill the process tree: (System.Threading.ThreadAbortException)
2016-10-24 10:09:20: Error running housecleaning process: (System.Threading.ThreadAbortException)
2016-10-24 10:09:20: Scheduler Thread - shutdown complete
2016-10-24 10:09:29: Slave - Shutdown complete

There were also four orphaned deadlinecommand processes left running on the machine:

mono-sgen /Applications/Thinkbox/Deadline8/Resources/deadlinecommand.exe -RunCommandForRepository Direct /Volumes/luma/_globalAssets/production/deadline/repo-mel-8 -DoPendingJobScan True False True unrecognized
mono-sgen /Applications/Thinkbox/Deadline8/Resources/deadlinecommand.exe -RunCommandForRepository Direct /Volumes/luma/_globalAssets/production/deadline/repo-mel-8 -DoPendingJobScan True False True unrecognized
mono-sgen /Applications/Thinkbox/Deadline8/Resources/deadlinecommand.exe -RunCommandForRepository Direct /Volumes/luma/_globalAssets/production/deadline/repo-mel-8 -DoHouseCleaning True False True
mono-sgen /Applications/Thinkbox/Deadline8/Resources/deadlinecommand.exe -RunCommandForRepository Direct /Volumes/luma/_globalAssets/production/deadline/repo-mel-8 -DoHouseCleaning True False True

After starting up our Pulse instance and then restarting the slaves, they seem to idle as expected without the same errors. In practice, we should always have a Pulse running, and if it ever goes down, it’s far more likely that a Linux slave would decide to perform the house cleaning operations. However, I still think this is worth reporting as there are clearly still some issues present.

I’m on this version here with house cleaning happening fairly happily and regularly:

14.5.0 Darwin Kernel Version 14.5.0: Mon Aug 29 21:14:16 PDT 2016; root:xnu-2782.50.6~1/RELEASE_X86_64 x86_64

Here are my house cleaning settings:

What happens when you run that Deadline Command action manually? My hunch is that that’s where we need to concentrate our efforts.

Here’s the description of those boolean values:

DoHouseCleaning
  Performs house cleaning operations.
    [Check Last Time]        Optional. If true, house cleaning will only be
                             performed if the appropriate amount of time has
                             passed since the last check (true/false). This
                             is always True if a mode is specified.
    [Verbose]                Optional. If logging should be enabled
                             (true/false).
    [Log Repo History]       Optional. If a history entry should be logged
                             (true/false). This is ignored if a mode other
                             than 'All' is specified.
    [Mode]                   Optional. If not specified, all housecleaning
                             operations will be performed. Available modes
                             are: All, CleanupCompletedJobs,
                             DeleteUnsubmittedJobs, PurgeDeletedJobs,
                             PurgeOldAuxFiles, PurgeOldJobReports,
                             CleanupObsoleteSlaves, PurgeOldSlaveReports,
                             PurgeJobMachineLimits, CleanupRepoValidations,
                             PurgeOldStatistics, CleanupDatabase
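
Given that mode list, one way to narrow down which housecleaning step is hanging would be to run each mode on its own with a hard timeout. Here's a rough sketch of that idea in Python (this is just an illustrative wrapper, not a Deadline tool; the trailing boolean arguments follow the help text above, and you'd pass your own mono-sgen/deadlinecommand prefix as base_cmd):

```python
import subprocess

# Modes taken from the DoHouseCleaning help text above.
MODES = [
    "CleanupCompletedJobs", "DeleteUnsubmittedJobs", "PurgeDeletedJobs",
    "PurgeOldAuxFiles", "PurgeOldJobReports", "CleanupObsoleteSlaves",
    "PurgeOldSlaveReports", "PurgeJobMachineLimits",
    "CleanupRepoValidations", "PurgeOldStatistics", "CleanupDatabase",
]

def find_hanging_modes(base_cmd, timeout=600):
    """Run each housecleaning mode separately; return the ones that hang.

    base_cmd is everything up to and including -DoHouseCleaning, e.g.
    ["mono-sgen", ".../deadlinecommand.exe", "-RunCommandForRepository",
     "Direct", "/path/to/repo", "-DoHouseCleaning"].
    The trailing arguments follow the help text: CheckLastTime, Verbose,
    LogRepoHistory, then the mode.
    """
    hung = []
    for mode in MODES:
        try:
            # subprocess.run kills the child if the timeout expires.
            subprocess.run(base_cmd + ["True", "True", "True", mode],
                           timeout=timeout)
        except subprocess.TimeoutExpired:
            hung.append(mode)
    return hung
```

If only one or two modes ever show up as hung, that points at a specific operation rather than the housecleaning machinery as a whole.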

I have the house cleaning settings set to run in a separate process, which is probably the source of the problem. Running the command manually several times “works” (in that I don’t get any error messages), but it does hang and fail to exit sometimes.
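
Incidentally, the orphaned deadlinecommand processes suggest that whatever kills the timed-out housecleaning process only reaps its direct child, not the whole mono process tree (the "error occurred while trying to kill the process tree" warning in the shutdown log points the same way). On POSIX systems the usual defensive technique is to start the child in its own process group and kill the group on timeout. A minimal sketch of that technique (not Deadline's actual code):

```python
import os
import signal
import subprocess

def run_with_group_kill(cmd, timeout):
    """Run cmd; on timeout, SIGKILL its entire process group.

    Returns the exit code, or None if the command was killed.
    """
    # start_new_session puts the child in its own process group, so any
    # helpers it spawns (e.g. further mono-sgen children) die with it.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
        proc.wait()  # reap the killed child
        return None
```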

TL;DR: Can you check for and remove everything but the IP address of your server in the connection.ini file in your Repo’s “settings” folder?

Had this problem with another client actually. Though, their problem was somewhat more vanilla and we’re going to revisit fixing it for 9.0.

There’s an issue with Mono and background threads that affects the MongoDB middleware we use from the 10gen guys. Basically, the middleware makes one background thread per possible server listed in the connection.ini file and tries to connect to it. Those background threads are supposed to exit when the main thread does, but under Mono they don’t. So if there’s one server that takes unusually long to fail, it can hold up everything. For me, the longest was only 20 seconds. For one client, it lasted much longer.
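
That behaviour can be sketched in miniature. Python stands in for the C#/Mono driver here; the point is only that per-server worker threads which don't exit with the main thread gate process shutdown on the slowest connection attempt (hosts and delays below are made up):

```python
import threading
import time

def probe(host, delay, results):
    """Stand-in for the driver's per-server connection attempt."""
    time.sleep(delay)
    results.append(host)

def time_to_exit(threads_exit_with_main):
    """Simulate shutdown with one probe thread per listed server.

    When the probe threads do not exit with the main thread (the Mono
    behaviour described above), shutdown waits on the slowest server.
    """
    results = []
    servers = [("fast-server", 0.01), ("slow-server", 0.5)]
    threads = [
        threading.Thread(target=probe, args=(h, d, results),
                         daemon=threads_exit_with_main)
        for h, d in servers
    ]
    start = time.monotonic()
    for t in threads:
        t.start()
    if not threads_exit_with_main:
        # Non-daemon threads must finish before the process can exit.
        for t in threads:
            t.join()
    return time.monotonic() - start
```

With only one server in connection.ini there is only one probe thread, which is why trimming the file down to a single address can sidestep the hang.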

No dice. The repo config has a single hostname in it for connections, and I replaced that with the server’s IP address and then tried running one of the DeadlineCommand calls to do house cleaning. It hung on the first try, and I killed it after 10 minutes.

Hmm… Well, I’ll do some tests here.

I did notice that you (or Deadline) are likely using the system’s Mono for this command:

mono-sgen /Applications/Thinkbox/Deadline8/Resources/deadlinecommand.exe -RunCommandForRepository Direct /Volumes/luma/_globalAssets/production/deadline/repo-mel-8 -DoHouseCleaning True False True

Is mono in your path? I ended up running this:

Mobile-029:Resources edwin.amsler$ ./Frameworks/Mono.framework/Versions/3.8.0/bin/mono-sgen /Applications/Thinkbox/Deadline8/Resources/deadlinecommand.exe -RunCommandForRepository Direct /Applications/Thinkbox/DeadlineRepository8 -DoHouseCleaning True False True

At this point… I guess, could you run the whole thing through dtruss? That’ll show the kernel calls like strace would. I haven’t yet figured out how to transplant the Mono debugger infrastructure to Mac, or we could trace that program… Actually, does this house cleaning also fail on a Linux machine?

Nah, I just trimmed the head of the command for sanity. There is no other mono installation on these machines.

Honestly, this isn’t really a problem for us, as our Linux Pulse doesn’t seem to have these issues. I just wanted to report this as a Deadline bug.

Aww. Alright. I’m a bit busy these days anyway, but if anyone has this issue in January, I’ll be game to do a really deep dive on it. Thanks for reporting, Nathan.