I’m in the process of deploying 8.0 in order to upgrade us from 7.2. As part of that process, I installed the client on one of our OSX conversion nodes last week and left it over the weekend. The node is running Darwin kernel 14.5.0 (OSX Yosemite, I think), and is configured to run 4 slaves. The repository settings are configured to allow slaves to perform the various house cleaning tasks if Pulse is not running, and there was no Pulse running at that point.
When I came in today, I noticed that all four slaves’ status fields were stuck on “House Cleaning”, and when I checked their logs, I saw a lot of messages like this:
2016-10-24 08:22:57: Error running pending job scan process: Process timed out after 1800 seconds without an update. (FranticX.Processes.ManagedProcessAbort)
2016-10-24 08:53:09: Error running housecleaning process: Process timed out after 1800 seconds without an update. (FranticX.Processes.ManagedProcessAbort)
2016-10-24 09:23:26: Error running repository repair process: Process timed out after 1800 seconds without an update. (FranticX.Processes.ManagedProcessAbort)
2016-10-24 09:53:36: Error running housecleaning process: Process timed out after 1800 seconds without an update. (FranticX.Processes.ManagedProcessAbort)
I then tried to stop the slaves remotely, and they did shut down, but on a hunch I checked the logs anyway and found this:
2016-10-24 10:08:49: Info Thread - requesting slave info thread quit.
2016-10-24 10:08:58: Info Thread - shutdown complete
2016-10-24 10:09:20: Info Thread: Stopped
2016-10-24 10:09:20: Scheduler Thread: ShuttingDown / Waiting
2016-10-24 10:09:20: Render Threads:
2016-10-24 10:09:20: Forcing shutdown.
2016-10-24 10:09:20: Exception Details
2016-10-24 10:09:20: Exception -- One or more threads failed to quit successfully.
2016-10-24 10:09:20: Exception.HResult: -2146233088
2016-10-24 10:09:20: Exception.Data: ( )
2016-10-24 10:09:20: Exception.StackTrace:
2016-10-24 10:09:20: (null)
2016-10-24 10:09:20: Slave - Final cleanup
2016-10-24 10:09:20: Could not query process information because: (System.Threading.ThreadAbortException)
2016-10-24 10:09:20: WARNING: an error occurred while trying to kill the process tree: (System.Threading.ThreadAbortException)
2016-10-24 10:09:20: Error running housecleaning process: (System.Threading.ThreadAbortException)
2016-10-24 10:09:20: Scheduler Thread - shutdown complete
2016-10-24 10:09:29: Slave - Shutdown complete
There were also four orphaned deadlinecommand processes left running on the machine:
mono-sgen /Applications/Thinkbox/Deadline8/Resources/deadlinecommand.exe -RunCommandForRepository Direct /Volumes/luma/_globalAssets/production/deadline/repo-mel-8 -DoPendingJobScan True False True unrecognized
mono-sgen /Applications/Thinkbox/Deadline8/Resources/deadlinecommand.exe -RunCommandForRepository Direct /Volumes/luma/_globalAssets/production/deadline/repo-mel-8 -DoPendingJobScan True False True unrecognized
mono-sgen /Applications/Thinkbox/Deadline8/Resources/deadlinecommand.exe -RunCommandForRepository Direct /Volumes/luma/_globalAssets/production/deadline/repo-mel-8 -DoHouseCleaning True False True
mono-sgen /Applications/Thinkbox/Deadline8/Resources/deadlinecommand.exe -RunCommandForRepository Direct /Volumes/luma/_globalAssets/production/deadline/repo-mel-8 -DoHouseCleaning True False True
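A quick way to spot these lingering helpers is to scan `ps` output for the maintenance flags. As a sketch (the sample is the listing above, trimmed to two lines; `find_cleanup_processes` is just an illustrative name, not anything from Deadline):

```python
# Sample taken from the process listing above; in practice you would capture
# live output with: subprocess.check_output(["ps", "ax", "-o", "command="], text=True)
SAMPLE_PS = """\
mono-sgen /Applications/Thinkbox/Deadline8/Resources/deadlinecommand.exe -RunCommandForRepository Direct /Volumes/luma/_globalAssets/production/deadline/repo-mel-8 -DoPendingJobScan True False True unrecognized
mono-sgen /Applications/Thinkbox/Deadline8/Resources/deadlinecommand.exe -RunCommandForRepository Direct /Volumes/luma/_globalAssets/production/deadline/repo-mel-8 -DoHouseCleaning True False True
"""

def find_cleanup_processes(ps_output):
    """Return command lines that look like Deadline maintenance helpers."""
    markers = ("-DoHouseCleaning", "-DoPendingJobScan")
    return [line for line in ps_output.splitlines()
            if "deadlinecommand" in line and any(m in line for m in markers)]

for cmd in find_cleanup_processes(SAMPLE_PS):
    # Print just the interpreter and the maintenance flag for a compact summary.
    print(cmd.split()[0], [m for m in ("-DoHouseCleaning", "-DoPendingJobScan") if m in cmd][0])
```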
After starting up our Pulse instance and then restarting the slaves, they seem to idle as expected without the same errors. In practice, we should always have a Pulse running, and if it ever goes down, it’s far more likely that a Linux slave would decide to perform the house cleaning operations. However, I still think this is worth reporting as there are clearly still some issues present.
What happens when you run that Deadline Command action manually? My hunch is that this is where we need to concentrate our efforts.
Here’s the description of those boolean values:
DoHouseCleaning
Performs house cleaning operations.
[Check Last Time] Optional. If true, house cleaning will only be
performed if the appropriate amount of time has
passed since the last check (true/false). This
is always True if a mode is specified.
[Verbose] Optional. If logging should be enabled
(true/false).
[Log Repo History] Optional. If a history entry should be logged
(true/false). This is ignored if a mode other
than 'All' is specified.
[Mode] Optional. If not specified, all housecleaning
operations will be performed. Available modes
are: All, CleanupCompletedJobs,
DeleteUnsubmittedJobs, PurgeDeletedJobs,
PurgeOldAuxFiles, PurgeOldJobReports,
CleanupObsoleteSlaves, PurgeOldSlaveReports,
PurgeJobMachineLimits, CleanupRepoValidations,
PurgeOldStatistics, CleanupDatabase
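For reference, the `True False True` in that earlier process listing lines up with the first three positional flags (Check Last Time, Verbose, Log Repo History). A little sketch of that mapping, with the install path borrowed from the listing (`house_cleaning_args` is my name, not Deadline’s; the slaves additionally wrap this in `-RunCommandForRepository`, as shown above):

```python
from typing import Optional

# Path taken from the process listing earlier in the thread; adjust for your install.
DEADLINECOMMAND = "/Applications/Thinkbox/Deadline8/Resources/deadlinecommand.exe"

def house_cleaning_args(check_last_time: bool, verbose: bool,
                        log_repo_history: bool, mode: Optional[str] = None):
    """Build the DoHouseCleaning argv in the documented positional order."""
    argv = [DEADLINECOMMAND, "-DoHouseCleaning",
            str(check_last_time), str(verbose), str(log_repo_history)]
    if mode is not None:
        argv.append(mode)  # e.g. "PurgeDeletedJobs"; omitting it means "All"
    return argv

# Matches the "-DoHouseCleaning True False True" from the orphaned processes above.
print(" ".join(house_cleaning_args(True, False, True)[1:]))  # → -DoHouseCleaning True False True
```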
I have the house cleaning settings set to run in a separate process, which is probably the source of the problem. Running the command manually several times “works” (in that I don’t get any error messages), but it does sometimes hang and fail to exit.
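One way to poke at that hang without waiting on each run is to wrap the call in a timeout; a minimal sketch (the `sleep 5` here is just a stand-in for the real deadlinecommand argv):

```python
import subprocess

def run_with_timeout(argv, timeout_s):
    """Run a command, killing it if it exceeds timeout_s. Returns (returncode, timed_out)."""
    try:
        proc = subprocess.run(argv, timeout=timeout_s)
        return proc.returncode, False
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so nothing is left behind.
        return None, True

# Stand-in for a hung deadlinecommand invocation: the sleep outlives the timeout.
rc, timed_out = run_with_timeout(["sleep", "5"], timeout_s=1)
print("timed out:", timed_out)  # → timed out: True
```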
TL;DR: Can you check for and remove everything but the IP address of your server in the connection.ini file in your Repo’s “settings” folder?
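If it helps, here’s a sketch for flagging the entries that would force a DNS lookup; the server list is pasted in by hand since connection.ini layouts vary between setups, and `non_ip_entries` plus the hostname are illustrative names only:

```python
import ipaddress

def non_ip_entries(servers):
    """Return entries that aren't bare IP addresses (hostnames cost a DNS lookup)."""
    flagged = []
    for entry in servers:
        host = entry.rsplit(":", 1)[0]  # strip an optional :port suffix
        try:
            ipaddress.ip_address(host)
        except ValueError:
            flagged.append(entry)
    return flagged

# e.g. copied out of connection.ini by hand (hostname is hypothetical):
print(non_ip_entries(["10.0.0.5:27017", "mongo-mel.example.com:27017"]))
# → ['mongo-mel.example.com:27017']
```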
We actually had this problem with another client, though their case was somewhat more vanilla, and we’re going to revisit fixing it for 9.0.
There’s an issue with Mono and background threads that affects the MongoDB middleware we use from the 10gen guys. Basically, the middleware spins up one background thread per server listed in the connection.ini file and tries to connect to it. Those background threads are supposed to exit when the main thread does, but under Mono they don’t. So if there’s one server that takes unusually long to fail, it can hold up everything. For me, the longest was only 20 seconds; for one client, it lasted much longer.
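By way of illustration only: Python daemon threads have the semantics that .NET background threads are supposed to have, so the difference in process lifetime is easy to demonstrate. The “foreground” case below behaves the way Mono was (incorrectly) treating those connection threads:

```python
import subprocess, sys, time

# The child process starts one worker thread that sleeps for 2s and then lets
# the main thread fall off the end. With daemon=True the process exits
# immediately; with daemon=False the lingering thread holds the process open.
SCRIPT = """
import sys, threading, time
threading.Thread(target=time.sleep, args=(2,), daemon=(sys.argv[1] == "daemon")).start()
"""

durations = {}
for mode in ("daemon", "foreground"):
    start = time.time()
    subprocess.run([sys.executable, "-c", SCRIPT, mode])
    durations[mode] = time.time() - start
    print("%-10s exited after %.1fs" % (mode, durations[mode]))
```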
No dice. The repo config had a single hostname in it for connections; I replaced that with the server’s IP address and then tried running one of the DeadlineCommand calls to do house cleaning. It hung on the first try, and I killed it after 10 minutes.
At this point… could you run the whole thing through dtruss? That’ll show the kernel calls like strace would. I haven’t yet figured out how to transplant the Mono debugger infrastructure to the Mac, or we could trace that program… Actually, does this house cleaning also fail on a Linux machine?
Aww. Alright. I’m a bit busy these days anyway, but if anyone has this issue in January, I’ll be game to do a really deep dive on it. Thanks for reporting, Nathan.