AWS Thinkbox Discussion Forums

OS X Volume Resolution Issue

Slave:
OSX 10.10.5
Deadline 9.0.2.0

Repository location:
AFP network volume

Our team recently upgraded from 8.0.4.1 to 9.0.2.0 and has found an issue we didn’t have to deal with in the previous version.

Since upgrading to 9.0.2.0 we’ve experienced an issue where a workstation running the slave loses a connection to the shared AFP repository volume and either:
-Does not properly unmount and leaves behind a path on the system drive to a jobs directory that Deadline continues to work away in before registering that it has unmounted
-Or the slave is aware of the unmounted volume and still writes a path to a jobs directory on the system drive as some sort of failsafe for the work of a running task

We have not upgraded any other part of the OS, nor have we made any changes to the network share system since upgrading Deadline. As I mentioned, we were not experiencing this issue with 8.0.4.1.

Resolving the incorrectly written directories ultimately is a relatively easy fix except that the users of these workstations are:
-Not administrative level users and thus cannot remove the incorrect folder themselves
-Typically unaware of what is happening and thus complicate the issue further

As you can imagine, once a directory exists in /Volumes and someone attempts to mount the shared volume again, it results in:
/Volumes/REPO/ - An empty path to a jobs directory filled with empty job string directories
/Volumes/REPO-1/ - The actual share point
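
One way to spot this state from a script is to check which entries under /Volumes are real mount points and which are plain directories left on the system drive. This is only a minimal sketch, assuming Python is available on the workstation; `find_stale_volume_dirs` is a hypothetical helper name and not part of Deadline.

```python
import os


def find_stale_volume_dirs(volumes_root="/Volumes"):
    """Return directories under volumes_root that are NOT mount points.

    A plain directory here (for example a leftover /Volumes/REPO) lives on
    the system drive and will push the real share to /Volumes/REPO-1.
    """
    stale = []
    if not os.path.isdir(volumes_root):
        return stale
    for name in sorted(os.listdir(volumes_root)):
        path = os.path.join(volumes_root, name)
        # Skip symlinks: on macOS the boot volume appears here as a link to "/".
        if os.path.islink(path):
            continue
        if os.path.isdir(path) and not os.path.ismount(path):
            stale.append(path)
    return stale


if __name__ == "__main__":
    for path in find_stale_volume_dirs():
        print("Not a mount point:", path)
```

The script only reports; it deliberately does not delete anything, since removing the wrong one of REPO and REPO-1 is exactly the risk here.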

Frankly, I’m also worried that someone less aware will accidentally delete the contents of the actual repository volume when trying to resolve the issue :unamused:

Typically users will not realize the issue has happened and will mount the shared volume again, which places it at a new, incorrect path, e.g. /Volumes/REPO-1/. This obviously creates conflicts for the repository location on the workstation and often results in the slave being unable to launch at all until it is reinstalled.

In previous versions, if the path could not be found, it would at least allow you to launch the slave and change the repo path to REPO-1, but I can’t even get it to launch once the user does this. Ultimately, I’m less worried about this issue and would rather focus on what new factor is causing the slave to write these directories after the volume is no longer present.

I was able to collect the logs from an instance of when the disconnect appears to have registered with the slave, though I’m not certain this is the actual time of the disconnect from the AFP share:

2017-05-22 18:02:00: Skipping repository repair because it is not required at this time
2017-05-22 18:02:00: Skipping house cleaning because it is not required at this time
2017-05-22 18:03:04: Skipping pending job scan because it is not required at this time
2017-05-22 18:03:04: Skipping repository repair because it is not required at this time
2017-05-22 18:03:04: Skipping house cleaning because it is not required at this time
2017-05-22 18:04:08: Skipping pending job scan because it is not required at this time
2017-05-22 18:04:08: Skipping repository repair because it is not required at this time
2017-05-22 18:04:08: Skipping house cleaning because it is not required at this time
2017-05-22 18:04:37: Info Thread - An error occurred while updating the slave's info: Directory '/Volumes/Graphics/[REDACTED]/DeadlineRepository9/events' not found. (Deadline.Plugins.PluginException)
2017-05-22 18:04:37: Exception Details
2017-05-22 18:04:37: PluginException -- Directory '/Volumes/Graphics/[REDACTED]/DeadlineRepository9/events' not found.
2017-05-22 18:04:37: Exception.Data: ( )
2017-05-22 18:04:37: Exception.TargetSite: Void a(Deadline.Net.DeadlineMessage)
2017-05-22 18:04:37: Exception.Source: deadline
2017-05-22 18:04:37: Exception.HResult: -2146233088
2017-05-22 18:04:37: Exception.StackTrace:
2017-05-22 18:04:37: at Deadline.Events.SandboxedEventManager.a (Deadline.Net.DeadlineMessage A_0) [0x00288] in <992d3db106b7480fa50268b98ded6128>:0
2017-05-22 18:04:37: at Deadline.Events.SandboxedEventManager.OnSlaveInfoUpdated (System.String slaveName, Deadline.Slaves.SlaveInfo slaveInfo, Deadline.Controllers.DataController dataController) [0x00079] in <992d3db106b7480fa50268b98ded6128>:0
2017-05-22 18:04:37: at Deadline.Slaves.Slave.ReportSlaveInfo (System.Boolean collectMachineStats, System.Boolean isShuttingDown) [0x00a2f] in <992d3db106b7480fa50268b98ded6128>:0
2017-05-22 18:04:37: at Deadline.Slaves.SlaveInfoThread.a (System.Boolean firstTime) [0x0003a] in <992d3db106b7480fa50268b98ded6128>:0
2017-05-22 18:04:37: Scheduler Thread - >>> SLAVE LOST CONNECTION TO THE REPOSITORY, SKIPPING TASK DEQUEUING
2017-05-22 18:04:44: Scheduler Thread - >>> SLAVE LOST CONNECTION TO THE REPOSITORY, SKIPPING TASK DEQUEUING
2017-05-22 18:04:49: Scheduler Thread - >>> SLAVE LOST CONNECTION TO THE REPOSITORY, SKIPPING TASK DEQUEUING
2017-05-22 18:04:52: Directory '/Volumes/Graphics/[REDACTED]/DeadlineRepository9/events' not found.
2017-05-22 18:04:56: Scheduler Thread - >>> SLAVE LOST CONNECTION TO THE REPOSITORY, SKIPPING TASK DEQUEUING
2017-05-22 18:05:02: Scheduler Thread - >>> SLAVE LOST CONNECTION TO THE REPOSITORY, SKIPPING TASK DEQUEUING
2017-05-22 18:05:09: Skipping pending job scan because it is not required at this time
2017-05-22 18:05:09: Skipping repository repair because it is not required at this time
2017-05-22 18:05:09: Skipping house cleaning because it is not required at this time
2017-05-22 18:05:09: Scheduler Thread - Unexpected Error Occurred
2017-05-22 18:05:09: >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2017-05-22 18:05:09: Exception Details
2017-05-22 18:05:09: PluginException -- Directory '/Volumes/Graphics/[REDACTED]/DeadlineRepository9/events' not found.
2017-05-22 18:05:09: Exception.Data: ( )
2017-05-22 18:05:09: Exception.TargetSite: Void a(Deadline.Net.DeadlineMessage)
2017-05-22 18:05:09: Exception.Source: deadline
2017-05-22 18:05:09: Exception.HResult: -2146233088
2017-05-22 18:05:09: Exception.StackTrace:
2017-05-22 18:05:09: at Deadline.Events.SandboxedEventManager.a (Deadline.Net.DeadlineMessage A_0) [0x00288] in <992d3db106b7480fa50268b98ded6128>:0
2017-05-22 18:05:09: at Deadline.Events.SandboxedEventManager.OnSlaveIdle (System.String slaveName, Deadline.Controllers.DataController dataController) [0x00071] in <992d3db106b7480fa50268b98ded6128>:0
2017-05-22 18:05:09: at Deadline.Slaves.SlaveSchedulerThread.a (Deadline.Slaves.SlaveStatus slaveStatus) [0x0006a] in <992d3db106b7480fa50268b98ded6128>:0
2017-05-22 18:05:09: <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
2017-05-22 18:05:09: Scheduler Thread - >>> SLAVE LOST CONNECTION TO THE REPOSITORY, SKIPPING TASK DEQUEUING
2017-05-22 18:05:16: Scheduler Thread - >>> SLAVE LOST CONNECTION TO THE REPOSITORY, SKIPPING TASK DEQUEUING
2017-05-22 18:05:21: Scheduler Thread - >>> SLAVE LOST CONNECTION TO THE REPOSITORY, SKIPPING TASK DEQUEUING
2017-05-22 18:05:29: Scheduler Thread - >>> SLAVE LOST CONNECTION TO THE REPOSITORY, SKIPPING TASK DEQUEUING
2017-05-22 18:05:36: Scheduler Thread - >>> SLAVE LOST CONNECTION TO THE REPOSITORY, SKIPPING TASK DEQUEUING
2017-05-22 18:05:43: Scheduler Thread - >>> SLAVE LOST CONNECTION TO THE REPOSITORY, SKIPPING TASK DEQUEUING
2017-05-22 18:05:51: Scheduler Thread - >>> SLAVE LOST CONNECTION TO THE REPOSITORY, SKIPPING TASK DEQUEUING
2017-05-22 18:05:57: Scheduler Thread - >>> SLAVE LOST CONNECTION TO THE REPOSITORY, SKIPPING TASK DEQUEUING

It repeats like that until the next day, when this user returns to their workstation. Before this excerpt, it was essentially 10 minutes of skipping house cleaning, and no jobs for hours before that. So everything looks correct at the point where the slave determines that it has lost its connection to the AFP share. But the directories that were written are job directories, so it’s also possible that the disconnect happened during a job and the slave only fully registered that it had lost the repository later in the day. I’m just not sure of the timeline.
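
To help pin down the timeline, the slave log can be scanned for the first entry that mentions the repository problem. The sketch below assumes the log uses the `YYYY-MM-DD HH:MM:SS: message` format seen in the excerpt, and the marker strings are simply copied from that excerpt; the function names are hypothetical. It only shows when the slave first reported the problem, not when the AFP share actually dropped.

```python
import re
from datetime import datetime

# Timestamp prefix as it appears in the slave log excerpt.
TIMESTAMP = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): ?")

# Marker strings taken from the excerpt; adjust if your log differs.
MARKERS = ("not found", "LOST CONNECTION TO THE REPOSITORY")


def split_entries(text):
    """Split raw log text into (datetime, message) pairs.

    Splitting on the timestamp rather than on newlines means this also
    works when several entries have been pasted onto one line.
    """
    parts = TIMESTAMP.split(text)
    # parts looks like [prefix, ts1, msg1, ts2, msg2, ...]
    entries = []
    for stamp, message in zip(parts[1::2], parts[2::2]):
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        entries.append((when, message.strip()))
    return entries


def first_repository_error(text, markers=MARKERS):
    """Return the first (datetime, message) mentioning a marker, or None."""
    for when, message in split_entries(text):
        if any(marker in message for marker in markers):
            return when, message
    return None
```

Comparing that timestamp with the created times of the stray job directories would show whether they were written before or after the slave noticed the disconnect.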

And the obviously frustrating part…in an effort to quickly resolve the issue, we have deleted the bad job directories from all workstations that this is occurring on and no longer have the modified or created times to compare. Sorry, I’ll attempt to get those moving forward.

But I wanted to first see if there’s any information related to this issue that anyone else is experiencing sooner rather than later.

I bet this is related to the change we made to prevent the Slaves from creating the read/write test files in the Repository. This issue has happened before, but it’s been years since I’ve seen it (maybe Deadline 5 days).

I’ll run this by the core team and see if we can strike a happy medium. The reason for removing it was to help performance for our clients on the 2,000+ node farms. Maybe we can just check if a volume is mounted instead, which would be a local operation.
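
To illustrate what a local mount check could look like, here is a rough sketch in Python. It is not how Deadline implements anything; the helper names are made up for the example, and it assumes the repository lives under /Volumes/<name>/ as it does on OS X.

```python
import os


def volume_root_for(path, volumes_root="/Volumes"):
    """Return the <volumes_root>/<name> directory that path lives under, or None."""
    root = os.path.normpath(volumes_root)
    norm = os.path.normpath(path)
    if not norm.startswith(root + os.sep):
        return None
    name = norm[len(root) + 1:].split(os.sep, 1)[0]
    return os.path.join(root, name)


def repository_is_mounted(repo_path, volumes_root="/Volumes"):
    """True only if the volume holding repo_path is an actual mount point."""
    volume = volume_root_for(repo_path, volumes_root)
    return volume is not None and os.path.ismount(volume)


def safe_makedirs(path, volumes_root="/Volumes"):
    """Create path only when its volume is mounted; otherwise raise OSError.

    This avoids silently creating /Volumes/<name>/... on the system drive
    after the share has gone away.
    """
    if not repository_is_mounted(path, volumes_root):
        raise OSError("Volume not mounted, refusing to create: " + path)
    os.makedirs(path, exist_ok=True)
```

The check is a local stat call on the volume root, so it does not touch the network share the way the old read/write test files did.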

Update:

I should have asked. Can you provide some directory listings for those unmounted shares? If not here, send them to "support@thinkboxsoftware.com" and reference this forum thread.
