How to handle network connection drop for repository

I have a Mac Mini that serves the repository and also acts as one of the nodes. Occasionally it seems to wedge (the GUI fails due to driver misfires, thanks Intel!), which means I have to power-cycle the repository. This machine has been flaky across memory modules, OS revisions and so on. The graphics drivers seem to be rubbish, and I’ve not found a way to restart the entire GUI without killing the machine outright. The other nodes, meanwhile, keep number crunching, but spit out lots of error information such as that below.

When the repository machine comes back up, I can remount the repository under the same path by quickly using ‘rm’ to remove the stale ‘Phil’ entry left lurking in /Volumes and then using ‘connect to file server’. That seems to work fine, and the nodes write out their frames without issue.

I am wondering, though, about the requeue behaviour because it seems that the frames are not marked as completed, but are re-queued in DL. Is this configurable/avoidable, or am I imagining the requeue action? :slight_smile:

Here’s the console output from a node when the repository falls over :

Exception Details
JobDeletedException – job was deleted, jobDirectory: /Volumes/Phil/Applications/DeadlineRepository/jobs/999_050_999_1ae37186/999_050_999_1ae37186.job
JobDeletedException.JobDirectory: /Volumes/Phil/Applications/DeadlineRepository/jobs/999_050_999_1ae37186/999_050_999_1ae37186.job
Exception.Source: deadline
Exception.TargetSite: Boolean RefreshJob(Deadline.Jobs.Job ByRef, Boolean)
Exception.Data: ( )
Exception.StackTrace:
at Deadline.Storage.JobStorage.RefreshJob (Deadline.Jobs.Job& job, Boolean forceRefresh) [0x00000] in :0
at Deadline.Storage.JobStorage.RefreshJob (Deadline.Jobs.Job& job) [0x00000] in :0
at Deadline.Controllers.JobController.RequeueTask (Deadline.Jobs.Job job, Deadline.Jobs.Task task, Boolean refreshJob) [0x00000] in :0
at Deadline.Controllers.DeadlineController.RequeueTask (Deadline.Jobs.Job job, Deadline.Jobs.Task task, Boolean refreshJob) [0x00000] in :0
at Deadline.Slaves.SlaveSchedulerThread.RequeueTask (Deadline.Jobs.Task task) [0x00000] in :0

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Scheduler Thread - can not continue until this operation is completed successfully
Scheduler Thread - waiting 20 seconds before retrying…
Slave - UnauthorizedAccessException: Failed to update slaveInfo: Access to the path “/Volumes/Phil/Applications” is denied. For more information, see software.primefocusworld.com/sof … or_message.
Scheduler Thread - exception occurred while trying to requeue task.

Exception Details
JobDeletedException – job was deleted, jobDirectory: /Volumes/Phil/Applications/DeadlineRepository/jobs/999_050_999_1ae37186/999_050_999_1ae37186.job
JobDeletedException.JobDirectory: /Volumes/Phil/Applications/DeadlineRepository/jobs/999_050_999_1ae37186/999_050_999_1ae37186.job
Exception.Source: deadline
Exception.TargetSite: Boolean RefreshJob(Deadline.Jobs.Job ByRef, Boolean)
Exception.Data: ( )
Exception.StackTrace:
at Deadline.Storage.JobStorage.RefreshJob (Deadline.Jobs.Job& job, Boolean forceRefresh) [0x00000] in :0
at Deadline.Storage.JobStorage.RefreshJob (Deadline.Jobs.Job& job) [0x00000] in :0
at Deadline.Controllers.JobController.RequeueTask (Deadline.Jobs.Job job, Deadline.Jobs.Task task, Boolean refreshJob) [0x00000] in :0
at Deadline.Controllers.DeadlineController.RequeueTask (Deadline.Jobs.Job job, Deadline.Jobs.Task task, Boolean refreshJob) [0x00000] in :0
at Deadline.Slaves.SlaveSchedulerThread.RequeueTask (Deadline.Jobs.Task task) [0x00000] in :0

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Scheduler Thread - can not continue until this operation is completed successfully
Scheduler Thread - waiting 20 seconds before retrying…
Slave - UnauthorizedAccessException: Failed to update slaveInfo: Access to the path “/Volumes/Phil/Applications” is denied. For more information, see software.primefocusworld.com/sof … or_message.
Scheduler Thread - exception occurred while trying to requeue task.

Exception Details
JobDeletedException – job was deleted, jobDirectory: /Volumes/Phil/Applications/DeadlineRepository/jobs/999_050_999_1ae37186/999_050_999_1ae37186.job
JobDeletedException.JobDirectory: /Volumes/Phil/Applications/DeadlineRepository/jobs/999_050_999_1ae37186/999_050_999_1ae37186.job
Exception.Source: deadline
Exception.TargetSite: Boolean RefreshJob(Deadline.Jobs.Job ByRef, Boolean)
Exception.Data: ( )
Exception.StackTrace:
at Deadline.Storage.JobStorage.RefreshJob (Deadline.Jobs.Job& job, Boolean forceRefresh) [0x00000] in :0
at Deadline.Storage.JobStorage.RefreshJob (Deadline.Jobs.Job& job) [0x00000] in :0
at Deadline.Controllers.JobController.RequeueTask (Deadline.Jobs.Job job, Deadline.Jobs.Task task, Boolean refreshJob) [0x00000] in :0
at Deadline.Controllers.DeadlineController.RequeueTask (Deadline.Jobs.Job job, Deadline.Jobs.Task task, Boolean refreshJob) [0x00000] in :0
at Deadline.Slaves.SlaveSchedulerThread.RequeueTask (Deadline.Jobs.Task task) [0x00000] in :0

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
Scheduler Thread - can not continue until this operation is completed successfully
Scheduler Thread - waiting 20 seconds before retrying…
Scheduler Thread - creating requeue report due to lost network connection

Exception Details
UnauthorizedAccessException – Access to the path “/Volumes/Phil/Applications” is denied.
Exception.Source: mscorlib
Exception.TargetSite: System.IO.DirectoryInfo CreateDirectoriesInternal(System.String)
Exception.Data: ( )
Exception.StackTrace:
at System.IO.Directory.CreateDirectoriesInternal (System.String path) [0x00000] in :0
at System.IO.Directory.CreateDirectory (System.String path) [0x00000] in :0
at System.IO.DirectoryInfo.Create () [0x00000] in :0
at (wrapper remoting-invoke-with-check) System.IO.DirectoryInfo:Create ()
at System.IO.Directory.CreateDirectoriesInternal (System.String path) [0x00000] in :0
at System.IO.Directory.CreateDirectory (System.String path) [0x00000] in :0
at System.IO.DirectoryInfo.Create () [0x00000] in :0
at (wrapper remoting-invoke-with-check) System.IO.DirectoryInfo:Create ()
at System.IO.Directory.CreateDirectoriesInternal (System.String path) [0x00000] in :0
at System.IO.Directory.CreateDirectory (System.String path) [0x00000] in :0
at Deadline.Storage.DeadlineStorage.GetRepositoryDateTime (System.String hostNameOrIpAddress, Int32 timeoutMilliseconds) [0x00000] in :0
at Deadline.Storage.Caches.DeadlineStorageCache.GetRepositoryDateTime (System.String hostNameOrIpAddress, Int32 timeoutMilliseconds) [0x00000] in :0
at Deadline.Controllers.RepositoryController.GetRepositoryDateTime () [0x00000] in :0
at Deadline.Controllers.DeadlineController.GetRepositoryDateTime () [0x00000] in :0
at Deadline.Slaves.SlaveSchedulerThread.CheckTaskStatuses () [0x00000] in :0
at Deadline.Slaves.SlaveSchedulerThread.RenderTasks () [0x00000] in :0
at Deadline.Slaves.SlaveSchedulerThread.ThreadMain () [0x00000] in :0

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

You wouldn’t happen to have the full log still, would you? You’re right that these messages indicate that the slave is trying to requeue the task, but further up the log (probably shortly after the repository connection went down), the slave should be stating why it initially thought it should requeue the task. If you can send us this log, we’ll take a look and see if we can figure out what is going on here.

Thanks!

  • Ryan

I don’t, because in the rush to recover things I accidentally cleared the scrollback (Terminal’s clear-scrollback hotkey and Finder’s connect-to-server hotkey are identical, which makes this a frequent irritation).

I can send you what I have. I could also easily reproduce this by dropping the network connection on a running job: it happens 100% of the time when the repository is unreachable for an extended period. This is by no means the first time I’ve seen this behaviour. DL’s response is to try to straddle the two available options :

  1. Keep running and hope the repository comes back up
  2. Kill the local job and either idle or shutdown DL

I’d obviously prefer 1) to avoid losing hours of render time, and DL does indeed seem to do this. It has two issues in its current approach :

  1. The desire to requeue the job - that seems to indicate that DL has chosen option 2) above rather than 1)
  2. It recreates the repository mountpoint when trying to reach the repository. This is unhelpful :

If the repository is at /Volumes/Phil, and the repository vanishes, OS X unmounts the whole shebang. DL then comes along and recreates the Phil folder in /Volumes. This is a bad thing because when the user tries to sort out the problem and reconnect the repository, OS X mounts it at /Volumes/Phil-1, but the desktop icon shows Phil. When DL keeps complaining about the missing repository, the user is puzzled.

The workaround I’ve used up to now is to be very speedy in :

  1. rm-ing the /Volumes/Phil folder
  2. Mounting the fileshare.

If you’re too slow, the problem is recreated in front of your eyes.
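The first step of that workaround can be sketched in Python. This is only an illustration, not part of Deadline: `remove_stale_mountpoint` is a made-up helper name, and it assumes the leftover folder is empty. It removes the path only when it is an ordinary directory rather than a live mount, so a share that is already mounted is never touched.

```python
import os


def remove_stale_mountpoint(path):
    """Remove `path` if it is a plain, empty directory rather than a live mount.

    Hypothetical helper. Returns True if a stale directory was removed,
    False in every other case.
    """
    if not os.path.isdir(path):
        return False  # nothing there, so the share can mount cleanly
    if os.path.ismount(path):
        return False  # a real volume is mounted here, leave it alone
    try:
        os.rmdir(path)  # only succeeds on an empty directory
    except OSError:
        return False  # not empty or no permission: needs a human to look
    return True
```

Calling `remove_stale_mountpoint("/Volumes/Phil")` just before reconnecting would cover step 1. Using `os.rmdir` rather than a recursive delete is deliberate: if the path turns out to contain anything, the function refuses to remove it.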

It would also be enormously helpful if the output could contain the task name for the job. That would assist in recovery.

Thanks for the info. We’ll definitely try to reproduce this, and handle the situation better. The code that handles this situation was originally based on using a Windows share, which explains why it does some weird things like the folder creation on the Mac.

This problem should be fixed for the next Deadline release. We’ve made some changes with regard to when Deadline will create folders in the repository, and this appears at first glance to fix the problem of Deadline recreating the repository root folder (“Phil” in your case).

Cheers,

  • Ryan

Any pre-release build available? This is a bit of a PITA here.

Unfortunately, not yet. However, there will be a beta program for the next version (probably within the next couple of months).

Cheers,

  • Ryan

I wanted to follow up on this to ask if the requeue behaviour has also been reconsidered or made configurable in this situation.

The current behavior is to keep working on the current task until the repository comes back up. That’s why I had asked to see the log from before, because there must have been a reason why Deadline was trying to requeue the task (i.e. an error occurred during rendering). We tested this here before and after making the changes that prevent Deadline from recreating the repository root folder, and it worked as expected in both cases.
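As a rough illustration of "keep working until the repository comes back up", here is a minimal polling sketch in Python. It is not Deadline's code: `wait_for_repository` and its parameters are invented for the example, and the 20-second default only mirrors the retry interval shown in the console output. The probe is read-only, so the loop never creates anything under the mount path.

```python
import os
import time


def wait_for_repository(path, interval=20.0, max_attempts=None,
                        is_available=os.path.ismount, sleep=time.sleep):
    """Poll until `path` is available again, without writing to it.

    Hypothetical helper. Returns the number of probes it took, or None
    if `max_attempts` probes all failed.
    """
    attempts = 0
    while True:
        attempts += 1
        if is_available(path):
            return attempts  # repository is reachable again
        if max_attempts is not None and attempts >= max_attempts:
            return None  # give up and let the caller decide what to do
        sleep(interval)  # wait before the next read-only probe
```

The `is_available` and `sleep` arguments are there so the loop can be exercised without a real network share; by default it uses `os.path.ismount`, which is false for a plain folder that merely has the same name as the share.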

Cheers,

  • Ryan

Here’s output from the console of a recent event like this. The frame was requeued as the repository connection was remade, cancelling the running task and restarting it in the process. :frowning:
deadlinefailure.zip (267 KB)

Hi Phil,

Thanks for the log. I think the issue here is due to the fact that Deadline currently tries to recreate the repository root. Part of the check that Deadline does to determine if a task has been requeued or if the connection was lost is to see if the Deadline root folder still exists. If it does, it just assumes the task was requeued, which is why it drops the task.

So by fixing the repository root recreation, this issue should be resolved in the next release as well.
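To make the interaction concrete, here is a small Python sketch of the kind of check described above. It is a guess at the shape of the logic, not Deadline's implementation: `classify_task_state` and the three result strings are made up for illustration.

```python
import os


def classify_task_state(repository_root, job_file):
    """Decide what a missing job file means, based on the root folder.

    Hypothetical sketch of the check described in this thread.
    """
    if os.path.exists(job_file):
        return "still_assigned"  # job is visible, carry on rendering
    if os.path.isdir(repository_root):
        # Root is there but the job is not: treated as deleted or requeued.
        return "requeued"
    # Root itself is gone: treated as a lost connection, keep the task.
    return "connection_lost"
```

With the share unmounted, `/Volumes/Phil` does not exist and the answer is `"connection_lost"`. As soon as anything recreates an empty `/Volumes/Phil` folder, the same call returns `"requeued"`, which matches the task being dropped.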

Cheers,

  • Ryan