AWS Thinkbox Discussion Forums

Connection Server Error

Hello, I just tried using the external Nuke Submitter for the first time to send a simple test job to a worker on our farm. We have the RCS configured (for the AWS Portal, for which we were able to get infrastructure running), and as far as I can tell it should be working.

The first task hangs on Waiting to Start for a couple of minutes before dumping this to the console.

2024-01-17 09:15:42:  Running script NukeSubmission (/home/sdugaro/Thinkbox/Deadline10/cache/q3UNeMyV11ydp4KDU3ru0w0QmY/scripts/Submission/NukeSubmission.py)
2024-01-17 09:18:44:  Error occurred while reloading network settings: Connection Server error: GET https://192.168.X.XX:4433/db/settings/network?invalidateCache=true returned "One or more errors occurred. (Connection refused (192.168.X.XX:4433))" (System.Net.WebException)
2024-01-17 09:18:44:  Error occurred while updating Worker cache: Connection Server error: POST https://192.168.X.XX:4433/db/slaves/modified?transactionID=65dad542-3fca-4c43-82bc-36d3d4597770 returned "One or more errors occurred. (Connection refused (192.168.X.XX:4433))" (System.Net.WebException)
2024-01-17 09:18:44:  Error occurred while updating licenseForwarder cache: Connection Server error: POST https://192.168.X.XX:4433/db/licenseforwarders/modified returned "One or more errors occurred. (An error occurred while sending the request.)" (System.Net.WebException)
2024-01-17 09:18:44:     at Deadline.StorageDB.Proxy.Utils.ProxyUtils.HandleException(Exception e, NetworkManager manager, String server, Int32 port, String certificatePath)
2024-01-17 09:18:44:     at Deadline.StorageDB.Proxy.ProxyLicenseForwarderStorage.GetModifiedLicenseForwarders(LicenseForwarderInfoSettings[]& modifiedLicenseForwarders, String[]& deletedLicenseForwarderIds, Nullable`1 lastSettingsAutoUpdate, Nullable`1 lastInfoAutoUpdate, Nullable`1 lastDeletionAutoUpdate)
2024-01-17 09:18:44:     at Deadline.StorageDB.LicenseForwarderStorage.a(Object lw)
2024-01-17 09:18:44:  ---------- Inner Stack Trace (System.Net.Sockets.SocketException) ----------
2024-01-17 09:18:44:  Error occurred while updating job cache: Connection Server error: POST https://192.168.X.XX:4433/db/jobs/modified?transactionID=00000000-0000-0000-0000-000000000000&invalidateCache=true returned "One or more errors occurred. (An error occurred while sending the request.)" (System.Net.WebException)
2024-01-17 09:18:44:     at Deadline.StorageDB.Proxy.Utils.ProxyUtils.HandleException(Exception e, NetworkManager manager, String server, Int32 port, String certificatePath)
2024-01-17 09:18:44:     at Deadline.StorageDB.Proxy.ProxyJobStorage.GetModifiedJobs(Job[]& modifiedJobs, String[]& deletedJobIds, Boolean& hasMore, String& transactionID, Nullable`1 batchQueryTime, Nullable`1 deleteQueryTime, Nullable`1 firstBatchFlag)
2024-01-17 09:18:44:     at Deadline.StorageDB.JobStorage.b(Object kp)
2024-01-17 09:18:44:  ---------- Inner Stack Trace (System.Net.Sockets.SocketException) ----------
2024-01-17 09:18:44:  Error occurred while updating RCS cache: Connection Server error: POST https://192.168.X.XX:4433/db/proxyservers/modified returned "One or more errors occurred. (An error occurred while sending the request.)" (System.Net.WebException)
2024-01-17 09:18:44:     at Deadline.StorageDB.Proxy.Utils.ProxyUtils.HandleException(Exception e, NetworkManager manager, String server, Int32 port, String certificatePath)
2024-01-17 09:18:44:     at Deadline.StorageDB.Proxy.ProxyProxyServerStorage.GetModifiedProxyServers(ProxyServerInfoSettings[]& modifiedProxyServers, String[]& deletedProxyServerIds, Nullable`1 lastSettingsAutoUpdate, Nullable`1 lastInfoAutoUpdate, Nullable`1 lastDeletionAutoUpdate)
2024-01-17 09:18:44:     at Deadline.StorageDB.ProxyServerStorage.a(Object mv)
2024-01-17 09:18:44:  ---------- Inner Stack Trace (System.Net.Sockets.SocketException) ----------
2024-01-17 09:18:44:  Error occurred while updating balancer cache: Connection Server error: POST https://192.168.X.XX:4433/db/balancers/modified returned "One or more errors occurred. (Connection refused (192.168.X.XX:4433))" (System.Net.WebException)
2024-01-17 09:18:44:     at Deadline.StorageDB.Proxy.Utils.ProxyUtils.HandleException(Exception e, NetworkManager manager, String server, Int32 port, String certificatePath)
2024-01-17 09:18:44:     at Deadline.StorageDB.Proxy.ProxyBalancerStorage.GetModifiedBalancers(BalancerInfoSettings[]& modifiedBalancers, String[]& deletedBalancerIds, Nullable`1 lastSettingsAutoUpdate, Nullable`1 lastInfoAutoUpdate, Nullable`1 lastDeletionAutoUpdate)
2024-01-17 09:18:44:     at Deadline.StorageDB.BalancerStorage.a(Object ji)
...
2024-01-17 09:20:00:  ERROR: UpdateClient.MaybeSendRequestNow caught an exception: POST https://192.168.X.XX:4433/rcs/v1/update returned "One or more errors occurred. (Connection refused (192.168.X.XX:4433))" (Deadline.Net.Clients.Http.DeadlineHttpRequestException)
2024-01-17 09:20:02:  ERROR: UpdateClient.MaybeSendRequestNow caught an exception: POST https://192.168.X.XX:4433/rcs/v1/update returned "One or more errors occurred. (Connection refused (192.168.X.XX:4433))" 
...
2024-01-17 09:20:31:  The batch request reach the limit of 2 retries
2024-01-17 09:20:38:  The batch request reach the limit of 2 retries
2024-01-17 09:20:38:  Error occurred while reloading network settings: Connection Server error: GET https://192.168.X.XX:4433/db/settings/network?invalidateCache=true returned "One or more errors occurred. (Connection refused (192.168.X.XX:4433))" (System.Net.WebException)
2024-01-17 09:20:38:  The batch request reach the limit of 2 retries
2024-01-17 09:20:44:  Error occurred while updating Deadline AWS Resource Tracker Status label: Connection Server error: GET https://192.168.X.XX:4433/db/dash/dashFleet/health?region=af-south-1 returned "One or more errors occurred. (Connection refused (192.168.X.XX:4433))" (System.Net.WebException)
2024-01-17 09:20:44:     at Deadline.StorageDB.Proxy.Utils.ProxyUtils.HandleException(Exception e, NetworkManager manager, String server, Int32 port, String certificatePath)
2024-01-17 09:20:44:     at Deadline.StorageDB.Proxy.ProxyDashStorage.GetFleetHealthSummary(String awsRegion, String& fleetsHealthReport)
2024-01-17 09:20:44:     at Deadline.StorageDB.DashStorage.UpdateResourceTrackerStatusLabel(Object o)
2024-01-17 09:20:44:  ---------- Inner Stack Trace (System.Net.Sockets.SocketException) ----------
2024-01-17 09:20:44:     at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
2024-01-17 09:20:44:     at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
2024-01-17 09:20:44:     at System.Net.Sockets.Socket.<ConnectAsync>g__WaitForConnectWithCancellation|277_0(AwaitableSocketAsyncEventArgs saea, ValueTask connectTask, CancellationToken cancellationToken)
2024-01-17 09:20:44:     at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
2024-01-17 09:20:44:  The batch request reach the limit of 2 retries

The first task (1001) has no task report.
Subsequent tasks (1002, 1003) pick up, and their task reports complain about a licensing error, which is expected.
I can Modify Job Properties to add the license environment variable, and then I try to resubmit the job.

It then hangs for another two minutes before popping up a resubmit error dialog.

This looks related, as it prints a similar message to the console:

2024-01-17 09:53:20:  An unexpected error occurred while Resubmitting Job:
2024-01-17 09:53:20:  POST https://192.168.X.XX:4433/jobs/65a80b61df00ec740d421340/resubmit returned "One or more errors occurred. (A task was canceled.)" (Deadline.Net.Clients.Http.DeadlineHttpRequestException)
2024-01-17 09:53:20:     at Deadline.Net.Clients.Http.HttpClient.GetResponse(HttpRequestMessage request)
2024-01-17 09:53:20:     at Deadline.Net.Clients.Http.HttpClient.b(HttpRequestMessage blz)
2024-01-17 09:53:20:     at Deadline.Net.Clients.Http.HttpClient.SendRequestForStream(String method, String uri, String contentType, Dictionary`2 headers, HttpContent httpContent)
2024-01-17 09:53:20:     at Deadline.Net.Clients.Http.HttpClient.SendRequest(String method, String uri, String contentType, Dictionary`2 headers, HttpContent httpContent)
2024-01-17 09:53:20:     at Deadline.Net.Clients.Http.HttpClient.Post(String uri, Object body, String contentType, Dictionary`2 headers)
2024-01-17 09:53:20:     at Deadline.Net.Clients.Http.HttpClient.Post[TRequest,TResponse](String uri, TRequest body, String contentType, Dictionary`2 headers)
2024-01-17 09:53:20:     at Deadline.Controllers.RemoteDataController.ResubmitJob(Job job, String jobName, String frameList, Int32 chunkSize, Boolean submitSuspended, Boolean maintenanceJob, Int32 maintenanceJobStartFrame, Int32 maintenanceJobEndFrame)
2024-01-17 09:53:20:     at Deadline.Monitor.WorkItems.ResubmitJobWI.InternalDoWork()
2024-01-17 09:53:20:     at Deadline.Monitor.MonitorWorkItem.DoWork()
2024-01-17 09:53:20:  ---------- Inner Stack Trace (System.Threading.Tasks.TaskCanceledException) ----------
2024-01-17 09:53:20:     at System.Threading.Tasks.Task.GetExceptions(Boolean includeTaskCanceledExceptions)
2024-01-17 09:53:20:     at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
2024-01-17 09:53:20:     at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)
2024-01-17 09:53:20:     at Deadline.Net.Clients.Http.HttpClient.GetResponse(HttpRequestMessage request)
2024-01-17 09:53:20:     at Deadline.Net.Clients.Http.HttpClient.b(HttpRequestMessage blz)
2024-01-17 09:53:20:     at Deadline.Net.Clients.Http.HttpClient.SendRequestForStream(String method, String uri, String contentType, Dictionary`2 headers, HttpContent httpContent)
2024-01-17 09:53:20:     at Deadline.Net.Clients.Http.HttpClient.SendRequest(String method, String uri, String contentType, Dictionary`2 headers, HttpContent httpContent)
2024-01-17 09:53:20:     at Deadline.Net.Clients.Http.HttpClient.Post(String uri, Object body, String contentType, Dictionary`2 headers)
2024-01-17 09:53:20:     at Deadline.Net.Clients.Http.HttpClient.Post[TRequest,TResponse](String uri, TRequest body, String contentType, Dictionary`2 headers)
2024-01-17 09:53:20:     at Deadline.Controllers.RemoteDataController.ResubmitJob(Job job, String jobName, String frameList, Int32 chunkSize, Boolean submitSuspended, Boolean maintenanceJob, Int32 maintenanceJobStartFrame, Int32 maintenanceJobEndFrame)
2024-01-17 09:53:20:     at Deadline.Monitor.WorkItems.ResubmitJobWI.InternalDoWork()
2024-01-17 09:53:20:     at Deadline.Monitor.MonitorWorkItem.DoWork()
2024-01-17 09:53:20:     at Deadline.Monitor.MonitorThreadPool.a.f()
2024-01-17 09:53:20:  --- End of stack trace from previous location ---

Looks like port 4433 isn’t open on the RCS machine, based on this:

Connection refused (192.168.X.XX:4433))" (System.Net.WebException)

Take a look at “Connection Issues: No connection could be made because target machine actively refused it” for troubleshooting steps and an explanation of what’s going on.

thanks,

I did come across this link before…
We did try reinstalling the database already, but we are still getting this issue.
We haven’t submitted much as a result, so it’s doubtful that it’s database corruption.

How do I check if a port is open on Linux (CentOS 7.5)?
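In case it helps anyone else: a plain TCP connect test answers this without extra tools. Here is a minimal Python sketch; the host and port in the comment are placeholders for your own RCS machine:

```python
import socket

def is_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers ConnectionRefusedError and socket.timeout
        return False

# Placeholder address -- substitute your RCS machine's IP:
# print(is_port_open("192.168.4.23", 4433))
```

On the RCS machine itself, `ss -tlnp` will list which processes are listening on which ports; from a client, `nc -zv <rcs-host> 4433` does the same connect test from the shell.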

The Firewall Issues KB article links to an Amazon Midway authentication site for which I don’t have credentials.

The MongoDB service is running; it did not fail to start, and there is no lock file.
In fact, some of the other tasks ran after that long delay on the first task,
but I could not resubmit the job due to whatever the issue was with the first task.

[root@myDBMACHINE init.d]# systemctl status -l Deadline10db
● Deadline10db.service - LSB: Mongo DB NoSQL Database.
   Loaded: loaded (/etc/rc.d/init.d/Deadline10db; bad; vendor preset: disabled)
   Active: active (running) since Wed 2024-01-10 16:40:46 PST; 6 days ago
     Docs: man:systemd-sysv-generator(8)
   CGroup: /system.slice/Deadline10db.service
           └─109792 /opt/Thinkbox/DeadlineDatabase10/mongo/application/bin/mongod --config /opt/Thinkbox/DeadlineDatabase10/mongo/data/config.conf

I moved the MongoDB log file, submitted a new job, and waited…
The Deadline.Net.Sockets and Deadline.StorageDB.Proxy errors began streaming into my Monitor console, and no new /opt/Thinkbox/DeadlineDatabase10/mongo/data/logs/log.txt was created. I also touched a log.txt, but nothing was ever written to it.

The last entries recorded in the log.txt file I moved out of the way read:

{"t":{"$date":"2024-01-17T13:00:37.335-08:00"},"s":"I",  "c":"-",        "id":20883,   "ctx":"conn206072","msg":"Interrupted operation as its client disconnected","attr":{"opId":14658784}}
{"t":{"$date":"2024-01-17T13:00:37.335-08:00"},"s":"I",  "c":"-",        "id":20883,   "ctx":"conn206074","msg":"Interrupted operation as its client disconnected","attr":{"opId":14658785}}
{"t":{"$date":"2024-01-17T13:00:40.068-08:00"},"s":"I",  "c":"STORAGE",  "id":22430,   "ctx":"WTCheckpointThread","msg":"WiredTiger message","attr":{"message":"[1705525240:68404][109792:0x7f5ebcfe3700], WT_SESSION.checkpoint: [WT_VERB_CHECKPOINT_PROGRESS] saving checkpoint snapshot min: 707215, snapshot max: 707215 snapshot count: 0, oldest timestamp: (0, 0) , meta checkpoint timestamp: (0, 0) base write gen: 7"}}
{"t":{"$date":"2024-01-17T13:00:47.966-08:00"},"s":"I",  "c":"NETWORK",  "id":51800,   "ctx":"conn206077","msg":"client metadata","attr":{"remote":"192.168.4.23:53796","client":"conn206077","doc":{"driver":{"name":"mongo-csharp-driver|legacy","version":"2.13.3.0"},"os":{"type":"Linux","name":"Linux 3.10.0-862.el7.x86_64 #1 SMP Fri Apr 20 16:44:24 UTC 2018","architecture":"x86_64","version":"3.10.0-862.el7.x86_64"},"platform":".NET 6.0.14"}}}
{"t":{"$date":"2024-01-17T13:00:47.966-08:00"},"s":"I",  "c":"NETWORK",  "id":51800,   "ctx":"conn206078","msg":"client metadata","attr":{"remote":"192.168.4.23:53798","client":"conn206078","doc":{"driver":{"name":"mongo-csharp-driver|legacy","version":"2.13.3.0"},"os":{"type":"Linux","name":"Linux 3.10.0-862.el7.x86_64 #1 SMP Fri Apr 20 16:44:24 UTC 2018","architecture":"x86_64","version":"3.10.0-862.el7.x86_64"},"platform":".NET 6.0.14"}}}
{"t":{"$date":"2024-01-17T13:00:47.967-08:00"},"s":"I",  "c":"NETWORK",  "id":51800,   "ctx":"conn206079","msg":"client metadata","attr":{"remote":"192.168.4.23:53800","client":"conn206079","doc":{"driver":{"name":"mongo-csharp-driver|legacy","version":"2.13.3.0"},"os":{"type":"Linux","name":"Linux 3.10.0-862.el7.x86_64 #1 SMP Fri Apr 20 16:44:24 UTC 2018","architecture":"x86_64","version":"3.10.0-862.el7.x86_64"},"platform":".NET 6.0.14"}}}
{"t":{"$date":"2024-01-17T13:00:48.000-08:00"},"s":"I",  "c":"NETWORK",  "id":51800,   "ctx":"conn206080","msg":"client metadata","attr":{"remote":"192.168.4.23:53802","client":"conn206080","doc":{"driver":{"name":"mongo-csharp-driver|legacy","version":"2.13.3.0"},"os":{"type":"Linux","name":"Linux 3.10.0-862.el7.x86_64 #1 SMP Fri Apr 20 16:44:24 UTC 2018","architecture":"x86_64","version":"3.10.0-862.el7.x86_64"},"platform":".NET 6.0.14"}}}
{"t":{"$date":"2024-01-17T13:00:48.097-08:00"},"s":"I",  "c":"NETWORK",  "id":51800,   "ctx":"conn206081","msg":"client metadata","attr":{"remote":"192.168.4.23:53804","client":"conn206081","doc":{"driver":{"name":"mongo-csharp-driver|legacy","version":"2.13.3.0"},"os":{"type":"Linux","name":"Linux 3.10.0-862.el7.x86_64 #1 SMP Fri Apr 20 16:44:24 UTC 2018","architecture":"x86_64","version":"3.10.0-862.el7.x86_64"},"platform":".NET 6.0.14"}}}
{"t":{"$date":"2024-01-17T13:00:48.097-08:00"},"s":"I",  "c":"NETWORK",  "id":51800,   "ctx":"conn206082","msg":"client metadata","attr":{"remote":"192.168.4.23:53806","client":"conn206082","doc":{"driver":{"name":"mongo-csharp-driver|legacy","version":"2.13.3.0"},"os":{"type":"Linux","name":"Linux 3.10.0-862.el7.x86_64 #1 SMP Fri Apr 20 16:44:24 UTC 2018","architecture":"x86_64","version":"3.10.0-862.el7.x86_64"},"platform":".NET 6.0.14"}}}
{"t":{"$date":"2024-01-17T13:00:49.182-08:00"},"s":"I",  "c":"-",        "id":20883,   "ctx":"conn206078","msg":"Interrupted operation as its client disconnected","attr":{"opId":14658963}}
{"t":{"$date":"2024-01-17T13:00:49.182-08:00"},"s":"I",  "c":"-",        "id":20883,   "ctx":"conn206080","msg":"Interrupted operation as its client disconnected","attr":{"opId":14658964}}


I’m testing against two headless render farm nodes.

Going into their log files, I can see that render26 succeeds and creates a deadlineslave_renderthread_0-render26-0000.log, which disappeared after the whole job eventually failed.

2024-01-17 13:10:39:  0: Done executing plugin command of type 'Start Job'
2024-01-17 13:10:39:  0: Plugin rendering frame(s): 1003
2024-01-17 13:10:40:  0: Executing plugin command of type 'Render Task'
2024-01-17 13:10:40:  0: INFO: Rendering all enabled write nodes
...
2024-01-17 13:10:40:  0: INFO: Ending Nuke Job
2024-01-17 13:10:40:  0: Done executing plugin command of type 'End Job'
2024-01-17 13:10:42:  Sending kill command to process tree with root process 'deadlinesandbox.exe' with process id 29183
2024-01-17 13:10:44:  Scheduler Thread - Render Thread 0 threw a major error:
2024-01-17 13:10:44:  >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2024-01-17 13:10:44:  Exception Details
2024-01-17 13:10:44:  RenderPluginException -- Error: ManagedProcessAbort : Monitored managed process "Nuke" has exited or been terminated.
2024-01-17 13:10:44:     at Deadline.Plugins.DeadlinePlugin.MonitoredManagedProcessExit(String name)
2024-01-17 13:10:44:     at FranticX.Processes.MonitoredManagedProcesses.VerifyMonitoredManagedProcess(String name)
2024-01-17 13:10:44:     at Deadline.Plugins.DeadlinePlugin.VerifyMonitoredManagedProcess(String name) (Python.Runtime.PythonException)
2024-01-17 13:10:44:    File "/var/tmp/render26/plugins/65a83ea73d3ca8da539aa296/Nuke.py", line 280, in RenderTasks
2024-01-17 13:10:44:      self.WritePython(scriptContents)
2024-01-17 13:10:44:    File "/var/tmp/render26/plugins/65a83ea73d3ca8da539aa296/Nuke.py", line 41, in WritePython
2024-01-17 13:10:44:      self.WaitForProcess()
2024-01-17 13:10:44:    File "/var/tmp/render26/plugins/65a83ea73d3ca8da539aa296/Nuke.py", line 47, in WaitForProcess
2024-01-17 13:10:44:      self.VerifyMonitoredManagedProcess( self.ProcessName )
2024-01-17 13:10:44:     at Python.Runtime.Dispatcher.Dispatch(ArrayList args)
2024-01-17 13:10:44:     at __FranticX_GenericDelegate0Dispatcher.Invoke()
2024-01-17 13:10:44:     at Deadline.Plugins.DeadlinePlugin.RenderTasks()
2024-01-17 13:10:44:     at Deadline.Plugins.DeadlinePlugin.DoRenderTasks()
2024-01-17 13:10:44:     at Deadline.Plugins.PluginWrapper.RenderTasks(Task task, String& outMessage, AbortLevel& abortLevel)
2024-01-17 13:10:44:     at Deadline.Plugins.PluginWrapper.RenderTasks(Task task, String& outMessage, AbortLevel& abortLevel)
2024-01-17 13:10:44:  RenderPluginException.Cause: JobError (2)
2024-01-17 13:10:44:  RenderPluginException.Level: Major (1)
2024-01-17 13:10:44:  RenderPluginException.HasSlaveLog: True
2024-01-17 13:10:44:  RenderPluginException.SlaveLogFileName: /var/log/Thinkbox/Deadline10/deadlineslave_renderthread_0-render26-0000.log
2024-01-17 13:10:44:  Exception.TargetSite: Deadline.Slaves.Messaging.PluginResponseMemento d(Deadline.Net.DeadlineMessage, System.Threading.CancellationToken)
2024-01-17 13:10:44:  Exception.Data: ( )
2024-01-17 13:10:44:  Exception.Source: deadline
2024-01-17 13:10:44:  Exception.HResult: -2146233088
2024-01-17 13:10:44:    Exception.StackTrace:
2024-01-17 13:10:44:     at Deadline.Plugins.SandboxedPlugin.d(DeadlineMessage bgt, CancellationToken bgu
2024-01-17 13:10:44:     at Deadline.Plugins.SandboxedPlugin.RenderTask(Task task, CancellationToken cancellationToken
2024-01-17 13:10:44:     at Deadline.Slaves.SlaveRenderThread.c(TaskLogWriter ajy, CancellationToken ajz)

Maybe this has to do with the sandbox running in Python 3 instead of Python 2? If the environment gets merged, it will probably pull in some undesirable PYTHONPATH entries.
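To check that theory, it may help to dump what the interpreter inside the sandbox actually sees and compare it against a clean shell. A minimal diagnostic sketch, assuming you can run it from the plugin environment or a `nuke -t` session:

```python
import os
import sys

def sandbox_env_report():
    """Collect the interpreter version and the Python search paths the
    current process actually sees, for comparison against a clean shell."""
    return {
        "python": "%d.%d.%d" % sys.version_info[:3],
        "pythonpath": [p for p in os.environ.get("PYTHONPATH", "").split(os.pathsep) if p],
        "sys_path": list(sys.path),
    }

if __name__ == "__main__":
    for key, value in sandbox_env_report().items():
        print(key, "=", value)
```

Any Python 2-only entries showing up in `pythonpath` or `sys_path` under a Python 3 interpreter would support the merged-environment theory.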

All the Job and Task reports specify this, but that comes after the Nuke process is terminated, and it is not the same as what’s in the console.

=======================================================
Error
=======================================================
Error: ManagedProcessAbort : Monitored managed process "Nuke" has exited or been terminated.
   at Deadline.Plugins.DeadlinePlugin.MonitoredManagedProcessExit(String name)
   at FranticX.Processes.MonitoredManagedProcesses.VerifyMonitoredManagedProcess(String name)
   at Deadline.Plugins.DeadlinePlugin.VerifyMonitoredManagedProcess(String name) (Python.Runtime.PythonException)
  File "/var/tmp/render27/plugins/65a83ea73d3ca8da539aa296/Nuke.py", line 280, in RenderTasks
    self.WritePython(scriptContents)
  File "/var/tmp/render27/plugins/65a83ea73d3ca8da539aa296/Nuke.py", line 41, in WritePython
    self.WaitForProcess()
  File "/var/tmp/render27/plugins/65a83ea73d3ca8da539aa296/Nuke.py", line 47, in WaitForProcess
    self.VerifyMonitoredManagedProcess( self.ProcessName )
   at Python.Runtime.Dispatcher.Dispatch(ArrayList args)
   at __FranticX_GenericDelegate0Dispatcher.Invoke()
   at Deadline.Plugins.DeadlinePlugin.RenderTasks()
   at Deadline.Plugins.DeadlinePlugin.DoRenderTasks()
   at Deadline.Plugins.PluginWrapper.RenderTasks(Task task, String& outMessage, AbortLevel& abortLevel)
   at Deadline.Plugins.PluginWrapper.RenderTasks(Task task, String& outMessage, AbortLevel& abortLevel)

=======================================================
Type
=======================================================
RenderPluginException

=======================================================
Stack Trace
=======================================================
   at Deadline.Plugins.SandboxedPlugin.d(DeadlineMessage bgt, CancellationToken bgu)
   at Deadline.Plugins.SandboxedPlugin.RenderTask(Task task, CancellationToken cancellationToken)
   at Deadline.Slaves.SlaveRenderThread.c(TaskLogWriter ajy, CancellationToken ajz)

All the Job and Task reports just point to the same ManagedProcessAbort error.

The log of the first machine to pick up the job (render27) at the time of submission had some interesting information that looks like the socket errors seen in the console (with RCS Verbose Logging enabled in Repository Options | Application Data):

2024-01-17 12:56:05:  WARNING: RCS response timeout with message: POST https://192.168.X.XX:4433/rcs/v1/assigntask returned GatewayTimeout "{"message":"It took too long (>00:01:00s) to acquire lock jobSchedule. This could mean that there are too many workerscompeting for a job or takes a worker too long to pick a task.","code":"requestTimeout"}"
2024-01-17 12:56:05:  [stack trace (maximumDepth=4)]   FranticX.Diagnostics.Trace2.WriteStack    line 0
2024-01-17 12:56:05:     FranticX.Diagnostics.Trace2.Warning    line 0
2024-01-17 12:56:05:     FranticX.Diagnostics.Trace2.Warning    line 0
2024-01-17 12:56:05:     Deadline.Controllers.DataController.DequeueTasks    line 0
2024-01-17 12:56:05:  WARNING: Backing-off exponentially.
2024-01-17 12:56:05:  [stack trace (maximumDepth=4)]   FranticX.Diagnostics.Trace2.WriteStack    line 0
2024-01-17 12:56:05:     FranticX.Diagnostics.Trace2.Warning    line 0
2024-01-17 12:56:05:     FranticX.Diagnostics.Trace2.Warning    line 0
2024-01-17 12:56:05:     Deadline.Controllers.DataController.DequeueTasks    line 0
2024-01-17 12:57:39:  ERROR: Scheduler Thread - Unexpected Error Occurred
2024-01-17 12:57:39:  Scheduler Thread - POST https://192.168.4.23:4433/rcs/v1/assigntask returned "One or more errors occurred. (An error occurred while sending the request.)" (Deadline.Net.Clients.Http.DeadlineHttpRequestException)
2024-01-17 12:57:39:  ERROR: UpdateClient.MaybeSendRequestNow caught an exception: POST https://192.168.4.23:4433/rcs/v1/update returned "One or more errors occurred. (The SSL connection could not be established, see inner exception.)" (Deadline.Net.Clients.Http.DeadlineHttpRequestException)
2024-01-17 12:57:39:  ERROR: Info Thread - An error occurred while updating the Worker's info: Connection Server error: POST https://192.168.4.23:4433/db/slaves/info/save returned "One or more errors occurred. (An error occurred while sending the request.)" (System.Net.WebException)
2024-01-17 12:57:39:  Exception Details
2024-01-17 12:57:39:  SocketException -- Connection reset by peer
2024-01-17 12:57:39:  SocketException.SocketErrorCode: ConnectionReset (10054)
2024-01-17 12:57:39:  SocketException.ErrorCode: 104 (Connection reset by peer)
2024-01-17 12:57:39:  Win32Exception.NativeErrorCode: 104
2024-01-17 12:57:39:  Exception.Data: ( )
2024-01-17 12:57:39:  Exception.HResult: -2147467259
2024-01-17 12:57:39:    Exception.StackTrace:
2024-01-17 12:57:39:     at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.CreateException(SocketError error, Boolean forAsyncThrow
2024-01-17 12:57:39:     at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ReceiveAsync(Socket socket, CancellationToken cancellationToken
2024-01-17 12:57:39:     at System.Net.Sockets.NetworkStream.ReadAsync(Memory`1 buffer, CancellationToken cancellationToken
2024-01-17 12:57:39:     at System.Net.Security.SslStream.EnsureFullTlsFrameAsync[TIOAdapter](TIOAdapter adapter
2024-01-17 12:57:39:     at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine
2024-01-17 12:57:39:     at System.Net.Security.SslStream.EnsureFullTlsFrameAsync[TIOAdapter](TIOAdapter adapter
2024-01-17 12:57:39:     at System.Net.Security.SslStream.ReadAsyncInternal[TIOAdapter](TIOAdapter adapter, Memory`1 buffer
2024-01-17 12:57:39:     at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state
2024-01-17 12:57:39:     at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread
2024-01-17 12:57:39:     at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.InvokeContinuation(Action`1 continuation, Object state, Boolean forceAsync, Boolean requiresExecutionContextFlow
2024-01-17 12:57:39:     at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.OnCompleted(SocketAsyncEventArgs _
2024-01-17 12:57:39:     at System.Net.Sockets.SocketAsyncEngine.System.Threading.IThreadPoolWorkItem.Execute(
2024-01-17 12:57:39:     at System.Threading.ThreadPoolWorkQueue.Dispatch(
2024-01-17 12:57:39:     at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
2024-01-17 12:57:39:  --- End of stack trace from previous location ---
2024-01-17 12:57:39:  IOException -- Unable to read data from the transport connection: Connection reset by peer.
2024-01-17 12:57:39:  Exception.TargetSite: Void Throw()
2024-01-17 12:57:39:  Exception.Data: ( )
2024-01-17 12:57:39:  Exception.Source: System.Private.CoreLib
2024-01-17 12:57:39:  Exception.HResult: -2146232800
2024-01-17 12:57:39:    Exception.StackTrace:
2024-01-17 12:57:39:     at System.Net.Security.SslStream.EnsureFullTlsFrameAsync[TIOAdapter](TIOAdapter adapter
2024-01-17 12:57:39:     at System.Net.Security.SslStream.ReadAsyncInternal[TIOAdapter](TIOAdapter adapter, Memory`1 buffer
2024-01-17 12:57:39:     at System.Net.Http.HttpConnection.<CheckUsabilityOnScavenge>g__ReadAheadWithZeroByteReadAsync|45_0(
2024-01-17 12:57:39:     at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
2024-01-17 12:57:39:  HttpRequestException -- An error occurred while sending the request.
2024-01-17 12:57:39:  Exception.TargetSite: Void MoveNext()
2024-01-17 12:57:39:  Exception.Data: ( )
2024-01-17 12:57:39:  Exception.Source: System.Net.Http
2024-01-17 12:57:39:  Exception.HResult: -2146232800
2024-01-17 12:57:39:    Exception.StackTrace:
2024-01-17 12:57:39:     at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken
2024-01-17 12:57:39:     at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken
2024-01-17 12:57:39:     at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken
2024-01-17 12:57:39:     at System.Net.Http.DecompressionHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken
2024-01-17 12:57:39:     at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
2024-01-17 12:57:39:  AggregateException -- One or more errors occurred. (An error occurred while sending the request.)
2024-01-17 12:57:39:  AggregateException.InnerExceptions: ( System.Net.Http.HttpRequestException: An error occurred while sending the request.
2024-01-17 12:57:39:   ---> System.IO.IOException: Unable to read data from the transport connection: Connection reset by peer.
2024-01-17 12:57:39:   ---> System.Net.Sockets.SocketException (104): Connection reset by peer
2024-01-17 12:57:39:     at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.CreateException(SocketError error, Boolean forAsyncThrow)
2024-01-17 12:57:39:     at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ReceiveAsync(Socket socket, CancellationToken cancellationToken)
2024-01-17 12:57:39:     at System.Net.Sockets.NetworkStream.ReadAsync(Memory`1 buffer, CancellationToken cancellationToken)
2024-01-17 12:57:39:     at System.Net.Security.SslStream.EnsureFullTlsFrameAsync[TIOAdapter](TIOAdapter adapter)
2024-01-17 12:57:39:     at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
2024-01-17 12:57:39:     at System.Net.Security.SslStream.EnsureFullTlsFrameAsync[TIOAdapter](TIOAdapter adapter)
2024-01-17 12:57:39:     at System.Net.Security.SslStream.ReadAsyncInternal[TIOAdapter](TIOAdapter adapter, Memory`1 buffer)
2024-01-17 12:57:39:     at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
2024-01-17 12:57:39:     at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)
2024-01-17 12:57:39:     at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.InvokeContinuation(Action`1 continuation, Object state, Boolean forceAsync, Boolean requiresExecutionContextFlow)

...

2024-01-17 12:57:39:  DeadlineHttpRequestException -- POST https://192.168.4.23:4433/db/slaves/info/save returned "One or more errors occurred. (An error occurred while sending the request.)"
2024-01-17 12:57:39:  DeadlineHttpRequestException.StatusCode:  (0)
2024-01-17 12:57:39:  DeadlineHttpRequestException.RequestHttpMethod: POST
2024-01-17 12:57:39:  DeadlineHttpRequestException.RequestURI: https://192.168.4.23:4433/db/slaves/info/save
2024-01-17 12:57:39:  Exception.TargetSite: System.Net.Http.HttpResponseMessage GetResponse(System.Net.Http.HttpRequestMessage)
2024-01-17 12:57:39:  Exception.Data: ( )
2024-01-17 12:57:39:  Exception.Source: deadline
2024-01-17 12:57:39:  Exception.HResult: -2146233088
2024-01-17 12:57:39:    Exception.StackTrace:
2024-01-17 12:57:39:     at Deadline.Net.Clients.Http.HttpClient.GetResponse(HttpRequestMessage request
2024-01-17 12:57:39:     at Deadline.Net.Clients.Http.HttpClient.b(HttpRequestMessage blz
2024-01-17 12:57:39:     at Deadline.Net.Clients.Http.HttpClient.SendRequestForStream(String method, String uri, String contentType, Dictionary`2 headers, HttpContent httpContent
2024-01-17 12:57:39:     at Deadline.Net.Clients.Http.HttpClient.SendRequest(String method, String uri, String contentType, Dictionary`2 headers, HttpContent httpContent
2024-01-17 12:57:39:     at Deadline.StorageDB.Proxy.Utils.NetworkManager.a[b](RequestData ss
2024-01-17 12:57:39:     at Deadline.StorageDB.Proxy.Utils.NetworkManager.SendRequest[T](RequestData requestData
2024-01-17 12:57:39:     at Deadline.StorageDB.Proxy.ProxySlaveStorage.SaveSlaveInfos(SlaveInfo[] infos)
2024-01-17 12:57:39:  WebException -- Connection Server error: POST https://192.168.XX.XX:4433/db/slaves/info/save returned "One or more errors occurred. (An error occurred while sending the request.)"
2024-01-17 12:57:39:  WebException.Status: UnknownError (16)
2024-01-17 12:57:39:  Exception.TargetSite: Void HandleException(System.Exception, Deadline.StorageDB.Proxy.Utils.NetworkManager, System.String, Int32, System.String)
2024-01-17 12:57:39:  Exception.Data: ( )
2024-01-17 12:57:39:  Exception.Source: deadline
2024-01-17 12:57:39:  Exception.HResult: -2146233088
2024-01-17 12:57:39:    Exception.StackTrace:
2024-01-17 12:57:39:     at Deadline.StorageDB.Proxy.Utils.ProxyUtils.HandleException(Exception e, NetworkManager manager, String server, Int32 port, String certificatePath
2024-01-17 12:57:39:     at Deadline.StorageDB.Proxy.ProxySlaveStorage.SaveSlaveInfos(SlaveInfo[] infos
2024-01-17 12:57:39:     at Deadline.StorageDB.Proxy.ProxySlaveStorage.SaveSlaveInfo(SlaveInfo info
2024-01-17 12:57:39:     at Deadline.Controllers.DataController.SaveSlaveInfo(SlaveInfo workerInfo
2024-01-17 12:57:39:     at Deadline.Controllers.RemoteDataController.SaveSlaveInfo(SlaveInfo workerInfo
2024-01-17 12:57:39:     at Deadline.Slaves.Slave.ReportSlaveInfo(Boolean collectMachineStats, Boolean isShuttingDown
2024-01-17 12:57:39:     at Deadline.Slaves.SlaveInfoThread.e(Boolean ajn)
2024-01-17 12:57:39:  ERROR: UpdateClient.MaybeSendRequestNow caught an exception: POST https://192.168.XX.XX:4433/rcs/v1/update returned "One or more errors occurred. (An error occurred while sending the request.)" (Deadline.Net.Clients.Http.DeadlineHttpRequestException)
2024-01-17 12:57:40:  ERROR: Scheduler Thread - Unexpected Error Occurred
2024-01-17 12:57:40:  Scheduler Thread - Connection Server error: GET https://192.168.XX.XX:4433/db/slaves/status?status=2 returned "One or more errors occurred. (Connection refused (192.168.XX.XX:4433))" (System.Net.WebException)
2024-01-17 12:57:40:  Scheduler Thread - >>> WORKER LOST CONNECTION TO THE REPOSITORY, SKIPPING TASK DEQUEUING
2024-01-17 12:57:41:  ERROR: Scheduler Thread - Unexpected Error Occurred
2024-01-17 12:57:41:  Scheduler Thread - Connection Server error: GET https://192.168.XX.XX:4433/db/dash/dawsSettings?invalidateCache=true returned "One or more errors occurred. (Connection refused (192.168.XX.XX:4433))" (System.Net.WebException)
2024-01-17 12:57:47:  The batch request reach the limit of 2 retries
2024-01-17 12:57:47:  ERROR: Worker - An error occurred while updating the Worker settings: Connection Server error: POST https://192.168.XX.XX:4433/db/slaves/settings?invalidateCache=true&fields=&transactionID=00000000-0000-0000-0000-000000000000 returned "One or more errors occurred. (POST https://192.168.4.23:4433/db/slaves/settings?invalidateCache=true&fields=&transactionID=00000000-0000-0000-0000-000000000000 returned "One or more errors occurred. (Connection refused (192.168.XX.XX:4433))") (POST https://192.168.4.23:4433/db/slaves/settings?invalidateCache=true&fields=&transactionID=00000000-0000-0000-0000-000000000000 returned "One or more errors occurred. (Connection refused (192.168.XX.XX:4433))") (POST https://192.168.4.23:4433/db/slaves/settings?invalidateCache=true&fields=&transactionID=00000000-0000-0000-0000-000000000000 returned "One or more errors occurred. (Connection refused (192.168.4.23:4433))")" (System.Net.WebException)
2024-01-17 12:57:47:  Scheduler Thread - >>> WORKER LOST CONNECTION TO THE REPOSITORY, SKIPPING TASK DEQUEUING
2024-01-17 12:57:49:  ERROR: Scheduler Thread - Unexpected Error Occurred
2024-01-17 12:57:49:  Scheduler Thread - Connection Server error: GET https://192.168.4.23:4433/db/dash/dawsSettings?invalidateCache=true returned "One or more errors occurred. (Connection refused (192.168.XX.XX:4433))" (System.Net.WebException)
2024-01-17 12:57:49:  Scheduler Thread - >>> WORKER LOST CONNECTION TO THE REPOSITORY, SKIPPING TASK DEQUEUING
2024-01-17 12:57:50:  ERROR: Scheduler Thread - Unexpected Error Occurred
2024-01-17 12:57:50:  Scheduler Thread - Connection Server error: GET https://192.168.4.23:4433/db/dash/dawsSettings?invalidateCache=true returned "One or more errors occurred. (Connection refused (192.168.XX.XX:4433))" (System.Net.WebException)
2024-01-17 12:57:54:  The batch request reach the limit of 2 retries


Some connection refused errors… but the tasks on these machines are eventually picked up and run… albeit failing due to licensing errors (fixable), but I can't resubmit with those env updates.
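To separate "RCS is down" from "RCS is intermittently refusing connections", I've been sanity-checking plain TCP reachability of the RCS port from a client machine with a small script (a sketch; the host address below is a placeholder for your Connection Server):

```python
# Minimal TCP reachability check for the RCS port (default 4433).
# Host/port are placeholders -- substitute your Connection Server address.
import socket


def check_port(host, port, timeout=5.0):
    """Return (True, "") if a TCP connection succeeds, else (False, error)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (True, "")
    except OSError as e:
        return (False, str(e))


if __name__ == "__main__":
    ok, err = check_port("192.168.0.1", 4433)  # placeholder RCS address
    print("reachable" if ok else "unreachable: %s" % err)
```

Running this in a loop while a job sits in "Waiting to Start" would show whether the refusals are constant or sporadic.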

So I tried a submission from the integrated Nuke submitter. I cleared out all jobs and the console, and got the same dump to the console:

2024-01-17 14:33:41:  Error occurred while updating Deadline AWS Resource Tracker Status label: Connection Server error: GET https://192.168.XX.XX:4433/db/dash/dashFleet/health?region=us-gov-west-1 returned "One or more errors occurred. (An error occurred while sending the request.)" (System.Net.WebException)
2024-01-17 14:33:41:     at Deadline.StorageDB.Proxy.Utils.ProxyUtils.HandleException(Exception e, NetworkManager manager, String server, Int32 port, String certificatePath)
2024-01-17 14:33:41:     at Deadline.StorageDB.Proxy.ProxyDashStorage.GetFleetHealthSummary(String awsRegion, String& fleetsHealthReport)
2024-01-17 14:33:41:     at Deadline.StorageDB.DashStorage.UpdateResourceTrackerStatusLabel(Object o)
2024-01-17 14:33:41:  ---------- Inner Stack Trace (System.Net.Sockets.SocketException) ----------
2024-01-17 14:33:41:  Error occurred while updating Worker cache: Connection Server error: POST https://192.168.XX.XX:4433/db/slaves/modified?transactionID=6104bce4-dfac-46c3-8106-0b60e3820436 returned "One or more errors occurred. (An error occurred while sending the request.)" (System.Net.WebException)
2024-01-17 14:33:41:  Error occurred while updating balancer cache: Connection Server error: POST https://192.168.XX.XX:4433/db/balancers/modified returned "One or more errors occurred. (An error occurred while sending the request.)" (System.Net.WebException)
2024-01-17 14:33:41:     at Deadline.StorageDB.Proxy.Utils.ProxyUtils.HandleException(Exception e, NetworkManager manager, String server, Int32 port, String certificatePath)
2024-01-17 14:33:41:     at Deadline.StorageDB.Proxy.ProxyBalancerStorage.GetModifiedBalancers(BalancerInfoSettings[]& modifiedBalancers, String[]& deletedBalancerIds, Nullable`1 lastSettingsAutoUpdate, Nullable`1 lastInfoAutoUpdate, Nullable`1 lastDeletionAutoUpdate)
2024-01-17 14:33:41:     at Deadline.StorageDB.BalancerStorage.a(Object ji)
2024-01-17 14:33:41:  ---------- Inner Stack Trace (System.Net.Sockets.SocketException) ----------
2024-01-17 14:33:41:     at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.CreateException(SocketError error, Boolean forAsyncThrow)
2024-01-17 14:33:41:     at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ReceiveAsync(Socket socket, CancellationToken cancellationToken)
2024-01-17 14:33:41:     at System.Net.Sockets.NetworkStream.ReadAsync(Memory`1 buffer, CancellationToken cancellationToken)
2024-01-17 14:33:41:     at System.Net.Security.SslStream.EnsureFullTlsFrameAsync[TIOAdapter](TIOAdapter adapter)
2024-01-17 14:33:41:     at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
2024-01-17 14:33:41:     at System.Net.Security.SslStream.EnsureFullTlsFrameAsync[TIOAdapter](TIOAdapter adapter)
2024-01-17 14:33:41:     at System.Net.Security.SslStream.ReadAsyncInternal[TIOAdapter](TIOAdapter adapter, Memory`1 buffer)
2024-01-17 14:33:41:     at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
2024-01-17 14:33:41:     at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)
2024-01-17 14:33:41:     at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.InvokeContinuation(Action`1 continuation, Object state, Boolean forceAsync, Boolean requiresExecutionContextFlow)
2024-01-17 14:33:41:     at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.OnCompleted(SocketAsyncEventArgs _)
2024-01-17 14:33:41:     at System.Net.Sockets.SocketAsyncEngine.System.Threading.IThreadPoolWorkItem.Execute()
2024-01-17 14:33:41:     at System.Threading.ThreadPoolWorkQueue.Dispatch()
2024-01-17 14:33:41:     at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
2024-01-17 14:33:41:  --- End of stack trace from previous location ---

It seems like some AWS Portal authentication is being attempted even though we are not looking to use it (the AWS infrastructure was taken down).

I have Pulse running on one of my headless render nodes. I checked its logs, and it reports something similar:

2024-01-17 14:33:41:  Pulse - An error occurred while updating pulse settings: Connection Server error: GET https://192.168.XX.XX:4433/db/pulse/settings/render119?invalidateCache=true returned "One or more errors occurred. (The SSL connection could not be established, see inner exception.)" (System.Net.WebException)
2024-01-17 14:33:41:  Info Thread - Could not write pulseInfo -- a recoverable error occurred
2024-01-17 14:33:41:  >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
2024-01-17 14:33:41:  ERROR: GetRepositoryDateTime: Connection Server error: GET https://192.168.XX.XX:4433/db/settings/serverdatetime returned "One or more errors occurred. (The SSL connection could not be established, see inner exception.)" (System.Net.WebException)
2024-01-17 14:33:41:  Exception Details
2024-01-17 14:33:41:  SocketException -- Connection reset by peer
2024-01-17 14:33:41:  SocketException.SocketErrorCode: ConnectionReset (10054)
2024-01-17 14:33:41:  SocketException.ErrorCode: 104 (Connection reset by peer)
2024-01-17 14:33:41:  Win32Exception.NativeErrorCode: 104
2024-01-17 14:33:41:  Exception.Data: ( )
2024-01-17 14:33:41:  Exception.HResult: -2147467259
2024-01-17 14:33:41:    Exception.StackTrace: 
2024-01-17 14:33:41:     at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.CreateException(SocketError error, Boolean forAsyncThrow
2024-01-17 14:33:41:     at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ReceiveAsync(Socket socket, CancellationToken cancellationToken
2024-01-17 14:33:41:     at System.Net.Sockets.NetworkStream.ReadAsync(Memory`1 buffer, CancellationToken cancellationToken
2024-01-17 14:33:41:     at System.Net.Security.SslStream.EnsureFullTlsFrameAsync[TIOAdapter](TIOAdapter adapter
2024-01-17 14:33:41:     at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine
2024-01-17 14:33:41:     at System.Net.Security.SslStream.EnsureFullTlsFrameAsync[TIOAdapter](TIOAdapter adapter
2024-01-17 14:33:41:     at System.Net.Security.SslStream.ReadAsyncInternal[TIOAdapter](TIOAdapter adapter, Memory`1 buffer
2024-01-17 14:33:41:     at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state
2024-01-17 14:33:41:     at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread
2024-01-17 14:33:41:     at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.InvokeContinuation(Action`1 continuation, Object state, Boolean forceAsync, Boolean requiresExecutionContextFlow
2024-01-17 14:33:41:     at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.OnCompleted(SocketAsyncEventArgs _
2024-01-17 14:33:41:     at System.Net.Sockets.SocketAsyncEngine.System.Threading.IThreadPoolWorkItem.Execute(
2024-01-17 14:33:41:     at System.Threading.ThreadPoolWorkQueue.Dispatch(
2024-01-17 14:33:41:     at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
2024-01-17 14:33:41:  --- End of stack trace from previous location ---
2024-01-17 14:33:41:  IOException -- Unable to read data from the transport connection: Connection reset by peer.
2024-01-17 14:33:41:  Exception.TargetSite: Void Throw()
2024-01-17 14:33:41:  Exception.Data: ( )
2024-01-17 14:33:41:  Exception.Source: System.Private.CoreLib
2024-01-17 14:33:41:  Exception.HResult: -2146232800
2024-01-17 14:33:41:    Exception.StackTrace: 
2024-01-17 14:33:41:     at System.Net.Security.SslStream.EnsureFullTlsFrameAsync[TIOAdapter](TIOAdapter adapter
2024-01-17 14:33:41:     at System.Net.Security.SslStream.ReadAsyncInternal[TIOAdapter](TIOAdapter adapter, Memory`1 buffer
2024-01-17 14:33:41:     at System.Net.Http.HttpConnection.<CheckUsabilityOnScavenge>g__ReadAheadWithZeroByteReadAsync|45_0(
2024-01-17 14:33:41:     at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
2024-01-17 14:33:41:  HttpRequestException -- An error occurred while sending the request.
2024-01-17 14:33:41:  Exception.TargetSite: Void MoveNext()
2024-01-17 14:33:41:  Exception.Data: ( )
2024-01-17 14:33:41:  Exception.Source: System.Net.Http
2024-01-17 14:33:41:  Exception.HResult: -2146232800
2024-01-17 14:33:41:    Exception.StackTrace: 
2024-01-17 14:33:41:     at System.Net.Http.HttpConnection.SendAsyncCore(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken
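Since several of these failures are TLS-level ("Connection reset by peer" during the handshake, "The SSL connection could not be established") rather than plain connection refusals, a handshake probe helps separate TCP reachability from certificate/TLS problems. A sketch (host is a placeholder; verification is disabled here on the assumption that the RCS certificate is self-signed):

```python
# TLS handshake probe against the RCS port. Distinguishes "port is open
# but the TLS handshake fails" from "connection refused". Host/port are
# placeholders for your Connection Server.
import socket
import ssl


def probe_tls(host, port, timeout=5.0):
    """Attempt a TLS handshake; return a status string."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE  # assumption: self-signed RCS cert
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with ctx.wrap_socket(raw, server_hostname=host) as tls:
                return "handshake ok (%s)" % tls.version()
    except OSError as e:
        return "handshake failed: %s" % e


if __name__ == "__main__":
    print(probe_tls("192.168.0.1", 4433))  # placeholder RCS address
```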

Checking ports:

[root@mysql02 ~]# netstat -tulpn
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name    
tcp        0      0 0.0.0.0:27100           0.0.0.0:*               LISTEN      109792/mongod       
tcp        0      0 0.0.0.0:4433            0.0.0.0:*               LISTEN      39985/deadlinercs.e 
tcp6       0      0 :::27100                :::*                    LISTEN      109792/mongod
tcp6       0      0 :::17000                :::*                    LISTEN      1292/deadlinelaunch 
tcp6       0      0 ::1:40466               :::*                    LISTEN      39985/deadlinercs.e 
tcp6       0      0 ::1:44562               :::*                    LISTEN      1292/deadlinelaunch

I tried to dumb this down even more by removing the Pulse and Balancer processes I had running on farm nodes, since this traceback was appearing in their logs. I then tried a simple external submission of a Python script that echoes sys.path and os.environ. I also found the Python Settings in the Configure Repository Options and set the Event Sandbox Python Version to 2, since the deadlinesandbox process on my renderXX workers was passing a -py-ver 3 argument. It now passed -py-ver 2, but none of this made a difference.
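For reference, the test script is just a sketch along these lines (the actual print_env.py may differ slightly; written to run under both the py2 and py3 sandboxes):

```python
# Roughly what my print_env.py does: dump the process environment and the
# module search path so the environment the Worker hands the task is
# visible in the task log.
from __future__ import print_function
import os
import pprint
import sys

# Sorted key/value pairs mirror the STDOUT lines shown in the task logs.
pprint.pprint(sorted(os.environ.items()))
pprint.pprint(sys.path)
```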

The first task spins; the remaining tasks do as they are told.

The first task takes minutes to get picked up and just stalls in a Waiting to Start state. It is what spit the following into the console before the other tasks eventually ran:

2024-01-17 19:16:59:  Running script PythonSubmission (/home/sdugaro/Thinkbox/Deadline10/cache/q3UNeMyV11ydp4KDU3ru0w0QmY/scripts/Submission/PythonSubmission.py)
2024-01-17 20:03:03:  Error occurred while updating pulse cache: Connection Server error: POST https://192.168.XX.XX:4433/db/pulse/modified returned "One or more errors occurred. (An error occurred while sending the request.)" (System.Net.WebException)
2024-01-17 20:03:03:     at Deadline.StorageDB.Proxy.Utils.ProxyUtils.HandleException(Exception e, NetworkManager manager, String server, Int32 port, String certificatePath)
2024-01-17 20:03:03:     at Deadline.StorageDB.Proxy.ProxyPulseStorage.GetModifiedPulses(PulseInfoSettings[]& modifiedPulses, String[]& deletedPulseIds, Nullable`1 lastInfoAutoUpdate, Nullable`1 lastSettingsAutoUpdate, Nullable`1 lastDeletionAutoUpdate)
2024-01-17 20:03:03:     at Deadline.StorageDB.PulseStorage.a(Object nc)
2024-01-17 20:03:03:  ---------- Inner Stack Trace (System.Net.Sockets.SocketException) ----------
2024-01-17 20:03:03:  Error occurred while updating Worker cache: Connection Server error: POST https://192.168.XX.XX:4433/db/slaves/modified?transactionID=a574ab8c-22a8-41d1-8395-3afcdb11ed7c returned "One or more errors occurred. (An error occurred while sending the request.)" (System.Net.WebException)
2024-01-17 20:03:03:  Error occurred while updating RCS cache: Connection Server error: POST https://192.168.XX.XX:4433/db/proxyservers/modified returned "One or more errors occurred. (Connection refused (192.168.XX.XX:4433))" (System.Net.WebException)
2024-01-17 20:03:03:     at Deadline.StorageDB.Proxy.Utils.ProxyUtils.HandleException(Exception e, NetworkManager manager, String server, Int32 port, String certificatePath)
2024-01-17 20:03:03:     at Deadline.StorageDB.Proxy.ProxyProxyServerStorage.GetModifiedProxyServers(ProxyServerInfoSettings[]& modifiedProxyServers, String[]& deletedProxyServerIds, Nullable`1 lastSettingsAutoUpdate, Nullable`1 lastInfoAutoUpdate, Nullable`1 lastDeletionAutoUpdate)
2024-01-17 20:03:03:     at Deadline.StorageDB.ProxyServerStorage.a(Object mv)
2024-01-17 20:03:03:  ---------- Inner Stack Trace (System.Net.Sockets.SocketException) ----------
2024-01-17 20:03:03:     at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
2024-01-17 20:03:03:     at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
2024-01-17 20:03:03:     at System.Net.Sockets.Socket.<ConnectAsync>g__WaitForConnectWithCancellation|277_0(AwaitableSocketAsyncEventArgs saea, ValueTask connectTask, CancellationToken cancellationToken)
2024-01-17 20:03:03:     at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
2024-01-17 20:03:03:  Error occurred while updating Worker reports: Connection Server error: GET https://192.168.XX.XX:4433/db/slaves/report/render111 returned "One or more errors occurred. (Connection refused (192.168.XX.XX:4433))" (System.Net.WebException)
2024-01-17 20:03:03:  Error occurred while updating job cache: Connection Server error: POST https://192.168.XX.XX:4433/db/jobs/modified?transactionID=00000000-0000-0000-0000-000000000000&invalidateCache=true returned "One or more errors occurred. (Connection refused (192.168.XX.XX:4433))" (System.Net.WebException)
2024-01-17 20:03:03:     at Deadline.StorageDB.Proxy.Utils.ProxyUtils.HandleException(Exception e, NetworkManager manager, String server, Int32 port, String certificatePath)
2024-01-17 20:03:03:     at Deadline.StorageDB.Proxy.ProxyJobStorage.GetModifiedJobs(Job[]& modifiedJobs, String[]& deletedJobIds, Boolean& hasMore, String& transactionID, Nullable`1 batchQueryTime, Nullable`1 deleteQueryTime, Nullable`1 firstBatchFlag)
2024-01-17 20:03:03:     at Deadline.StorageDB.JobStorage.b(Object kp)
2024-01-17 20:03:03:  ---------- Inner Stack Trace (System.Net.Sockets.SocketException) ----------

The tasks that ran the Python script show the relevant environment settings as:


=======================================================
Log
=======================================================
2024-01-17 20:07:27:  0: Start Job timeout is disabled.
2024-01-17 20:07:27:  0: Task timeout is disabled.
2024-01-17 20:07:27:  0: Plugin rendering frame(s): 5
2024-01-17 20:07:27:  0: Executing plugin command of type 'Render Task'
2024-01-17 20:07:27:  0: INFO: Stdout Redirection Enabled: True
2024-01-17 20:07:27:  0: INFO: Asynchronous Stdout Enabled: False
2024-01-17 20:07:27:  0: INFO: Stdout Handling Enabled: True
2024-01-17 20:07:27:  0: INFO: Popup Handling Enabled: False
2024-01-17 20:07:27:  0: INFO: Using Process Tree: True
2024-01-17 20:07:27:  0: INFO: Hiding DOS Window: True
2024-01-17 20:07:27:  0: INFO: Creating New Console: False
2024-01-17 20:07:27:  0: INFO: Running as user: root
2024-01-17 20:07:27:  0: INFO: Executable: "/usr/bin/python"
2024-01-17 20:07:27:  0: INFO: Argument: -u "/var/tmp/render118/jobsData/65a8a2609671d319346a11c8/print_env.py" 
2024-01-17 20:07:27:  0: INFO: Full Command: "/usr/bin/python" -u "/var/tmp/render118/jobsData/65a8a2609671d319346a11c8/print_env.py" 
2024-01-17 20:07:27:  0: INFO: Startup Directory: "/usr/bin"
2024-01-17 20:07:27:  0: INFO: Process Priority: BelowNormal
2024-01-17 20:07:27:  0: INFO: Process Affinity: default
2024-01-17 20:07:27:  0: INFO: Process is now running

2024-01-17 20:07:27:  0: STDOUT:  ('DEADLINE_BIN', '/usr/bin/python'),
2024-01-17 20:07:27:  0: STDOUT:  ('DEADLINE_CONFIG_FILE',
2024-01-17 20:07:27:  0: STDOUT:   '/apps/apps/lin_centos7/Thinkbox/deadline/10/deadline.ini'),
2024-01-17 20:07:27:  0: STDOUT:  ('DEADLINE_PATH', '/apps/apps/lin_centos7/Thinkbox/deadline/10.2.1.1/bin'),
2024-01-17 20:07:27:  0: STDOUT:  ('DEADLINE_USER', 'sdugaro'),
2024-01-17 20:07:27:  0: STDOUT:  ('USER', 'root'),
2024-01-17 20:07:27:  0: STDOUT:  ('SHELL', '/bin/sh'),
2024-01-17 20:07:27:  0: STDOUT:  ('PWD', '/apps/apps/lin_centos7/Thinkbox/deadline/10.2.1.1/bin')

# No PYTHONPATH, but sys.path is set to:

2024-01-17 20:07:27:  0: STDOUT: ['/var/tmp/render118/jobsData/65a8a2609671d319346a11c8',
2024-01-17 20:07:27:  0: STDOUT:  '/usr/lib64/python27.zip',
2024-01-17 20:07:27:  0: STDOUT:  '/usr/lib64/python2.7',
2024-01-17 20:07:27:  0: STDOUT:  '/usr/lib64/python2.7/plat-linux2',
2024-01-17 20:07:27:  0: STDOUT:  '/usr/lib64/python2.7/lib-tk',
2024-01-17 20:07:27:  0: STDOUT:  '/usr/lib64/python2.7/lib-old',
2024-01-17 20:07:27:  0: STDOUT:  '/usr/lib64/python2.7/lib-dynload',
2024-01-17 20:07:27:  0: STDOUT:  '/usr/lib64/python2.7/site-packages',
2024-01-17 20:07:27:  0: STDOUT:  '/usr/lib64/python2.7/site-packages/gtk-2.0',
2024-01-17 20:07:27:  0: STDOUT:  '/usr/lib/python2.7/site-packages']

And long after all the subsequent tasks complete, the first task just keeps spewing the same messages to the console indefinitely (30 minutes in):


2024-01-17 20:03:03:  ---------- Inner Stack Trace (System.Net.Sockets.SocketException) ----------
2024-01-17 20:03:03:     at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
2024-01-17 20:03:03:  Error occurred while updating limit group cache: Connection Server error: GET https://192.168.4.23:4433/db/limitgroups/modified?lastAutoUpdate=0001-01-01T00%3A00%3A00.0000000Z&lastDeletion=2024-01-18T02%3A41%3A28.7660000Z returned "One or more errors occurred. (Connection refused (192.168.4.23:4433))" (System.Net.WebException)
2024-01-17 20:03:03:     at Deadline.StorageDB.Proxy.Utils.ProxyUtils.HandleException(Exception e, NetworkManager manager, String server, Int32 port, String certificatePath)
2024-01-17 20:03:03:  ---------- Inner Stack Trace (System.Net.Sockets.SocketException) ----------
2024-01-17 20:03:03:     at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
2024-01-17 20:03:03:  Error occurred while reloading network settings: Connection Server error: GET https://192.168
2024-01-17 20:03:03:  Error occurred while updating balancer cache: Connection Server error: POST
2024-01-17 20:03:03:     at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
2024-01-17 20:03:03:     at Deadline.StorageDB.Proxy.Utils.ProxyUtils.HandleException(Exception e, NetworkManager manager, String server, Int32 port, String certificatePath)
2024-01-17 20:03:03:  ---------- Inner Stack Trace (System.Net.Sockets.SocketException) ----------
2024-01-17 20:03:03:     at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
....
2024-01-17 20:07:04:  ERROR: UpdateClient.MaybeSendRequestNow caught an exception: POST 
2024-01-17 20:07:04:  The batch request reach the limit of 2 retries
2024-01-17 20:07:05:  Error occurred while updating Deadline AWS Resource Tracker Status label: Connection Server error:

I'm not using AWS, but the AWS Portal panel was open, so I also tried logging out. Then I re-queued the first task and it picked up and completed immediately ?!?!?

Curiously, the failing/re-queued Task 0 produced no errors in the console, but its log was considerably different from the others. Essentially, all the bash environment variables that our pipeline configures (licensing, app server DCC paths, etc…) were no longer present in os.environ. This came as a big surprise… it was a minimal /bin/sh environment, while the others pushed through all the environment variables defined in our CentOS 7 bash terminal environment. Why the difference???

=======================================================
Log
=======================================================
2024-01-17 20:41:54:  0: Loading Job's Plugin timeout is Disabled
2024-01-17 20:41:54:  0: SandboxedPlugin: Render Job As User disabled, running as current user 'root'
2024-01-17 20:41:58:  0: Executing plugin command of type 'Initialize Plugin'
2024-01-17 20:41:58:  0: INFO: Executing plugin script '/var/tmp/render27/plugins/65a8a2609671d319346a11c8/Python.py'
2024-01-17 20:41:58:  0: INFO: Plugin execution sandbox using Python version 3
2024-01-17 20:41:58:  0: INFO: Single Frames Only: False
2024-01-17 20:41:58:  0: INFO: About: Python Plugin for Deadline
2024-01-17 20:41:58:  0: INFO: The job's environment will be merged with the current environment before rendering
2024-01-17 20:41:58:  0: Done executing plugin command of type 'Initialize Plugin'
2024-01-17 20:41:58:  0: Start Job timeout is disabled.
2024-01-17 20:41:58:  0: Task timeout is disabled.
2024-01-17 20:41:58:  0: Loaded job: external python (65a8a2609671d319346a11c8)
2024-01-17 20:41:58:  0: Executing plugin command of type 'Start Job'
2024-01-17 20:41:58:  0: DEBUG: S3BackedCache Client is not installed.
2024-01-17 20:41:58:  0: INFO: Executing global asset transfer preload script '/var/tmp/render27/plugins/65a8a2609671d319346a11c8/GlobalAssetTransferPreLoad.py'
2024-01-17 20:41:58:  0: INFO: Looking for legacy (pre-10.0.26) AWS Portal File Transfer...
2024-01-17 20:41:58:  0: INFO: Looking for legacy (pre-10.0.26) File Transfer controller in /opt/Thinkbox/S3BackedCache/bin/task.py...
2024-01-17 20:41:58:  0: INFO: Could not find legacy (pre-10.0.26) AWS Portal File Transfer.
2024-01-17 20:41:58:  0: INFO: Legacy (pre-10.0.26) AWS Portal File Transfer is not installed on the system.
2024-01-17 20:41:58:  0: Done executing plugin command of type 'Start Job'
2024-01-17 20:41:58:  0: Plugin rendering frame(s): 1
2024-01-17 20:41:58:  0: Executing plugin command of type 'Render Task'
2024-01-17 20:41:58:  0: INFO: Stdout Redirection Enabled: True
2024-01-17 20:41:58:  0: INFO: Asynchronous Stdout Enabled: False
2024-01-17 20:41:58:  0: INFO: Stdout Handling Enabled: True
2024-01-17 20:41:58:  0: INFO: Popup Handling Enabled: False
2024-01-17 20:41:58:  0: INFO: Using Process Tree: True
2024-01-17 20:41:58:  0: INFO: Hiding DOS Window: True
2024-01-17 20:41:58:  0: INFO: Creating New Console: False
2024-01-17 20:41:58:  0: INFO: Running as user: root
2024-01-17 20:41:58:  0: INFO: Executable: "/usr/bin/python"
2024-01-17 20:41:58:  0: INFO: Argument: -u "/var/tmp/render27/jobsData/65a8a2609671d319346a11c8/print_env.py" 
2024-01-17 20:41:58:  0: INFO: Full Command: "/usr/bin/python" -u "/var/tmp/render27/jobsData/65a8a2609671d319346a11c8/print_env.py" 
2024-01-17 20:41:58:  0: INFO: Startup Directory: "/usr/bin"
2024-01-17 20:41:58:  0: INFO: Process Priority: BelowNormal
2024-01-17 20:41:58:  0: INFO: Process Affinity: default
2024-01-17 20:41:58:  0: INFO: Process is now running
2024-01-17 20:41:58:  0: STDOUT: [('DEADLINE_BIN', '/usr/bin/python'),
2024-01-17 20:41:58:  0: STDOUT:  ('DEADLINE_CONFIG_FILE',
2024-01-17 20:41:58:  0: STDOUT:   '/apps/apps/lin_centos7/Thinkbox/deadline/10/deadline.ini'),
2024-01-17 20:41:58:  0: STDOUT:  ('DEADLINE_PATH', '/apps/apps/lin_centos7/Thinkbox/deadline/10.2.1.1/bin'),
2024-01-17 20:41:58:  0: STDOUT:  ('DEADLINE_USER', 'sdugaro'),
2024-01-17 20:41:58:  0: STDOUT:  ('LANG', 'en_US.UTF-8'),
2024-01-17 20:41:58:  0: STDOUT:  ('LOGNAME', 'root'),
2024-01-17 20:41:58:  0: STDOUT:  ('PATH',
2024-01-17 20:41:58:  0: STDOUT:   '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/bin:/usr/bin:/sbin:/bin:/usr/bin'),
2024-01-17 20:41:58:  0: STDOUT:  ('PWD', '/apps/apps/lin_centos7/Thinkbox/deadline/10.2.1.1/bin'),
2024-01-17 20:41:58:  0: STDOUT:  ('SHELL', '/bin/sh'),
2024-01-17 20:41:58:  0: STDOUT:  ('SHLVL', '0'),
2024-01-17 20:41:58:  0: STDOUT:  ('USER', 'root')]
2024-01-17 20:41:58:  0: STDOUT: ['/var/tmp/render27/jobsData/65a8a2609671d319346a11c8',
2024-01-17 20:41:58:  0: STDOUT:  '/usr/lib64/python27.zip',
2024-01-17 20:41:58:  0: STDOUT:  '/usr/lib64/python2.7',
2024-01-17 20:41:58:  0: STDOUT:  '/usr/lib64/python2.7/plat-linux2',
2024-01-17 20:41:58:  0: STDOUT:  '/usr/lib64/python2.7/lib-tk',
2024-01-17 20:41:58:  0: STDOUT:  '/usr/lib64/python2.7/lib-old',
2024-01-17 20:41:58:  0: STDOUT:  '/usr/lib64/python2.7/lib-dynload',
2024-01-17 20:41:58:  0: STDOUT:  '/usr/lib64/python2.7/site-packages',
2024-01-17 20:41:58:  0: STDOUT:  '/usr/lib64/python2.7/site-packages/gtk-2.0',
2024-01-17 20:41:58:  0: STDOUT:  '/usr/lib/python2.7/site-packages']
2024-01-17 20:41:59:  0: INFO: Process exit code: 0
2024-01-17 20:41:59:  0: Done executing plugin command of type 'Render Task'

=======================================================
Details
=======================================================
Date: 01/17/2024 20:42:00
Frames: 1
Job Submit Date: 01/17/2024 20:00:32
Job User: sdugaro
Average RAM Usage: 112132096 (1%)
Peak RAM Usage: 112132096 (1%)
Average CPU Usage: 3%
Peak CPU Usage: 3%
Used CPU Clocks (x10^6 cycles): 0
Total CPU Clocks (x10^6 cycles): 0

=======================================================
Worker Information
=======================================================
Worker Name: render27
Version: v10.2.1.1 Release (094cbe890)
Operating System: CentOS Linux release 7.5.1804 (Core) 
Machine User: root
IP Address: 192.168.5.27
MAC Address: 00:25:90:A7:05:A9
CPU Architecture: x86_64
CPUs: 32
CPU Usage: 6%
Memory Usage: 17.4 GB / 62.7 GB (27%)
Free Disk Space: 56.560 GB (28.280 GB on /, 28.280 GB on /)
Video Card: Matrox Electronics Systems Ltd. MGA G200eW WPCM450 (rev 0a)

Afaict, the latter tasks picked renderXXX machines which had a deadlinelauncher daemon setup, but this one only had a deadlineworker daemon configured.

Bit of a red herring… the deadlineworker daemon was set up differently from the deadlinelauncher daemon: the latter invokes /usr/bin/bash -l -c and the former doesn't.
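That difference alone would explain the stripped-down os.environ: only a login shell (`bash -l`) sources /etc/profile and the profile.d scripts where pipelines typically export licensing and DCC variables, while a bare `/bin/sh -c` inherits only what the daemon itself had. A quick sketch to demonstrate (assumes a Linux host with bash installed; which variables differ depends on your profile scripts):

```python
# Compare the environment a task sees under "/bin/sh -c" vs "bash -l -c".
# The login shell sources /etc/profile and friends, so pipeline-exported
# variables appear only in the second set.
import shutil
import subprocess


def env_vars(cmd):
    """Run `cmd`, which prints `env` output, and return the variable names."""
    out = subprocess.check_output(cmd, universal_newlines=True)
    return {line.split("=", 1)[0] for line in out.splitlines() if "=" in line}


sh_env = env_vars(["/bin/sh", "-c", "env"])

bash = shutil.which("bash")
if bash:
    login_env = env_vars([bash, "-l", "-c", "env"])
    # Variables that only exist once the login shell runs its startup files:
    print(sorted(login_env - sh_env))
```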

But I submitted jobs with this same Python file to deadlineworker-only daemon machines and deadlinelauncher-only daemon machines (logged out of the AWS panel), and the result is exactly the same, with the first task hanging in each job. The task will spin indefinitely until I re-queue it, after which it completes immediately… Any idea what this "LEADER TASK" is doing that is so different, and can I suppress it?

I then sent this job only to the render27:deadlineworker machine. Same thing. Any amount of failing/suspending/completing/deleting the job makes no difference and does not stop this hung process… the Worker panel shows render27 in an Idle state until the other jobs render with it, and then it goes back to an Idle state with the task still spinning on "Waiting to Start". I also cannot re-queue the first task until after the remaining tasks complete. After manually forcing the job to complete, I can connect to the Worker log from the Task panel and see Render Thread 0 completing successfully, but the Scheduler Thread is having issues related to dash, which was the SSL config set up with the AWS Portal installation:

2024-01-17 21:45:32:  Scheduler Thread - POST https://192.168.XX.XX:4433/rcs/v1/assigntask returned "One or more errors occurred. (A task was canceled.)" (Deadline.Net.Clients.Http.DeadlineHttpRequestException)
2024-01-17 21:46:21:  Scheduler Thread - POST https://192.168.XX.XX:4433/rcs/v1/assigntask returned "One or more errors occurred. (An error occurred while sending the request.)" (Deadline.Net.Clients.Http.DeadlineHttpRequestException)
2024-01-17 21:46:23:  Scheduler Thread - Connection Server error: GET https://192.168.XX.XX:4433/db/dash/dawsSettings?invalidateCache=true returned "One or more errors occurred. (Connection refused (192.168.XX.XX:4433))" (System.Net.WebException)
2024-01-17 21:46:23:  Scheduler Thread - >>> WORKER LOST CONNECTION TO THE REPOSITORY, SKIPPING TASK DEQUEUING
2024-01-17 21:46:24:  Scheduler Thread - Connection Server error: GET https://192.168.XX.XX:4433/db/dash/dawsSettings?invalidateCache=true returned "One or more errors occurred. (Connection refused (192.168.XX.XX:4433))" (System.Net.WebException)
2024-01-17 21:46:30:  Scheduler Thread - >>> WORKER LOST CONNECTION TO THE REPOSITORY, SKIPPING TASK DEQUEUING
2024-01-17 21:46:32:  Scheduler Thread - Connection Server error: GET https://192.168.XX.XX:4433/db/dash/dawsSettings?invalidateCache=true returned "One or more errors occurred. (Connection refused (192.168.XX.XX:4433))" (System.Net.WebException)
2024-01-17 21:46:32:  Scheduler Thread - >>> WORKER LOST CONNECTION TO THE REPOSITORY, SKIPPING TASK DEQUEUING
2024-01-17 21:46:33:  Scheduler Thread - Connection Server error: GET https://192.168.XX.XX:4433/db/dash/dawsSettings?invalidateCache=true returned "One or more errors occurred. (Connection refused (192.168.XX.XX:4433))" (System.Net.WebException)
2024-01-17 21:46:40:  Scheduler Thread - >>> WORKER LOST CONNECTION TO THE REPOSITORY, SKIPPING TASK DEQUEUING
2024-01-17 21:46:41:  Scheduler Thread - Connection Server error: GET https://192.168.XX.XX:4433/db/dash/dawsSettings?invalidateCache=true returned "One or more errors occurred. (Connection refused (192.168.XX.XX:4433))" (System.Net.WebException)
2024-01-17 21:46:41:  Scheduler Thread - >>> WORKER LOST CONNECTION TO THE REPOSITORY, SKIPPING TASK DEQUEUING
2024-01-17 21:46:43:  Scheduler Thread - Connection Server error: GET https://192.168.XX.XX:4433/db/dash/dawsSettings?invalidateCache=true returned "One or more errors occurred. (Connection refused (192.168.XX.XX:4433))" (System.Net.WebException)
2024-01-17 21:46:49:  Scheduler Thread - >>> WORKER LOST CONNECTION TO THE REPOSITORY, SKIPPING TASK DEQUEUING
2024-01-17 21:46:50:  Scheduler Thread - Connection Server error: GET https://192.168.XX.XX:4433/db/dash/dawsSettings?invalidateCache=true returned "One or more errors occurred. (Connection refused (192.168.XX.XX:4433))" (System.Net.WebException)
2024-01-17 21:46:56:  Scheduler Thread - >>> WORKER LOST CONNECTION TO THE REPOSITORY, SKIPPING TASK DEQUEUING
2024-01-17 21:46:58:  Scheduler Thread - Connection Server error: GET https://192.168.XX.XX:4433/db/dash/dawsSettings?invalidateCache=true returned "One or more errors occurred. (Connection refused (192.168.XX.XX:4433))" (System.Net.WebException)
2024-01-17 21:47:04:  Scheduler Thread - >>> WORKER LOST CONNECTION TO THE REPOSITORY, SKIPPING TASK DEQUEUING
2024-01-17 21:47:05:  Scheduler Thread - Connection Server error: GET https://192.168.XX.XX:4433/db/dash/dawsSettings?invalidateCache=true returned "One or more errors occurred. (Connection refused (192.168.XX.XX:4433))" (System.Net.WebException)
2024-01-17 21:47:05:  Scheduler Thread - >>> WORKER LOST CONNECTION TO THE REPOSITORY, SKIPPING TASK DEQUEUING
2024-01-17 21:47:07:  Scheduler Thread - Connection Server error: GET https://192.168.XX.XX:4433/db/dash/dawsSettings?invalidateCache=true returned "One or more errors occurred. (Connection refused (192.168.XX.XX:4433))" (System.Net.WebException)
2024-01-17 21:47:13:  Scheduler Thread - >>> WORKER LOST CONNECTION TO THE REPOSITORY, SKIPPING TASK DEQUEUING
2024-01-17 21:47:15:  Scheduler Thread - >>> WORKER LOST CONNECTION TO THE REPOSITORY, SKIPPING TASK DEQUEUING
2024-01-17 21:47:18:  Scheduler Thread - >>> WORKER LOST CONNECTION TO THE REPOSITORY, SKIPPING TASK DEQUEUING
2024-01-17 21:47:20:  Scheduler Thread - >>> WORKER LOST CONNECTION TO THE REPOSITORY, SKIPPING TASK DEQUEUING
2024-01-17 21:47:22:  Scheduler Thread - Job's Limit Groups: 
2024-01-17 21:47:44:  Scheduler Thread - Job's Limit Groups: 

Hmm! So it seems that every time I tried to submit a job, the Scheduler thread would take down my RCS server. While it was going down, the smattering of socket errors started to manifest on my single render27 deadlineworker machine, because it couldn’t authenticate: various messages all trying to figure out why. It seems that caused my RCS server to overflow and eventually go down. The daemon would restart to bring the RCS server back online, at which point it started asking if there were any tasks that needed attention, and this got all the other Tasks running to completion. The RCS server was now up and running again without being pummeled by scheduler thread authentication queries, hence the ability to Re-Queue the original Task with the scheduler thread.
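While chasing this, a quick way to tell “RCS process is down” apart from “RCS is up but rejecting auth” is a bare TCP probe of the RCS port. A sketch, with the host as a hypothetical placeholder for the redacted address in the logs above:

```shell
#!/bin/sh
# Probe the RCS port using bash's /dev/tcp; a refused/timed-out connect matches
# the "Connection refused (…:4433)" errors in the Worker log, whereas a
# successful connect would point at an application-level (auth/SSL) problem.
RCS_HOST=${RCS_HOST:-192.168.0.10}   # hypothetical placeholder; use your RCS address
RCS_PORT=${RCS_PORT:-4433}

if timeout 2 bash -c "exec 3<>/dev/tcp/$RCS_HOST/$RCS_PORT" 2>/dev/null; then
    status="up"
else
    status="down"
fi
echo "RCS $RCS_HOST:$RCS_PORT is $status"
```

Running this in a loop during a submission would have shown the port dropping out exactly when the Scheduler Thread errors started.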

Now, while I had a job running with one of these original scheduler Tasks waiting, I cleared the Console and forced my RCS service to stop… and boom, all those messages appeared in the Console again immediately.

What this boiled down to was that the bash -l -c in the RCS deadline10launcher.service picked up a PYTHONPATH from our environment that was very unwelcome… (vfxplatform/python2.7/site-packages).
Clearing out PYTHONPATH just prior to invoking deadlinelauncher immediately got the stalled “Scheduler” Task going again, without any need to Re-Queue it.
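The workaround can be sketched like this, assuming the stock Linux install path for deadlinelauncher (the simulated PYTHONPATH value stands in for whatever your profile exports):

```shell
#!/bin/sh
# Scrub PYTHONPATH from the environment immediately before launching, so
# nothing python2-era leaks into the launcher or the RCS it starts.
export PYTHONPATH=/tools/vfxplatform/python2.7/site-packages   # simulate the unwelcome value

# env -u removes the variable from the child's environment entirely;
# here we just echo what the child would see instead of really launching.
launcher_env=$(env -u PYTHONPATH sh -c 'echo "${PYTHONPATH:-clean}"')
echo "$launcher_env"

# In the real launch script this becomes something like:
#   env -u PYTHONPATH /opt/Thinkbox/Deadline10/bin/deadlinelauncher
```

Using `env -u` (or `unset PYTHONPATH` in the service script) keeps the fix local to the Deadline processes without touching the rest of the login environment.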

Tested the job submission one more time, and it picked up the first scheduler task immediately without issue, followed by the secondary tasks.

Wow, moral of the story… beware, or rather be aware, of what’s in your bash shell.
There are Python libraries that can easily conflict in a Python 2-to-3 transitional world.

Nice work! I think we can update our launch scripts for Deadline applications to push our Python dependencies to the top of PYTHONPATH so other applications can’t blow us up. This used to be the case in our Python 2 implementation, but in 10.3 we moved completely to Python 3.10, and it looks like that’s either no longer happening or not happening correctly. I’ll add this thread to our tracking ticket and get back here if we have anything helpful to add.
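For readers wanting to self-serve in the meantime, prepending like that might look like the following in a launch wrapper. Both paths here are illustrative placeholders, not the actual locations Deadline uses:

```shell
#!/bin/sh
# Put a hypothetical Deadline-owned Python lib dir at the FRONT of PYTHONPATH
# so site-wide entries can't shadow it; preserve any existing value after it.
export PYTHONPATH=/tools/vfxplatform/python2.7/site-packages   # pre-existing site path (simulated)

DEADLINE_PYLIBS=/opt/Thinkbox/Deadline10/lib/python3           # hypothetical location
export PYTHONPATH="$DEADLINE_PYLIBS${PYTHONPATH:+:$PYTHONPATH}"
echo "$PYTHONPATH"
```

The `${PYTHONPATH:+:$PYTHONPATH}` expansion only appends the colon and old value when a previous PYTHONPATH existed, so the variable never ends up with a stray leading or trailing colon.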

Otherwise any fixes will be up in the release notes as usual.
