AWS Thinkbox Discussion Forums

Erratic timeout when communicating with server

I’m getting very odd timeout and communication issues with Deadline.
Machines fail to start the Deadline Monitor application (often 4+ times before it then just works).
Render jobs (MayaCmd) also fail on the first one or more attempts before succeeding, returning the error “Failed to load the plugin because: Could not initialize the plugin sandbox”; digging into the log shows “The connection attempt timed out”. The funny thing is that they’re far less likely to error if a connection was already established, such as when moving from one task immediately to the next, but leave a gap between one task and the next and it errors again.

I’ve searched posts and read the docs, and the only thing I’ve found pointed to a possible DNS issue… and I have none. All machines can ping by IP and by name, yet connection attempts to both IP and name show up as timed out in the log. All installs of Deadline are identical (the install was run through a script), yet some machines seem more prone to the error than others, and none work 100% of the time.
Even on machines that run Monitor “more reliably” (i.e. it works often or needs only a few attempts), render jobs still exhibit this connection problem: the first attempt errors, and retries are needed to succeed.
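
Since ping only confirms ICMP reachability, a raw TCP probe may help separate name-resolution delays from connection timeouts, and show whether failures follow an idle gap. The following is a minimal sketch, not anything Deadline provides: the function names `probe` and `run_probes` are made up for illustration, and the host and port must be replaced with whatever the database or RCS actually listens on.

```python
import socket
import time


def probe(host, port, timeout=5.0):
    """Try one TCP connection; return (resolve_seconds, connect_seconds, error).

    connect_seconds is None if name resolution failed; error is None on success.
    """
    start = time.perf_counter()
    try:
        # IPv4 only, since the network in question uses IPv4 addresses.
        infos = socket.getaddrinfo(host, port, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return time.perf_counter() - start, None, f"resolve failed: {exc}"
    resolved = time.perf_counter()

    address = infos[0][4]  # (ip, port) of the first resolved result
    try:
        with socket.create_connection(address, timeout=timeout):
            pass  # connection opened; close it again immediately
    except OSError as exc:  # covers timeouts and refused connections
        return resolved - start, time.perf_counter() - resolved, f"connect failed: {exc}"
    return resolved - start, time.perf_counter() - resolved, None


def run_probes(host, port, attempts=5, idle_seconds=30.0, timeout=5.0):
    """Probe repeatedly with an idle gap between attempts and print timings."""
    results = []
    for attempt in range(1, attempts + 1):
        resolve_s, connect_s, error = probe(host, port, timeout)
        connect_text = "n/a" if connect_s is None else f"{connect_s:.3f}s"
        status = "ok" if error is None else error
        print(f"attempt {attempt}: resolve {resolve_s:.3f}s, connect {connect_text}, {status}")
        results.append((resolve_s, connect_s, error))
        if attempt < attempts:
            time.sleep(idle_seconds)  # idle gap, to mimic a pause between tasks
    return results
```

Run from a worker as, for example, `run_probes("my-deadline-server", 12345, idle_seconds=60)`, where both the host name and the port are placeholders. Comparing the results by name and by IP, and with short versus long idle gaps, should show whether the slow step is resolution or the connection itself.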

My network setup is simple… just a Windows file share and some machines in a standard Windows workgroup (no domain, no group policy), with static IPv4 addresses.
Since the network is air-gapped, the firewalls are extremely permissive to nonexistent, and pose no problem for the many other applications we run, including our old render manager and a dozen license servers that have been running on this same server in this same configuration for years.
My Deadline workers are set up with a direct connection to the repository, which is on the same server and even the same drive volume as our main project storage (which no machine has any problem accessing). The user account the Deadline service is configured to use has full system access.
A firewall would be the usual suspect; however, as I said, I’ve even disabled them to no effect, and if it were a firewall issue it should never work, rather than work only sometimes.

Deadline 10.1.12
MongoDB 3.6.19
Installed as service on workers
Pulse is running
Deadline RCS is running
Remote commands succeed to all workers (seemingly without timeout issues)
Windows Server 2016 (Deadline server)
Windows 10 (workers)

Any insights?

Thanks in advance.
