Something really odd is happening. We have lost connectivity to a cluster of machines (IT issue), but Deadline never marked the machines as stalled:
It's 5:54am right now, long after the stall should have been noticed…
Seems like a dead mount point hung up the dependency check, which in turn hung up Pulse completely. Should that happen? They all run in separate processes precisely to avoid such a scenario.
Hmmm, it definitely shouldn’t affect Deadline specifically… Was it an NFS mount? I know I’ve had some really nasty hanging behaviour (even just at the OS level, trying to do ls and such) with NFS specifically when the server goes down.
There might be some timeouts we can tweak on our side when doing file operations, but I can’t imagine there is NO timeout on those by default… Might be something else going on here. Like you alluded to – in the worst case, it should be killing the process once the timeout is reached. What are your timeout settings for the various housecleaning processes?
Could you get us the tail of the Pulse log? Would be nice to know what it was doing at the time (presumably scanning something on the failed mount point).
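To illustrate the kind of timeout guard I mean, here is a minimal sketch in Python. It is not how Deadline does it internally; the helper name `path_responds` and the 5-second default are made up for the example. The idea is to touch the path from a throwaway child process and give up if the child does not return in time, so the parent never blocks on a dead mount.

```python
import subprocess
import sys


def path_responds(path, timeout_s=5.0):
    """Return True if os.stat(path) finishes in a child process within timeout_s.

    Hypothetical helper for illustration only. The stat call runs in a
    separate interpreter so that a hung mount blocks the child, not us.
    """
    probe = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        # Exit code 0 means the stat succeeded; non-zero means it raised
        # (for example, the path does not exist).
        return probe.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Best effort only: on a hard NFS mount the child can sit in
        # uninterruptible I/O and ignore the kill until the server returns,
        # so we deliberately do not wait for it to exit here.
        probe.kill()
        return False
```

You would call something like `path_responds("/mnt/renders")` (the path is just an example) before scanning a share, and skip that share when it returns False. Note the caveat in the comment: with a hard NFS mount the stuck child may linger, so this only keeps the caller responsive; it does not clean up the hung process.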
Yeah, it was an NFS mount. The weird thing is the main Pulse process hanging… I tried restarting it, but it hung again, and then automatically resumed operations when the connection was restored around 9:10am:
2016-12-15 05:58:15: Server Thread - Auto Configuration: Configuration sent
2016-12-15 05:58:17: Server Thread - Auto Configuration: Received packet on autoconfig port
2016-12-15 05:58:17: Server Thread - Auto Configuration: Picking configuration based on: LAPRO1797 / ::ffff:172.18.10.44
2016-12-15 05:58:17: Server Thread - Auto Configuration: Match found for rule Permission Check Disable (IPMatch)
2016-12-15 05:58:17: Server Thread - Auto Configuration: Match found for rule Disable Slave Startup Launch (IPMatch)
2016-12-15 05:58:17: Server Thread - Auto Configuration: Created a configuration worth sending
2016-12-15 05:58:17: Server Thread - Auto Configuration: Received a request for configuration from ::ffff:172.18.10.44
2016-12-15 05:58:17: Server Thread - Auto Configuration: Configuration sent
2016-12-15 05:58:17: Server Thread - Auto Configuration: Received packet on autoconfig port
2016-12-15 05:58:17: Server Thread - Auto Configuration: Picking configuration based on: LAPRO1797 / ::ffff:172.18.10.44
2016-12-15 05:58:17: Server Thread - Auto Configuration: Match found for rule Permission Check Disable (IPMatch)
2016-12-15 05:58:17: Server Thread - Auto Configuration: Match found for rule Disable Slave Startup Launch (IPMatch)
2016-12-15 05:58:17: Server Thread - Auto Configuration: Created a configuration worth sending
2016-12-15 05:58:17: Server Thread - Auto Configuration: Received a request for configuration from ::ffff:172.18.10.44
2016-12-15 05:58:17: Server Thread - Auto Configuration: Configuration sent
2016-12-15 09:10:12: Server Thread - Auto Configuration: Received packet on autoconfig port
2016-12-15 09:10:12: Listener Thread - ::ffff:172.18.4.58 has connected
2016-12-15 09:10:12: Server Thread - Auto Configuration: Picking configuration based on: LAPRO1706 / ::ffff:172.18.4.32
2016-12-15 09:10:12: Server Thread - Auto Configuration: Received packet on autoconfig port
2016-12-15 09:10:12: Server Thread - Auto Configuration: Match found for rule Permission Check Disable (IPMatch)
2016-12-15 09:10:12: Server Thread - Auto Configuration: Match found for rule Disable Slave Startup Launch (IPMatch)
2016-12-15 09:10:12: Server Thread - Auto Configuration: Created a configuration worth sending
2016-12-15 09:10:12: Server Thread - Auto Configuration: Received a request for configuration from ::ffff:172.18.4.32
2016-12-15 09:10:12: Server Thread - Auto Configuration: Picking configuration based on: LAPRO1777 / ::ffff:172.18.10.24