Slaves keep going offline on machines by themselves

Hi all,
this problem has pushed me to make an account in order to learn more about Deadline.

We are having this strange behaviour where the Slaves on our 24-hour nodes just go offline without anyone stopping them. It's happening to 3 of our 24-hour rendering machines at the moment, and I don't know enough about Deadline to track it down.

We are using Deadline 9 at the moment, and these are the lines I get from the render log:

2019-10-16 11:22:22: 0: STDOUT: [Redshift] Allocating VRAM for device 1 (GeForce RTX 2070)
2019-10-16 11:22:22: 0: STDOUT: [Redshift] Redshift can use up to 5183 MB
2019-10-16 11:22:22: 0: STDOUT: [Redshift] Fixed: 337 MB
2019-10-16 11:22:22: 0: STDOUT: [Redshift] Geo: 3555 MB, Tex: 63 MB, Rays: 1037 MB, NRPR: 262144
2019-10-16 11:22:22: 0: STDOUT: [Redshift] Done! ( 58ms). CUDA reported free mem: 593 MB
2019-10-16 11:22:30: 0: STDOUT: [Redshift] Primary: Computed 32430, Re-used 0
2019-10-16 11:22:40: 0: STDOUT: [Redshift] Secondary: Computed 65536, Re-used 0
2019-10-16 11:22:47: 0: STDOUT: [Redshift] Secondary: Computed 45660, Re-used 0
2019-10-16 11:22:49: 0: STDOUT: [Redshift] Primary: Computed 5167, Re-used 0
2019-10-16 11:22:59: 0: STDOUT: [Redshift] Secondary: Computed 63316, Re-used 0
2019-10-16 11:23:02: 0: STDOUT: [Redshift] Primary: Computed 8214, Re-used 0
2019-10-16 11:23:11: 0: STDOUT: [Redshift] Secondary: Computed 65536, Re-used 0
2019-10-16 11:23:18: 0: STDOUT: [Redshift] Secondary: Computed 46034, Re-used 0
2019-10-16 11:23:21: 0: STDOUT: [Redshift] Primary: Computed 12433, Re-used 0
2019-10-16 11:23:30: 0: STDOUT: [Redshift] Secondary: Computed 65536, Re-used 0
2019-10-16 11:23:39: 0: STDOUT: [Redshift] Secondary: Computed 65536, Re-used 0
2019-10-16 11:23:48: 0: STDOUT: [Redshift] Secondary: Computed 60510, Re-used 0
2019-10-16 11:23:51: 0: STDOUT: [Redshift] Primary: Computed 18509, Re-used 0
2019-10-16 11:23:58: 0: Task timeout is 5400 seconds (Regular Task Timeout)
2019-10-16 11:24:00: 0: STDOUT: [Redshift] Secondary: Computed 65536, Re-used 0
2019-10-16 11:24:09: 0: STDOUT: [Redshift] Secondary: Computed 65536, Re-used 0
2019-10-16 11:24:18: 0: STDOUT: [Redshift] Secondary: Computed 65536, Re-used 0
2019-10-16 11:24:26: 0: STDOUT: [Redshift] Secondary: Computed 65536, Re-used 0
2019-10-16 11:24:35: 0: STDOUT: [Redshift] Secondary: Computed 64139, Re-used 0
2019-10-16 11:24:36: Listener Thread - ::1 has connected
2019-10-16 11:24:36: Listener Thread - Received message: StopSlave
2019-10-16 11:24:37: Slave - slave shutdown: normal
2019-10-16 11:24:37: Info Thread - requesting slave info thread quit.
2019-10-16 11:24:37: 0: Executing plugin command of type 'Cancel Task'
2019-10-16 11:24:37: 0: Done executing plugin command of type 'Cancel Task'
2019-10-16 11:24:37: 0: Done executing plugin command of type 'Render Task'
2019-10-16 11:24:37: 0: In the process of canceling current task: ignoring exception thrown by PluginLoader
2019-10-16 11:24:37: 0: Unloading plugin: MayaBatch
2019-10-16 11:24:37: Info Thread - shutdown complete
2019-10-16 11:24:37: 0: Executing plugin command of type 'End Job'
2019-10-16 11:24:37: 0: INFO: Ending Maya Job
2019-10-16 11:24:37: Scheduler Thread - Seconds before next job scan: 2
2019-10-16 11:24:37: Scheduler - Returning all limit stubs.
2019-10-16 11:24:37: Scheduler - returning redshift_by_machine
2019-10-16 11:24:37: Scheduler - returning 5da6e7a3d563403cf4929533
2019-10-16 11:24:37: 0: INFO: Waiting for Maya to shut down
2019-10-16 11:24:37: 0: INFO: Maya has shut down
2019-10-16 11:24:37: 0: Done executing plugin command of type 'End Job'
2019-10-16 11:24:37: 0: Shutdown
2019-10-16 11:24:38: 0: Executing plugin command of type 'Cancel Task'
2019-10-16 11:24:38: 0: Done executing plugin command of type 'Cancel Task'
2019-10-16 11:24:39: 0: Exited ThreadMain(), cleaning up...
2019-10-16 11:24:40: Error occurred when checking sandbox stdout: Cannot check if stdout is available for a process that has not been launched.
2019-10-16 11:24:40: Waiting for threads to quit. 29 seconds until forced shutdown.
2019-10-16 11:24:40: Scheduler Thread - shutdown complete
2019-10-16 11:24:41: Waiting for threads to quit. 28 seconds until forced shutdown.
2019-10-16 11:24:42: Waiting for threads to quit. 27 seconds until forced shutdown.
2019-10-16 11:24:43: Slave - Final cleanup
2019-10-16 11:24:43: Listener Thread - OnConnect: Listener Socket has been closed.
2019-10-16 11:24:43: Slave - Shutdown complete

The lines that jump out at me are at 2019-10-16 11:24:36: "Listener Thread - ::1 has connected" and "Listener Thread - Received message: StopSlave".

Why is this happening?

The worker is receiving a stop command:

2019-10-16 11:24:36: Listener Thread - Received message: StopSlave

This command is likely being sent by the Deadline Launcher. I would take a look in its log for additional clues as to why the command is being sent.

Do you have Deadline Monitor > Tools > Configure Worker Scheduling or Configure Power Management set up?
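
If it helps, here is a rough Python sketch for scanning the Launcher logs for that command. It assumes the default Deadline 9 log folder on Windows (%PROGRAMDATA%\Thinkbox\Deadline9\logs) and that the Launcher log file names start with "deadlinelauncher"; both are assumptions, so adjust them for your install.

import glob
import os

# Assumed default Deadline 9 log folder on Windows; adjust for your install.
LOG_DIR = os.path.join(os.environ.get("PROGRAMDATA", r"C:\ProgramData"),
                       "Thinkbox", "Deadline9", "logs")

# Launcher log names are assumed to start with "deadlinelauncher";
# tweak the glob if yours are named differently.
for path in sorted(glob.glob(os.path.join(LOG_DIR, "deadlinelauncher*.log"))):
    with open(path, "r", errors="ignore") as handle:
        for line in handle:
            if "StopSlave" in line:
                # Print file + line so the timestamp can be matched against
                # the Slave log and anything scheduled around that time.
                print(os.path.basename(path) + ": " + line.strip())

Whatever shows up around 11:24:36 should point at what sent the stop.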

Hey @cmoore

I found out the reason for this!!
::1 is the IPv6 loopback (local) address, so when the log said ::1 I knew that something on the machine itself was sending the command, an option I didn't want to think about because it was happening out of hours.
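
For anyone else reading, a quick way to double-check that ::1 really is just the local machine is Python's standard ipaddress module:

import ipaddress

# ::1 is the IPv6 counterpart of 127.0.0.1, so a connection from it
# always originates on the machine itself.
print(ipaddress.ip_address("::1").is_loopback)        # True
print(ipaddress.ip_address("127.0.0.1").is_loopback)  # True as well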

Finding this out made me sure that it was local, so it had to be either:

  1. A ghost, or
  2. A program.

We have timed auto deployments running on PDQ. I looked at all the deployments on PDQ around that time and found that our files and folders cleaner deployment ran every time Deadline stopped.

I checked the steps of the cleaner, and one of the commands was StopSlave. We have a start-slaves command at the end, but the deployment never gets to it: it hangs and fails on the disk cleanup step, which is another problem I have to fix. For now I have commented out the disk cleanup in the script so the cleanup deployment can finish… bam! All good now. Thanks for your help on this.
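
In case anyone has the same issue, here is a rough Python sketch of the idea behind the fix. Our actual PDQ steps are not Python, and the stop/start commands below are placeholders rather than real Deadline syntax, but the point is that the restart step should run even when the cleanup step fails.

import subprocess

# Placeholder commands: substitute whatever your deployment actually runs
# to stop and start the Deadline Slave on the node.
STOP_SLAVE_CMD = ["echo", "stop-slave-step"]    # hypothetical
START_SLAVE_CMD = ["echo", "start-slave-step"]  # hypothetical

def clean_disks():
    """The flaky cleanup step; in our case this is where the deployment fails."""
    pass  # ... disk cleanup would go here ...

def run_deployment():
    subprocess.run(STOP_SLAVE_CMD, check=True)
    try:
        clean_disks()
    finally:
        # finally guarantees the Slave is started again even if the
        # cleanup step raises, so the node never stays offline.
        subprocess.run(START_SLAVE_CMD, check=True)

if __name__ == "__main__":
    run_deployment()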