AWS Thinkbox Discussion Forums

slave shutdown on network boot machines

I’m seeing a problem where nodes don’t register as offline after I shut them down via the Monitor (right click the machine name -> Remote Control -> Shutdown Machine). The actual shutdown happens, so the machine is off, but it still registers as online, and eventually as Stalled once it times out.

These are diskless Linux machines, network booting from a shared root, so maybe that has something to do with it? Shutting down one node works fine: the machine turns off and registers as offline in the Monitor. But the second, third, etc. only shut down the hardware; they never register as offline in the Monitor.

Maybe a bug, maybe a problem with my configuration?

It sounds like something is preventing the slaves from shutting down cleanly. Normally when a slave is stopped, one of the last things it does is write its state to the database indicating that it is offline. However, if the slave is killed before that happens (or something else prevents the writing from occurring), then it would still show as online.

The next time this happens, can you start the machine up again and grab the last two launcher logs and the last two slave logs (the most recent of each)? On Linux, you’ll find the logs in /var/log/Thinkbox/Deadline6. Hopefully these logs will show what happened during shutdown.
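If it helps, something like the following (just one way to do it, using the default log location above) will list the newest files so you can pick out the last two launcher and slave logs:

$ ls -lt /var/log/Thinkbox/Deadline6 | head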

Thanks!

  • Ryan

Looks like it might be that the launcher on all nodes is getting shut down along with the first? In this case, node1 shut down properly and registered as such in the Monitor. Nodes 2 & 3 shut down but didn’t register in the Monitor. Only node1 has a launcher log from today’s shutdown; all nodes have slave logs.

Here’s the launcher log from node1:

2013-05-07 09:31:30:  ::ffff:10.10.1.50 has connected
2013-05-07 09:31:30:  Launcher Thread - Received command: ShutdownMachine
2013-05-07 09:31:30:  Sending command to slave: StopSlave 
2013-05-07 09:31:30:  Got reply: node1.hypothetical.cgi: Sent "StopSlave" command. Result: ""

slave log from node1:

..... a bunch of maxwell writing logs.....

2013-05-07 09:11:24:  0: STDOUT: maxwell:[INFO] MXI successfully renamed.
2013-05-07 09:11:30:  0: STDOUT: maxwell:[INFO] Start writing image file...
2013-05-07 09:11:32:  0: STDOUT: maxwell:[INFO] End writing image file...
2013-05-07 09:11:32:  0: STDOUT: maxwell:[INFO] Benchmark of 122.084. Time: 20h20m32s. SL of 16.31
2013-05-07 09:11:32:  0: STDOUT: maxwell:[INFO] Time left: 3h39m36s
2013-05-07 09:21:12:  Scheduler Thread - Cancelling task because task "0_0-0" could not be found
2013-05-07 09:21:12:  Scheduler Thread - The task has either been changed externally (and requeued), or the Job has been deleted.
2013-05-07 09:21:13:  0: In the process of canceling current task: ignoring exception thrown by PluginLoader
2013-05-07 09:21:14:  Scheduler Thread - In the process of canceling current tasks: ignoring exception thrown by render thread 0
2013-05-07 09:31:31:  Info Thread - requesting slave info thread quit.
2013-05-07 09:31:31:  Info Thread - shutdown complete
2013-05-07 09:31:31:  Scheduler Thread - shutdown complete

slave log from node2:

..... a bunch of maxwell writing logs.....


2013-05-07 09:10:58:  0: STDOUT: maxwell:[INFO] MXI successfully renamed.
2013-05-07 09:11:03:  0: STDOUT: maxwell:[INFO] Start writing image file...
2013-05-07 09:11:05:  0: STDOUT: maxwell:[INFO] End writing image file...
2013-05-07 09:11:05:  0: STDOUT: maxwell:[INFO] Benchmark of 124.260. Time: 21h56m58s. SL of 16.54
2013-05-07 09:11:05:  0: STDOUT: maxwell:[INFO] Time left: 2h03m09s
2013-05-07 09:21:21:  Scheduler Thread - Cancelling task because task "0_0-0" could not be found
2013-05-07 09:21:21:  Scheduler Thread - The task has either been changed externally (and requeued), or the Job has been deleted.
2013-05-07 09:21:22:  0: In the process of canceling current task: ignoring exception thrown by PluginLoader
2013-05-07 09:21:23:  Scheduler Thread - In the process of canceling current tasks: ignoring exception thrown by render thread 0

slave log from node3:

..... a bunch of maxwell writing logs.....

2013-05-07 09:14:08:  0: STDOUT: maxwell:[INFO] MXI successfully renamed.
2013-05-07 09:14:14:  0: STDOUT: maxwell:[INFO] Start writing image file...
2013-05-07 09:14:15:  0: STDOUT: maxwell:[INFO] End writing image file...
2013-05-07 09:14:15:  0: STDOUT: maxwell:[INFO] Benchmark of 342.069. Time: 19h56m37s. SL of 18.80
2013-05-07 09:14:15:  0: STDOUT: maxwell:[INFO] Time left: 4h03m29s
2013-05-07 09:21:17:  Scheduler Thread - Cancelling task because task "0_0-0" could not be found
2013-05-07 09:21:17:  Scheduler Thread - The task has either been changed externally (and requeued), or the Job has been deleted.
2013-05-07 09:21:18:  0: In the process of canceling current task: ignoring exception thrown by PluginLoader
2013-05-07 09:21:19:  Scheduler Thread - In the process of canceling current tasks: ignoring exception thrown by render thread 0

Sorry, I just realized this was for network boot machines. That must have something to do with this strange behavior.

Is it possible that when one of these network boot instances shuts down, they all do?

Also, another thing to check is whether MultipleSlavesEnabled is set to true or false in your system deadline.ini file (if the line isn’t there, the default is true). Here is the location of this file on each platform (an example of the setting itself follows the list):
Mac: /Users/Shared/Thinkbox/Deadline6/deadline.ini
Linux: /var/lib/Thinkbox/Deadline6/deadline.ini
Windows: %PROGRAMDATA%\Thinkbox\Deadline6\deadline.ini
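If the line is missing, it would go under the [Deadline] section; for example (the false value here is just for illustration, since the actual value is what we want to confirm):

[Deadline]
MultipleSlavesEnabled=false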

Cheers,

  • Ryan

That’s the strange part - each powers down fine, without taking any other machines with it.

I was missing that line in my deadline.ini, so I’ll add it and give it a try. Is that a file I need to update after upgrades to restore the settings?

Cool. Let us know if it helps. If it doesn’t, we’ll have to do some more digging.

You shouldn’t have to update this file again after doing an upgrade. The client installer only updates the settings it needs to in this file, and since it doesn’t touch the MultipleSlavesEnabled setting, that value should never change.

Cheers,

  • Ryan

That didn’t work either. Here’s my deadline.ini from /var/lib/Thinkbox/Deadline6:

All the nodes log in with the same user; does that matter?

[Deadline]
NetworkRoot=/srv/deadline6_repository
LicenseServer=nexus
LauncherListeningPort=17060
AutoConfigurationPort=17061
SlaveDataRoot=
LaunchSlaveAtStartup=false
MultipleSlavesEnabled=false

Thanks for your help.

Thanks for trying that. Can you check this folder on your system:

/var/lib/Thinkbox/Deadline6/slaves

Do you see a folder in here for every machine, or do you just see one? Even with MultipleSlavesEnabled set to false, I wonder if the launcher is going through every machine and killing the slave somehow…
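A quick way to check from one of the nodes (same path as above):

$ ls -la /var/lib/Thinkbox/Deadline6/slaves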

We may end up having to set up a net boot environment here to try and reproduce the problem. I’ll talk to our systems guy to see if we can get something set up.

I don’t see any folders there actually.

There is a single folder in the user home folder though:

~/Thinkbox/Deadline6/slave/node3

After revisiting this with a fresh install (for other reasons), I think I may have found the cause, but I’m not sure what to do about it. I can only shut down the slave (via the Monitor) for the machine that most recently started, and when I try to shut down the slave from the command line on any other node, I get this message:

$ ./deadlineslave -shutdown
Checking slave startup lock file: /var/lib/Thinkbox/Deadline6/slaves/.lock
Slave startup file is unlocked
Checking port 51938 for slave

The lock file it refers to is empty, but there’s a .ini file that has this in it:

$ cat .ini
[Deadline]
SlaveProcessID=1544
SlaveListeningPort=51938

And on the machine I was trying to shut down, the slave PID is 1541, not 1544. So I’m guessing the last slave to start up overwrites any other slave’s info, since this is a shared directory between all the slaves.
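(For anyone checking the same thing on their own nodes: something along these lines will show the running slave PID to compare against the .ini; pgrep is just one way to find it, not anything Deadline-specific.)

$ pgrep -f deadlineslave
1541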

Now of course I have no idea what to do about it :confused:

Ah, yes, that would totally explain the behavior you’re seeing. Is it possible to have the /var/lib/Thinkbox/Deadline6/slaves folder not use shared storage?

I think I could mount it as a ramdisk so it wouldn’t be shared.

Is there much data saved to that directory - like temporary job files that would take up multiple gigs of space? Does it matter if it is reset to empty after each boot (since the ramdisk would be cleared on each reboot)?

Very little is saved to that specific “slaves” folder. The job files are written somewhere else. Also, clearing that folder on reboot shouldn’t cause any issues.

oh yeah! That did it. :smiley:

So for anyone else with this setup: just mount a tmpfs file system to /var/lib/Thinkbox/Deadline6/slaves, reboot and everything should be happy again.
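For example, an /etc/fstab entry along these lines (the size is just a guess on my part, since almost nothing gets written there), or the equivalent one-off mount command:

tmpfs   /var/lib/Thinkbox/Deadline6/slaves   tmpfs   defaults,size=64m   0 0

$ sudo mount -t tmpfs -o size=64m tmpfs /var/lib/Thinkbox/Deadline6/slaves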

Great, glad it worked, and thanks for sharing how to get it working!
