As of 7.1, it seems that the slave doesn’t always kill its child process tree properly when it shuts down in the middle of a running task. We are seeing instances of this more or less daily.
Today, I have logs showing the slave starting a task (a Maya render) around 8:01 and receiving a shutdown command at 8:05. However, instead of shutting down cleanly, the machine is left with no slave running but with an orphaned process tree, and Maya is still rendering as if nothing had happened (Maya is the 2nd or 3rd process in our execution tree). Interestingly, the log of the slave shutting down is in a separate log file from the task it was working on at the time.
Just to throw in a “me too”: I frequently see orphaned mayabatch processes left running after a task is done and the slave is idle, or after the slave has been forced to shut down or restart remotely. Those orphaned mayabatch processes usually sit on a single processor running at 100%. If a slave has had several remote restarts, there may be several mayabatch processes, each one eating up an entire processor.
It seems to be a problem on our OS X machines, but not on our Linux machines.
Deadline 7.1.2.1, OS X 10.10.3
Well, we’re supposed to walk the process tree and kill everything from the Slave down by sending SIGTERM and then SIGHUP if that times out.
There used to be problems with older versions of Deadline on OS X, but our process walking code should be fixed in 7.1.
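For anyone following along, the general shape of that logic (just an illustrative sketch using psutil, not our actual implementation) looks something like this:

import signal
import psutil  # used here purely for illustration

def stop_process(proc, grace=30):
    # proc is a psutil.Process somewhere in the Slave's tree.
    proc.send_signal(signal.SIGTERM)      # ask it to exit cleanly
    try:
        proc.wait(timeout=grace)          # grace period before escalating
    except psutil.TimeoutExpired:
        proc.send_signal(signal.SIGHUP)   # escalation signal, as described above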
I did some poking around, and it looks like if an app crashes or closes, there’s no guarantee that its children will die. I want to think we set up process groups on Windows so when the Slave exits it’s handled that way.
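On the POSIX side, the closest equivalent would be launching the render in its own session/process group so the whole group can be signalled in one shot, even if an intermediate parent has already exited. A rough sketch of that idea (the mayabatch command line is made up, and this isn’t what Deadline currently does):

import os
import signal
import subprocess

# Launch the render pipeline in its own session, so everything it spawns
# shares one process group separate from the Slave's.
child = subprocess.Popen(["mayabatch", "-file", "scene.mb"],  # hypothetical command line
                         start_new_session=True)

# Later, on shutdown: signal the whole group, not just the direct child.
try:
    os.killpg(os.getpgid(child.pid), signal.SIGTERM)
except ProcessLookupError:
    pass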
Can one or both of you send me a relevant Slave log to support@thinkboxsoftware.com so we can see if it’s trying to kill things properly? I may need to put some debug code into our process-exit path to find out whether we’re walking that tree properly and sending the right signals.
Sure, but there’s nothing useful in them, so I’ll just share them here instead of emailing.
The first log from that day (deadlineslave-ws-023-2015-07-30-0000.log) ends with output from a render task that was in progress. The last timestamp is 2015-07-30 08:02:03.
After that there are two log files that look mostly identical and were written at basically the same time:
deadlineslave-ws-023-2015-07-30-0001.log
2015-07-30 08:05:34: BEGIN - ws-023.luma.mel\qmaster
2015-07-30 08:05:34: Deadline Slave 7.1 [v7.1.2.1 R (53013d5bd)]
2015-07-30 08:05:35: Slave - Final cleanup
2015-07-30 08:05:35: Slave - Shutdown complete
deadlineslave-ws-023-2015-07-30-0002.log
2015-07-30 08:05:35: BEGIN - ws-023.luma.mel\dulshank
2015-07-30 08:05:35: Deadline Slave 7.1 [v7.1.2.1 R (53013d5bd)]
2015-07-30 08:05:35: Slave - Final cleanup
2015-07-30 08:05:35: Slave - Shutdown complete
Shouldn’t that be SIGKILL instead of SIGHUP?
Also, are you enumerating the children all at once, or doing it iteratively as you kill them? If it’s the latter, isn’t there the possibility that the child’s children will be re-parented before you can discover them?
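To illustrate what I mean, here’s the ordering I’d expect, sketched with psutil (purely illustrative; I obviously don’t know what Deadline does internally): enumerate the whole tree before the first signal goes out, then SIGKILL whatever survives the grace period.

import signal
import psutil

def shutdown_tree(root_pid, grace=30):
    root = psutil.Process(root_pid)
    # Snapshot the entire tree *before* sending any signals; once a parent
    # exits, its children get re-parented (to launchd/init) and drop out of
    # any later walk -- which is exactly how orphans get left behind.
    procs = [root] + root.children(recursive=True)
    for p in procs:
        try:
            p.terminate()              # SIGTERM
        except psutil.NoSuchProcess:
            pass
    _, alive = psutil.wait_procs(procs, timeout=grace)
    for p in alive:
        try:
            p.kill()                   # SIGKILL -- can't be caught or ignored
        except psutil.NoSuchProcess:
            pass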
For the logs, I’m looking for whether Deadline is trying to kill anything; it should print out a countdown timer if I’m remembering correctly. All I’ve got from these is that they start and stop… That leads me to believe the Slaves think another Slave is already running, thanks to the slave.ini file in the Slave data folder (“/var/lib/Thinkbox/Deadline7” on Linux, if others are wondering). It’ll close immediately in that case.
The slave is being stopped by running deadlineslave -s, so as I understand it, those two short logs are from the external slave processes sending shutdown commands to the running slave. However, the actual slave log doesn’t have any information about receiving a shutdown command, killing processes, etc., so it would seem there’s either a logging problem or a logic problem in the shutdown code.
Well, the logging for closing processes and so on would be in the log of the Slave that’s doing the rendering. It should say something about receiving a command to exit. Testing now.
Update:
Here’s what I see
2015-08-06 11:29:32: Scheduler Thread - Seconds before next job scan: 1
2015-08-06 11:29:33: Scheduler - Performing Job scan on Primary Pools with scheduling order Pool, Priority, First-in First-out
2015-08-06 11:29:33: Scheduler - Preliminary check: Slave is not whitelisted for 54f617a3bb68052c6c282729 limit.
2015-08-06 11:29:33: Scheduler - Preliminary check: Slave is not whitelisted for 55b706064da76f1a70751b7f limit.
2015-08-06 11:29:33: Scheduler - Job chooser found no jobs.
2015-08-06 11:29:33: Scheduler - Performing Job scan on Secondary Pools with scheduling order Pool, Priority, First-in First-out
2015-08-06 11:29:33: Scheduler - Job chooser found no jobs.
2015-08-06 11:29:34: Scheduler Thread - Seconds before next job scan: 1
2015-08-06 11:29:35: Listener Thread - ::ffff:192.168.19.178 has connected
2015-08-06 11:29:35: Listener Thread - Received message: StopSlave
2015-08-06 11:29:35: Slave - slave shutdown: normal
2015-08-06 11:29:35: Info Thread - requesting slave info thread quit.
2015-08-06 11:29:35: Scheduler Thread - shutdown complete
2015-08-06 11:29:35: Loading event plugin Draft (/mnt/dev7/events/Draft)
2015-08-06 11:29:35: Loading event plugin Shotgun (/mnt/dev7/events/Shotgun)
2015-08-06 11:29:36: Waiting for threads to quit. 29 seconds until forced shutdown.
2015-08-06 11:29:37: Waiting for threads to quit. 28 seconds until forced shutdown.
2015-08-06 11:29:38: Waiting for threads to quit. 27 seconds until forced shutdown.
2015-08-06 11:29:39: Info Thread - shutdown complete
2015-08-06 11:29:39: Waiting for threads to quit. 26 seconds until forced shutdown.
2015-08-06 11:29:40: Slave - Final cleanup
2015-08-06 11:29:40: Listener Thread - OnConnect: Listener Socket has been closed.
2015-08-06 11:29:40: Slave - Shutdown complete
Now, Deadline 7.2 has a fix for a thread locking issue that would eat extra time on concurrent tasks. I’m expecting to see something interesting from either of you folks after the “2015-08-06 11:29:35: Listener Thread - ::ffff:192.168.19.178 has connected” line, maybe the counter going down to zero, showing that we’re not closing things the right way.
I have none of the shutdown-related output (“Listener Thread”, etc.) in my slave log. The last line is a STDOUT line from the task it was running at the time.
If the Slave isn’t doing the countdown, it may be being killed in a not-so-nice way. I’m going to try to reproduce this here. How exactly is the Slave being restarted? Is it a shell script, or a signal from kill?
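The reason I ask: a shutdown command or SIGTERM gives the Slave a chance to run its cleanup (the countdown above), whereas kill -9 sends SIGKILL, which can’t be caught, so none of that cleanup ever runs and the render children are simply orphaned. Roughly (illustrative Python, not our code):

import signal
import sys

def on_term(signum, frame):
    # Clean shutdown path: this is where the child-process cleanup and the
    # "Waiting for threads to quit" countdown would happen.
    sys.exit(0)

signal.signal(signal.SIGTERM, on_term)   # runs on a normal kill / shutdown command
# SIGKILL (kill -9) can never be delivered to a handler, so the cleanup above
# is skipped entirely and any running render processes are left behind.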