Auto Restart Bad Slaves?

We’ve been having trouble with slaves hanging after successfully completing several tasks. Simply restarting the slave always seems to clear the problem. Is there a way to auto restart a slave after it accumulates the number of errors required to label it as “bad?” So far, I’ve only been able to do it manually through the remote control scripts. If it’s not already built in somehow, can I set up some kind of trigger that will run that script? I’m hoping I’m just missing something.

I should also mention that these slaves aren’t “stalled,” they just aren’t making any progress (the renderer isn’t sending any updates) and eventually get timed out, either by the auto task timeout or by a manually set max task render time. If it were straightforward stalling, I think the Repository Auto Configuration option to “restart slave if it stalls” would do the trick.

Ideally, we’ll figure out what’s causing the hang, but in the meantime, a simple slave auto restart would save an awful lot of headaches. Sometimes I’ll check on a job and find half the slaves have gone bad, wasting several hours of render time. I’ve been setting my alarm to wake me up at 2:00 a.m. just so I can check for bad slaves.

We are running Deadline 7.121 with the servers on Linux. The slaves are running on a mix of local Mac OS machines and remote Linux virtual machines on Amazon. The problem jobs happen to be MayaBatch renders, if that makes any difference.

Thanks,

  • Frank

Hey Frank,

The only built-in automatic thing we have that’s close to this would be the Machine Restart feature of Power Management:
docs.thinkboxsoftware.com/produc … ne-restart

However, obviously, that restarts the entire Machine, not just the Slave. It’s also solely based on time, and does not look at errors generated by the Slaves or anything like that.

If you want to do a scripting solution, you could look at creating an Event Plugin for this (docs here: docs.thinkboxsoftware.com/produc … ugins.html). We don’t have an ‘onSlaveError’ callback, though, so you might have to use the ‘OnSlaveStartingJobCallback’, ‘OnSlaveRenderingCallback’, or ‘OnSlaveIdleCallback’ and do some checks in there to see if the Slave has been accumulating lots of errors.

Here’s a quick snippet of what that might look like:

from Deadline.Events import *
from Deadline.Scripting import *

def GetDeadlineEventListener():
    return MyEvent()

def CleanupDeadlineEventListener( deadlinePlugin ):
    deadlinePlugin.Cleanup()

class MyEvent( DeadlineEventListener ):
    def __init__( self ):
        self.OnSlaveStartingJobCallback += self.OnSlaveStartingJob

    def Cleanup( self ):
        del self.OnSlaveStartingJobCallback

    def OnSlaveStartingJob( self, slaveName ):
        # Check to see how many Tasks the Slave has failed in this session (FailedTasks should reset on slave restart)
        slaveInfo = RepositoryUtils.GetSlaveInfo( slaveName, False )
        failedTasks = slaveInfo.SlaveFailedTasks

        if failedTasks > 50:
            # Restart the slave
            ClientUtils.LogText( "Restarting Slave due to error accumulation > 50" )
            SlaveUtils.SendRemoteCommand( slaveInfo.MachineRealName, "RelaunchSlave" )

Disclaimer: I’ve not actually run the above code, so there might be typos in there, but it should at least give you a good idea of how to proceed if you want to go this route. You could also make a similar script that loops through all the current slaves, without going through the trouble of making an Event Plugin, and just periodically run it through “DeadlineCommand -executescript <path/to/script.py>” via a cron job/scheduled task.
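
For the cron-driven variant, a rough sketch might look something like the following. It is equally untested against a real Repository. It reuses the calls from the snippet above (GetSlaveInfo, SlaveFailedTasks, SendRemoteCommand with "RelaunchSlave") and assumes that RepositoryUtils.GetSlaveNames( invalidateCache ) is available to list the slaves, and that DeadlineCommand calls a function named __main__ in the script. The helper select_slaves_to_restart and the FAILED_TASK_THRESHOLD constant are made-up names for illustration, so adjust to taste.

```python
# Sketch only: intended to be run periodically via
#   DeadlineCommand -executescript <path/to/script.py>

FAILED_TASK_THRESHOLD = 50  # same arbitrary cutoff as the event plugin snippet


def select_slaves_to_restart(failed_counts, threshold=FAILED_TASK_THRESHOLD):
    """Return the sorted names of slaves whose failed-task count exceeds threshold.

    failed_counts maps slave name -> tasks failed in the current session.
    """
    return sorted(name for name, failed in failed_counts.items() if failed > threshold)


def __main__(*args):
    # Imported here so the selection logic above can be exercised without Deadline.
    from Deadline.Scripting import ClientUtils, RepositoryUtils, SlaveUtils

    infos = {}
    # ASSUMPTION: GetSlaveNames( invalidateCache ) returns every slave name.
    for slave_name in RepositoryUtils.GetSlaveNames(True):
        infos[slave_name] = RepositoryUtils.GetSlaveInfo(slave_name, True)

    failed_counts = dict((name, info.SlaveFailedTasks) for name, info in infos.items())

    for slave_name in select_slaves_to_restart(failed_counts):
        ClientUtils.LogText("Restarting %s due to error accumulation" % slave_name)
        SlaveUtils.SendRemoteCommand(infos[slave_name].MachineRealName, "RelaunchSlave")
```

Keeping the threshold check in its own function means you can try different cutoffs on a plain dictionary of counts before pointing the script at live slaves.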

Cheers,
Jon

Thanks, Jon, that’s at least a good place to start.

Might there be a script that controls the “Restart Slave If It Stalls” function that’s built into the Configure Repository Options, Auto Configuration GUI that I could also reverse-engineer? I’m hoping I can start with that, then swap out the “stalled” status check with the equivalent of “failed increment >5” to force the restart. I’ve been looking through the various script subfolders in the Deadline Repository directory, but haven’t had any luck finding it.

I’m a total duct-tape-and-baling-wire script writer, piecing together stuff that mostly works. It’s not pretty. Thank goodness nearly every function in Maya is run from an accessible script. If it’s beyond me, I’ll have to try to get one of our real coder guys to tackle it.

Thanks,
Frank

Unfortunately, that stuff isn’t scripted; the Stalled checks and restart operations for that feature are built into the Launcher application itself.