Execute script when a slave has been stalled for XX minutes

anon60491448 · December 3, 2014, 12:05pm

This might be easily doable, but I couldn’t find anything in the manual.

What I want is to get Deadline Monitor to detect when a slave has been stalled for XX minutes and, if it has, execute a Python script. Any ideas on how to do this?

rrussell · December 3, 2014, 2:34pm

Do you need to wait XX minutes after the slave has been marked as stalled? The reason I ask is that in Deadline 7, we introduced a new OnSlaveStalled trigger to our event plugin system that fires when a slave is marked as stalled. So if you don’t need the delay, you can use this to respond to a stalled slave immediately.
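If the delay does turn out to matter, one possible approach is to record the time when the stall is reported and have a separate, periodically run script decide when XX minutes have passed. Only the time comparison is sketched here; the function name and the idea of a separate check are assumptions for illustration, not something Deadline provides:

```python
from datetime import datetime, timedelta


def stalled_longer_than(stalled_at, minutes, now=None):
    """Return True if at least `minutes` have passed since `stalled_at`.

    Hypothetical helper, not part of Deadline. `now` can be passed in
    explicitly, which makes the check easy to test.
    """
    if now is None:
        now = datetime.now()
    return now - stalled_at >= timedelta(minutes=minutes)
```

The stall time itself would have to be stored somewhere both scripts can reach, for example a small file on a share.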

Cheers,
Ryan

anon60491448 · December 9, 2014, 10:54am

Thanks for your reply. I guess the timer isn’t too important. However, I’ve tried to get this to work with the OnSlaveStalled trigger, but as there’s no documentation on this trigger yet, I haven’t been able to figure out how to use it properly. I’m guessing the slave that’s stalled is passed on as a string, right?

[code]def GetDeadlineEventListener():
    return StalledEventListener()

class StalledEventListener (DeadlineEventListener):

    def OnSlaveStalled(self, slave):
        pathString = "//Computer/TestShare/FolderName/"
        dirName = os.path.dirname(pathString )
        os.mkdir(dirName)[/code]

So this should create a folder on the share when a slave stalls, but nothing seems to happen. I know that the code works; I’ve tried it on other triggers. Could you elaborate a bit on how to use this trigger?

rrussell · December 9, 2014, 2:21pm

You can find the documentation for Deadline 7 here (we link to it from the same posts where you found the download links):
thinkboxsoftware.box.com/s/es28xfk1evsnfubcijlz

You’ll find the general Event Plugins documentation in the User Manual under Scripting, and you’ll find the supported triggers in the Scripting Reference documentation.

In your code, you’re missing the init constructor to hook up the OnSlaveStalled callback. You’ll also want to add the cleanup code:

[code]import os

from Deadline.Events import *

def GetDeadlineEventListener():
    return StalledEventListener()

def CleanupDeadlineEventListener( eventListener ):
    eventListener.Cleanup()

class StalledEventListener (DeadlineEventListener):

    def __init__( self ):
        # Hook the handler up to the event; without this it is never called.
        self.OnSlaveStalledCallback += self.OnSlaveStalled

    def Cleanup( self ):
        # Unhook the callback when the listener is torn down.
        del self.OnSlaveStalledCallback

    def OnSlaveStalled(self, slave):
        pathString = "//Computer/TestShare/FolderName/"
        dirName = os.path.dirname(pathString )
        os.mkdir(dirName)[/code]
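One caveat with the callback body above: os.mkdir raises an error if the folder already exists, so a second stall could fail. Below is a minimal sketch of a more forgiving version. It assumes you want one folder per stalled slave and that the slave argument is the slave's name as a string, which this thread does not confirm. The helper name make_stall_marker is made up for illustration and is not part of Deadline:

```python
import os


def make_stall_marker(base_dir, slave_name):
    """Create base_dir/slave_name if it is missing and return the path.

    Hypothetical helper, not part of Deadline. Safe to call repeatedly.
    """
    marker_dir = os.path.join(base_dir, slave_name)
    # Check first so that a repeated stall for the same slave is harmless.
    if not os.path.isdir(marker_dir):
        os.makedirs(marker_dir)
    return marker_dir
```

Inside OnSlaveStalled this could be called as make_stall_marker("//Computer/TestShare/FolderName", slave).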

Hope this helps!
Ryan

anon60491448 · December 9, 2014, 2:58pm

I had the init initially; I must have pasted the wrong code in the previous post. However, it still doesn’t work: nothing happens.

Another thing: is it possible to force a node to stall? Right now I just unplug the network and wait 10 minutes for each test. Thanks for the docs!

rrussell · December 9, 2014, 3:09pm

Is the plugin enabled? You would need this line in your event plugin’s dlinit file:

Enabled=true
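To double-check the setting from a script, something like the following could be used. It is only a sketch: it assumes the dlinit file is plain key=value text, one setting per line, and the case-insensitive matching is a convenience of this sketch rather than a statement about how Deadline reads the file. The function name is hypothetical:

```python
def dlinit_enabled(text):
    """Return True if the given dlinit text contains an Enabled=true line.

    Hypothetical helper, not part of Deadline. Assumes key=value lines.
    """
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        # Only lines that contain "=" and whose key is "Enabled" count.
        if sep and key.strip().lower() == "enabled":
            return value.strip().lower() == "true"
    return False
```

Read the file yourself and pass its contents in, for example dlinit_enabled(open(path).read()).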

Unfortunately, there is no way to quickly test this. We want to eventually develop a testing environment for Deadline’s plugins, but it’s one of those things that hasn’t made its way on to a roadmap yet.

In the meantime, you could reduce the stalled slave interval to 5 minutes instead of the default 10 minutes. This can be found in the Slave Settings page in the Repository Options.

Cheers,
Ryan

anon60491448 · December 9, 2014, 3:38pm

Yes, it is enabled. As I said, it works with other events, just not with OnSlaveStalled. Could it be that the build is too old? (7.0.0.39 R)

Reducing the stalled slave interval helps, though it won’t really be a problem once I get the event to actually trigger when a slave stalls. I’ve zipped and attached the folder with the files. I’d appreciate it if you could have a look.
StalledSlaves.zip (1.89 KB)

rrussell · December 9, 2014, 3:53pm

I dropped your plugin in our repository, and it triggered for us.

The OnSlaveStalled event was added back in 7.0.0.33, so your version should be fine.

Do you guys run Pulse? If so, can you ensure that Pulse is the correct version?

Cheers,
Ryan

anon60491448 · December 9, 2014, 4:04pm

That’s odd. Could it be that the event doesn’t trigger if the slave has lost contact? That you have to overload it instead of unplugging the network cable?

Yeah Pulse is running the same version.

rrussell · December 9, 2014, 6:11pm

The event is triggered by the application (Pulse in this case) that marked the slave as stalled, so it doesn’t matter if the slave itself can connect to the DB or not.

In the Pulse log during a Repository Repair operation, you should see something like this:

2014-12-09 12:09:21: Performing Stalled Slave Scan...
2014-12-09 12:09:21: Stalled Slave Scan - Loading slave states
2014-12-09 12:09:21: Stalled Slave Scan - Loaded 20 slave states in 7.006 ms
2014-12-09 12:09:21: Stalled Slave Scan - Scanning slave states
2014-12-09 12:09:21: Stalled Slave Scan - Cleaned up 2 stalled slaves in 26.019 ms
2014-12-09 12:09:21: Stalled Slave Scan - Done.

Maybe look for this in the Pulse log and see if there were any error messages from your event plugin.
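If the log is long, a small script can pull out the scan summaries for you. This is a rough sketch that assumes the "Cleaned up N stalled slaves" wording shown in the sample above stays the same across builds. The function name is made up for illustration:

```python
import re

# Pattern based on the sample line
# "Stalled Slave Scan - Cleaned up 2 stalled slaves in 26.019 ms".
CLEANED_UP = re.compile(r"Stalled Slave Scan - Cleaned up (\d+) stalled slaves?")


def stalled_counts(log_text):
    """Return the stalled-slave count from each scan summary in log_text."""
    return [int(m.group(1)) for m in CLEANED_UP.finditer(log_text)]
```

A scan that reports zero every time would suggest Pulse never marked the slave as stalled, so the event had nothing to fire on.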

I should mention that we are testing with our internal build of RC5, so it’s very possible something was fixed between your version and ours, but we weren’t aware of any bugs related to this. RC5 should be available this week, so maybe you can upgrade to it after it is released and let us know if you still have this problem.

Cheers,
Ryan