Hi there. I searched around and found a few places asking about this but they were mostly years old so I thought I’d check in.
I’m trying to make a script that runs in response to failing tasks and adjusts a Job’s settings to reduce load on the farm. The intended workflow:
- Task fails
- Script checks the job’s reports to see what’s causing the failures
- If enough resource-based failures have occurred, reduce the “Concurrent Tasks” value or the “Machine Limit” value (see the sketch just below for the kind of adjustment I mean)
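For concreteness, this is the kind of adjustment I have in mind. It’s a minimal sketch: I’m assuming Deadline’s RepositoryUtils.SaveJob and the Job.JobConcurrentTasks property, and MIN_CONCURRENT_TASKS is a made-up constant.

from Deadline.Scripting import RepositoryUtils

MIN_CONCURRENT_TASKS = 1

def reduce_job_load( job ):
    # Halve how many tasks one Worker may pick up at once, clamped so
    # the job can still render at all.
    new_limit = max( MIN_CONCURRENT_TASKS, job.JobConcurrentTasks // 2 )
    if new_limit != job.JobConcurrentTasks:
        job.JobConcurrentTasks = new_limit
        RepositoryUtils.SaveJob( job )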
However, it seems that all the event plugins only run in response to job events, not task events. If that’s the case, I have only two options:
1. Respond to job failure instead.
2. Set up a post-task script that runs the check every time a task completes.
My main concern with option 2 is pretty simple: the check would run constantly, even when nothing whatsoever is going wrong. Every completed task would need to run this piece of code (reworked for tasks):
# Lives in our DeadlineEventListener subclass; the plugin file needs
# "from Deadline.Scripting import RepositoryUtils" at the top.
def OnJobFailed( self, job ):
    self.LogInfo( "Job failed, checking error log." )
    # Make sure we have the latest job info by pulling the error
    # reports straight from the repository.
    report_collection = RepositoryUtils.GetJobReports( job.JobId )
    reports = report_collection.GetErrorReports()
    self.LogInfo( "Found reports:" )
    resource_errors = 0
    resource_reports = []
    for report in reports:
        self.LogInfo( report.ReportMessage )
        # Only count the messages we recognise as resource failures.
        if report.ReportMessage in self.ERROR_FLAGS:
            resource_errors += 1
            resource_reports.append( report )
    if resource_errors <= self.ERROR_LIMIT:
        self.LogInfo( "Too few errors to take action." )
        return
    # ...otherwise reduce the job's settings here.
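For comparison, here’s roughly how option 2 would host the same check in a post-task script. This is a sketch based on my reading of the job script docs: I’m assuming the __main__( *args ) entry point that receives the DeadlinePlugin object, and count_resource_errors() and ERROR_LIMIT are hypothetical stand-ins for the counting loop and threshold above.

from Deadline.Scripting import RepositoryUtils

def __main__( *args ):
    deadlinePlugin = args[0]
    job = deadlinePlugin.GetJob()
    reports = RepositoryUtils.GetJobReports( job.JobId ).GetErrorReports()
    # count_resource_errors() would be the same flag-matching loop as
    # above, factored out so both entry points can share it.
    if count_resource_errors( reports ) > ERROR_LIMIT:
        # ...reduce "Concurrent Tasks" / "Machine Limit" here.
        pass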
I don’t know the internals well enough to judge the performance, but is this a negligible workload? Even if it only takes a few seconds, running it on every single task feels like it would build up a lot of delay. And we definitely have some slow interactions with our repository (though admittedly the ones I’m thinking of involve communicating externally through the command line). If the load really is negligible, I can just try it out.
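If it isn’t negligible, a throttle along these lines might cap the cost. A rough sketch, nothing Deadline-specific about it: the interval is arbitrary, and the dictionary only persists per process, so it just limits how often the same Worker repeats the query.

import time

CHECK_INTERVAL_SECONDS = 60
_last_checked = {}

def should_run_check( job_id ):
    # Skip the expensive report query if this job was checked recently.
    now = time.time()
    if now - _last_checked.get( job_id, 0.0 ) < CHECK_INTERVAL_SECONDS:
        return False
    _last_checked[job_id] = now
    return True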
My concerns with option 1 are a little more complicated. When a job fails, all of its other running tasks are interrupted, even mid-render, which wastes work that might have completed successfully once my fix kicked in. I would also need to set the error limit very low for it to be as responsive as checking on individual task failures would be.
Note that my own script does have a section that responds to a “number of errors” threshold, but that’s because I’m specifically checking against certain error messages; there are other errors, unrelated to resources, that this script needs to ignore for its purposes.
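For what it’s worth, this is roughly how I picture that matching. Illustrative only: the flag strings are made up, and I’m assuming substring matching here because full report messages usually carry job-specific paths and frame numbers.

ERROR_FLAGS = (
    "out of memory",
    "could not allocate",
)

def is_resource_error( message ):
    # A report counts as resource-related if any known flag appears in it.
    lowered = message.lower()
    return any( flag in lowered for flag in ERROR_FLAGS )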
To sum up, I’m interested in advice/arguments about why I should choose one of the two solutions above (and maybe ways to address my concerns with them). But also, is there a reason there aren’t task-based events? It seems like a missing piece of customisation that isn’t adequately addressed by pre/post-job scripts, for the reasons I listed above (but I’m happy to be proven wrong there).