Hi there. I searched around and found a few places asking about this but they were mostly years old so I thought I’d check in.
I’m trying to make a script that runs in response to failing tasks and adjusts a Job’s settings to reduce load on the farm. The intended workflow:
- Task fails
- Script checks the job’s reports to see what’s causing the failures
- If enough resource-based failures have occurred, reduce the “Concurrent Tasks” value or the “Machine Limit” value (see the sketch just below for the kind of adjustment I mean)
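For concreteness, this is the kind of adjustment I have in mind. It’s a minimal sketch: I’m assuming Deadline’s RepositoryUtils.SaveJob and the Job.JobConcurrentTasks property, and MIN_CONCURRENT_TASKS is a made-up constant.

from Deadline.Scripting import RepositoryUtils

MIN_CONCURRENT_TASKS = 1

def reduce_job_load( job ):
    # Halve how many tasks one Worker may pick up at once, clamped so
    # the job can still render at all.
    new_limit = max( MIN_CONCURRENT_TASKS, job.JobConcurrentTasks // 2 )
    if new_limit != job.JobConcurrentTasks:
        job.JobConcurrentTasks = new_limit
        RepositoryUtils.SaveJob( job )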
However, it seems that all the event plugins only run in response to job events, not task events. If that’s the case, I have only two options:
1. Respond to job failure instead.
2. Set up a post-task script that runs the check every time a task completes.
My main concern with option 2 is pretty simple: the check would run constantly, even when nothing whatsoever is going wrong. Every completed task would need to run this piece of code (reworked for tasks):
# Lives in our DeadlineEventListener subclass; the plugin file needs
# "from Deadline.Scripting import RepositoryUtils" at the top.
def OnJobFailed( self, job ):
    self.LogInfo( "Job failed, checking error log." )
    # Make sure we have the latest job info by pulling the error
    # reports straight from the repository.
    report_collection = RepositoryUtils.GetJobReports( job.JobId )
    reports = report_collection.GetErrorReports()
    self.LogInfo( "Found reports:" )
    resource_errors = 0
    resource_reports = []
    for report in reports:
        self.LogInfo( report.ReportMessage )
        # Only count the messages we recognise as resource failures.
        if report.ReportMessage in self.ERROR_FLAGS:
            resource_errors += 1
            resource_reports.append( report )
    if resource_errors <= self.ERROR_LIMIT:
        self.LogInfo( "Too few errors to take action." )
        return
    # ...otherwise reduce the job's settings here.
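For comparison, here’s roughly how option 2 would host the same check in a post-task script. This is a sketch based on my reading of the job script docs: I’m assuming the __main__( *args ) entry point that receives the DeadlinePlugin object, and count_resource_errors() and ERROR_LIMIT are hypothetical stand-ins for the counting loop and threshold above.

from Deadline.Scripting import RepositoryUtils

def __main__( *args ):
    deadlinePlugin = args[0]
    job = deadlinePlugin.GetJob()
    reports = RepositoryUtils.GetJobReports( job.JobId ).GetErrorReports()
    # count_resource_errors() would be the same flag-matching loop as
    # above, factored out so both entry points can share it.
    if count_resource_errors( reports ) > ERROR_LIMIT:
        # ...reduce "Concurrent Tasks" / "Machine Limit" here.
        pass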
I don’t know the internals well enough to judge the performance, but is this a negligible workload? Even if it only takes a few seconds, running it on every single task feels like it would build up a lot of delay. And we definitely have some slow interactions with our repository (though admittedly the ones I’m thinking of involve communicating externally through the command line). If the load really is negligible, I can just try it out.
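If it isn’t negligible, a throttle along these lines might cap the cost. A rough sketch, nothing Deadline-specific about it: the interval is arbitrary, and the dictionary only persists per process, so it just limits how often the same Worker repeats the query.

import time

CHECK_INTERVAL_SECONDS = 60
_last_checked = {}

def should_run_check( job_id ):
    # Skip the expensive report query if this job was checked recently.
    now = time.time()
    if now - _last_checked.get( job_id, 0.0 ) < CHECK_INTERVAL_SECONDS:
        return False
    _last_checked[job_id] = now
    return True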
My concerns with option 1 are a little more complicated. When a job fails, all of its other running tasks are interrupted, even mid-render, which wastes work that might have completed successfully once my fix kicked in. I would also need to set the error limit very low for it to be as responsive as checking on individual task failures would be.
Note that my own script does have a section that responds to a “number of errors” threshold, but that’s because I’m specifically checking against certain error messages; there are other errors, unrelated to resources, that this script needs to ignore for its purposes.
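For what it’s worth, this is roughly how I picture that matching. Illustrative only: the flag strings are made up, and I’m assuming substring matching here because full report messages usually carry job-specific paths and frame numbers.

ERROR_FLAGS = (
    "out of memory",
    "could not allocate",
)

def is_resource_error( message ):
    # A report counts as resource-related if any known flag appears in it.
    lowered = message.lower()
    return any( flag in lowered for flag in ERROR_FLAGS )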
To sum up, I’m interested in advice/arguments about why I should choose one of the two solutions above (and maybe ways to address my concerns with them). But also, is there a reason there aren’t task-based events? It seems like a missing piece of customisation that isn’t adequately addressed by pre/post-job scripts, for the reasons I listed above (but I’m happy to be proven wrong there).