OnJobFailed event not triggering for "automatic failure detection"

Deadline version

I have an event plugin that listens to OnJobErrorCallback and OnJobFailedCallback. The error callback fires whenever a task produces an error, and the failed callback fires if I manually fail a task or fail it through something like RepositoryUtils.FailJob(...).

If a job is failed because it has produced overall too many errors (as per repository settings), my OnJobFailedCallback triggers as expected.

My problem is that I can’t catch a useful event when a job is automatically failed because a task has produced too many errors. Sometimes a task-error like that just leaves the job in a limbo state, but the ones I’m mostly interested in catching are jobs that have a single task plus a post-task.

If this task fails, Deadline knows enough to mark the post-task as failed, and this propagates up to marking the whole job as failed. All that is good, I’m just struggling with catching this event.

The last OnJobErrorCallback event reports the job as being Active, which I suppose makes sense because it takes a moment or two before the job status changes. I also tried refreshing the job object via RepositoryUtils.GetJob, but saw no change.

Any advice?

Instead of trying to catch the error in OnJobError, use OnJobFailed. The OnJobError callback is invoked immediately after an error is reported. This occurs before the Job would be marked as failed, and before the Worker reports an error in the rendering step.

The OnJobFailed callback is invoked immediately after a Job has failed or Tasks for the Job have failed. So you should be able to grab the task report with the error in it from that callback.

If you’re ever curious about the granular order callbacks happen in, this section of the Event Plugin docs has you covered.

Thanks Justin,

Like I mentioned, the OnJobFailed event does not trigger in my scenario. As best I can tell, it does not trigger when a job is marked as failed because its only task has produced too many errors. Monitor marks the job as failed, but the event never fires.

The event does trigger, however, if the job fails because of too many overall errors.

So my tentative conclusion is that Deadline does something like this:

  1. Task fails
  2. Job only has one task, plus post-task
  3. Mark post-task as failed too, since it will never complete
  4. All tasks in job have failed, mark job as failed

But back in reality, obviously I don’t know exactly how the Deadline logic works. My findings are consistent though: if a job with a single task plus a post-task fails because of the task-error threshold, my OnJobFailed event does not trigger.
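
For readers who want to reason about the sequence above, here is a toy model of it in plain Python. It is only a sketch of the hypothesis as stated in this post, not Deadline's actual logic: `ToyJob`, `fail_task`, and the state strings are hypothetical names, and the rule that the post-task is failed only once every regular task has failed is an assumption chosen to match the single-task case.

```python
from dataclasses import dataclass


@dataclass
class ToyJob:
    """Hypothetical stand-in for a job; not a Deadline API type."""
    task_states: list                 # one state string per regular task
    post_task_state: str = "Pending"  # the post-task waits on the regular tasks
    status: str = "Active"


def fail_task(job, index):
    """Apply the four hypothesised steps after one task hits its error limit."""
    # Step 1: the task fails.
    job.task_states[index] = "Failed"

    # Steps 2-3 (assumption): once no regular task can ever complete,
    # the post-task is failed as well.
    if all(state == "Failed" for state in job.task_states):
        job.post_task_state = "Failed"

    # Step 4: every task, including the post-task, has failed, so the
    # job is marked Failed. In the scenario reported here, this is the
    # path on which no OnJobFailed event is observed.
    if all(s == "Failed" for s in job.task_states + [job.post_task_state]):
        job.status = "Failed"


single = ToyJob(task_states=["Rendering"])
fail_task(single, 0)
print(single.status)
```

In this model a job with several tasks stays Active when only one of them fails, which roughly corresponds to the "limbo" state mentioned earlier in the thread.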

Oh, I see. I thought you were expecting OnJobError to show the Job as Failed, which happens later.

In the Monitor under Tools->Job Settings->Failure Detection what have you got set for # of errors to fail a task and job respectively? And is it a specific plugin you’re seeing this with? Those are the two variables I think would affect testing this the most.

I wonder if we’re hitting some really specific edge case with the order these events are happening in, as OnJobFailed should go off every time a job enters the Failed state. But maybe the “all tasks are failed therefore the job is failed” logic is misbehaving.

Let me know!

So, our global defaults are

Job failed = 50 errors
Task failed = 10 errors

But I think this simple test below illustrates the issue I’m having.

Does not work:

Works as expected:

Both jobs are of status Failed, but the one that triggered because “all tasks are failed” did not call my OnJobFailed event.

PS: You might notice that I’m catching all possible events in this plugin while I’m testing. It’s just to make sure I haven’t actually gone completely insane while we figure this out.

I appreciate the anti-insanity steps here! You’re not nuts, this is borked.

It does work if instead of a post-job script you use a post-task script. At least on that’s the case.

That part of Deadline is outside of what can be patched though. So we’ll have to come up with some work-around till that fix goes live.

Thanks Justin.

If we’re in agreement that this actually isn’t working, I can stop trying to force it.

Since this is a very specific use-case on our farm, I can probably anticipate whether or not the task, and consequently the job, will fail from the OnJobError event based on some light maths.
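
A minimal sketch of what that "light maths" could look like, in plain Python. The function name and parameters are hypothetical, the default limits simply mirror the 10 and 50 quoted earlier in this thread, and the `>=` comparison is an assumption about how the thresholds are applied. The error counts are assumed to already include the error that was just reported.

```python
def job_likely_to_fail(task_errors, job_errors, task_count,
                       task_limit=10, job_limit=50):
    """Guess, from inside an error handler, whether the job is about to fail.

    task_errors: errors reported so far by the task that just errored
    job_errors:  errors reported so far by the whole job
    task_count:  number of regular tasks in the job (post-task excluded)
    """
    # Too many errors overall: the job fails regardless of task count.
    if job_errors >= job_limit:
        return True

    # The task has hit its own limit. Only for a single-task job does
    # that imply the whole job will fail (the case in this thread).
    task_will_fail = task_errors >= task_limit
    return task_will_fail and task_count == 1


# Tenth error on the only task of a job: expect the job to fail.
print(job_likely_to_fail(task_errors=10, job_errors=10, task_count=1))
```

Because the limits are passed in rather than looked up, this only works if the values are kept in sync with the repository settings by hand.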

I’m using a post-job script because I’m lazy, and the same script runs for jobs with single and multiple tasks :slight_smile:

Appreciate your time, cheers.

Notice to future forum searchers:

I wasn’t able to find a way to query repository settings like global Failure Detection (possibly by design), so there was no way to dynamically calculate if a job was about to fail from OnJobError.

I ended up setting a flag on the jobs after sending notifications in OnJobFailed and using the OnHouseCleaning callback to look for Failed jobs without this flag. If found, notifications would be triggered for those jobs and the flag set.

Various half-hearted attempts are implemented to reset flags on the various started/resumed/requeued triggers, but I have no doubt that double-notifications will be in my future. For now though, it works great, albeit with a small delay between an undetected fail and the house cleaning trigger.
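
The flag-and-sweep workaround described above can be sketched roughly as follows. This is plain Python with stand-in objects rather than the Deadline API: `JobRecord`, `FLAG_KEY`, and the handler names are hypothetical, and where the flag is actually stored on a real job is left open. In a real event plugin the three handlers would be wired to the failed, house cleaning, and requeue/resume callbacks.

```python
from dataclasses import dataclass, field

FLAG_KEY = "FailNotified"  # hypothetical flag name


@dataclass
class JobRecord:
    """Hypothetical stand-in for a job object; not a Deadline API type."""
    job_id: str
    status: str
    extra_info: dict = field(default_factory=dict)


def notify_once(job, send):
    """Send a notification unless this job is already flagged."""
    if job.extra_info.get(FLAG_KEY) == "True":
        return False
    send(job)
    job.extra_info[FLAG_KEY] = "True"
    return True


def on_job_failed(job, send):
    """Normal path: the failed event fired, so notify and flag the job."""
    return notify_once(job, send)


def on_house_cleaning(jobs, send):
    """Sweep: catch Failed jobs whose failed event never fired."""
    caught = []
    for job in jobs:
        if job.status == "Failed" and notify_once(job, send):
            caught.append(job.job_id)
    return caught


def on_job_requeued(job):
    """Reset the flag so a later failure is reported again."""
    job.extra_info.pop(FLAG_KEY, None)


sent = []
jobs = [JobRecord("a", "Failed"), JobRecord("b", "Active")]
print(on_house_cleaning(jobs, sent.append))
```

Routing both paths through a single `notify_once` helper keeps the check and the flag update together, which limits double notifications to the cases where a reset handler fires unexpectedly.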