AWS Thinkbox Discussion Forums

Plugin/event callback for task timeout?

Can I get some kind of a callback on my job plugin or an event listener when a task is killed due to hitting the job’s task timeout limit (either auto-calculated or not)? I can’t find anything in the API docs.

Thanks

Hello Nathan,

I don’t believe this is currently possible for the timeout you want, but I will ask the devs to add it to the wishlist, as it doesn’t seem logical to support callbacks for some timeouts but not others.

Thanks Dwight, this would be a great thing to have.

Just to add to this request a little, it would be great to be able to control the AbortLevel of the task when it times out, both at the job level and within the proposed task timeout callback. We don’t necessarily want tasks re-queueing when they time out.

Hello Nathan,

I will add that to the request. Thank you for adding it.

Just wanted to check the status of this request again, as the issue with timeouts automatically re-queueing can be a pretty big time-waster.

Hello,

I can’t seem to find the issue I filed in May, so I have recreated it. I will try to get some feedback from the devs for you as soon as possible.

OK thanks. On a related note, is there a formal bug tracking portal anywhere?

Hello Nathan,

No, there is no external facing bug tracking at this time.

Regarding the initial issue, I got the following:

There is an OnJobError callback for event plugins:
docs.thinkboxsoftware.com/produc … 18212bd5cb
The third parameter it accepts is a Report object, and you could check its ReportMessage property to see if it’s a timeout error.
docs.thinkboxsoftware.com/produc … eport.html
The first and second parameters are the job and task object respectively, which you can use to control the state of the job and/or task. For example, if you wanted to fail the task, you could call RepositoryUtils.FailJobTasks().
docs.thinkboxsoftware.com/produc … 2397befcf2
Finally, timeouts for a job can be configured to throw an error (AbortLevel.Major), notify the user and requeue (AbortLevel.Minor), and complete the task (AbortLevel.Success). Perhaps all we need to do here is add an option to fail the task (AbortLevel.Fatal).
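
To illustrate, here is a rough sketch of an event plugin that catches a timeout error in OnJobError and fails the task instead of letting it requeue. This is only a sketch: it assumes the standard event-plugin entry points, that a timeout error’s ReportMessage actually contains the word “timeout”, and that FailJobTasks() takes the job plus a list of tasks, so check the linked docs before relying on it.

from Deadline.Events import DeadlineEventListener
from Deadline.Scripting import RepositoryUtils

def GetDeadlineEventListener():
    return TimeoutListener()

def CleanupDeadlineEventListener(eventListener):
    eventListener.Cleanup()

class TimeoutListener(DeadlineEventListener):
    def __init__(self):
        # Hook the OnJobError callback mentioned above.
        self.OnJobErrorCallback += self.OnJobError

    def Cleanup(self):
        del self.OnJobErrorCallback

    def OnJobError(self, job, task, report):
        # Assumption: a timeout error's report message mentions "timeout";
        # check an actual timeout report for the exact wording.
        if "timeout" in report.ReportMessage.lower():
            # Fail the task rather than letting the default requeue happen.
            # (Argument list assumed; see the RepositoryUtils docs above.)
            RepositoryUtils.FailJobTasks(job, [task])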

Let me know if this helps.

That’s exactly what I’m looking for, so it would be great if that extra option could be added (hopefully within the 8.x cycle). We want it to be configurable at submission time, but it would also probably make sense to have the default behavior be a repository configuration option.

Hello,

I have passed your thoughts along to the devs, and will get back to you when I have more on this.

So, Ryan Russell got this one done for you. It looks like it’ll be in the next 8.1 beta build. There should also be a few more options for what to do when tasks time out (Complete, Fail, Requeue), which should be great for the V-Ray DBR job types.

Would others have much use for an extra Timeout setting (e.g. AbortLevel.Warning) where a job that hits its Timeout limit sends a notification/warning (which triggers a callback) and then continues to render?

We lose a lot of time to jobs that narrowly hit their Timeout settings with only 5% of the job remaining. Currently the best behaviour we can get is to change the Timeout settings so it does not happen again (using the Warning callback to trigger a custom event plugin), but jobs set to ‘Warning’ (our best option here) still automatically requeue themselves when they hit their Timeout.

The “Notify” setting is also close to what we need, but I can’t find whether it triggers a callback, and we ideally don’t want an email sent to our users every time we need to recalculate a Timeout.

Having an additional option, or a ‘Timeout Warning’ that triggers a callback, say, 5 minutes before the real Timeout, would give us the chance to recalculate the Timeout settings before the job either fails or re-queues itself and we lose the work.

I like this as a general feature idea. Right now, job plugins have an awareness of things like task progress, but event plugins have zero awareness of a job while it’s actually running (since they’re only invoked in response to external events). For people who use Deadline’s default job plugins, allowing them to drop in a custom event plugin to make dynamic changes to running jobs would provide a more neatly modular approach to site customization (compared to, say, having to copy or modify a job plugin). I wonder if the right long-term solution would be to switch to a tag- or pattern-based event callback system, rather than only permitting a hardcoded set of callbacks.

We use a custom job plugin and internal process execution and monitoring code (running as another layer inside the Deadline task process) so that we can do things like abort tasks if they fail to produce any output within a certain period of time. Since this internal code is what parses out application-specific “progress-like” messages and homogenizes them for the slave to parse, we can also track progress updates, calculate the “velocity” of a task, and use it to project a task timeout, figure out if the timeout should be extended and by how much, etc.
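
As a rough illustration of the projection logic described above (a purely hypothetical helper, not part of the Deadline API), the idea is to record timestamped progress samples and extrapolate from their rate:

import time

class TaskVelocityTracker:
    """Records (timestamp, progress) samples parsed from homogenized
    progress messages and projects how long the task still needs."""

    def __init__(self):
        self.samples = []  # (epoch seconds, progress percent)

    def record(self, progress_percent):
        self.samples.append((time.time(), float(progress_percent)))

    def velocity(self):
        # Progress percent per second across the sampled window.
        if len(self.samples) < 2:
            return None
        (t0, p0), (t1, p1) = self.samples[0], self.samples[-1]
        return (p1 - p0) / (t1 - t0) if t1 > t0 else None

    def projected_seconds_remaining(self):
        v = self.velocity()
        if not v or v <= 0:
            return None
        return (100.0 - self.samples[-1][1]) / v

    def should_extend_timeout(self, seconds_left_on_timeout, margin=1.25):
        # Extend when the projected remaining time (with a safety margin)
        # exceeds what is left on the current timeout.
        remaining = self.projected_seconds_remaining()
        return remaining is not None and remaining * margin > seconds_left_on_timeout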

This kind of thing is incredibly useful when trying to maximize resource efficiency (especially if you’re paying for CPU/license time in a partial or full cloud environment), but right now the level of customization required to get this sort of control is prohibitive to many users/facilities.

Can you elaborate a bit on how you envision that working? We’re definitely looking to table some longer-term items around how we handle Plugins, and I’m of the opinion that our current event plugin system is the one most in need of a facelift. It could also be a separate system altogether; either way, I do think we need to provide a layer between Deadline’s internal logic and the invocation of Plugin functions. Basically, an intercept layer of some kind.

Just thinking out loud at this point…

For the callback system, some possible analogs to what it could look like might be a message queue (e.g. RabbitMQ), Redis’ pub/sub, or even RV’s event system. Then, in a custom job plugin, I imagine being able to do something like:

self.SendEventWithData('my-sweet-event-name', someObject)

In RV, events are just unique strings, which means there’s no need to declare a new event… you can just drop in a call like:

rv.commands.sendInternalEvent('my-custom-event-name', contents=someObject)

If anything wants to respond to a given event, it can bind a handler function to that event at any point using the same identifier. Bindings can even be done with regular expressions, so that a single handler can respond to many events without piecemeal binding.
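
To make that concrete, here is a purely hypothetical sketch of what such a pattern-based binding layer could look like (none of these names exist in Deadline or RV today):

import re

class EventBus:
    def __init__(self):
        self._bindings = []  # (compiled pattern, handler) pairs

    def bind(self, pattern, handler):
        # One handler can cover many event names via a regular expression.
        self._bindings.append((re.compile(pattern), handler))

    def send(self, event_name, contents=None):
        for pattern, handler in self._bindings:
            if pattern.match(event_name):
                handler(event_name, contents)

bus = EventBus()
bus.bind(r'task-progress-.*', lambda name, data: print(name, data))
bus.send('task-progress-updated', {'taskId': 3, 'progress': 42.0})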

In a system like this, I would really like to be able to enforce synchronous/asynchronous processing for a given event, ideally at call time rather than as a forward declaration. Obviously asynchronous event handling across systems could mean restricting the types used for event “contents”.

I’ll brain-dump more if I think of anything.
