I’m wondering if any thought has been put into allowing all queue events to be handled asynchronously. In developing an internal event plugin, I’ve noticed that callbacks bound to the OnJobDeletedCallback are executed by the monitor process when a job is deleted through the monitor. My guess is that this same behavior would apply to the OnJobPendedCallback, OnJobRequeuedCallback, OnJobResumedCallback, etc. for operations performed through the monitor.
This is less than ideal for a couple of reasons:
The callback code may be non-trivial, and deleting a large number of jobs could render a machine useless for some time, especially if the events are handled synchronously on that machine. I’m already seeing a noticeable delay when deleting a single job from the queue.
Machines viewing and managing the queue may not all be configured in a way that would allow them to properly execute said callback code.
With this in mind, is it possible to only allow certain machines to handle events (e.g. only machines with active slaves)?
P.S. If I submit or delete a job using the REST API, where is the event listener executed?
We have thought about this, and we are considering adding a background thread to the Monitor to process these sorts of things. If someone tries to close the Monitor while this background thread’s queue isn’t empty, it would pop up a warning indicating that there are still events that need to be processed.
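A minimal sketch of the background-thread idea, for illustration only. This is not the Monitor's code: `BackgroundEventProcessor` and its methods are hypothetical names, and the "warning on close" is reduced to a `has_pending()` check that a UI could consult before exiting.

```python
import queue
import threading
import traceback


class BackgroundEventProcessor:
    """Hypothetical helper: runs event callbacks off the calling (UI) thread."""

    def __init__(self):
        self._queue = queue.Queue()
        self._pending = 0
        self._lock = threading.Lock()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, callback, *args):
        # Count the event before queuing it so has_pending() is accurate at once.
        with self._lock:
            self._pending += 1
        self._queue.put((callback, args))

    def _run(self):
        while True:
            callback, args = self._queue.get()
            try:
                callback(*args)
            except Exception:
                # A failing callback must not kill the worker thread.
                traceback.print_exc()
            finally:
                with self._lock:
                    self._pending -= 1
                self._queue.task_done()

    def has_pending(self):
        """True while events are queued or running; a UI could warn on close."""
        with self._lock:
            return self._pending > 0

    def wait(self):
        """Block until every submitted event has been processed."""
        self._queue.join()
```

The caller returns immediately from `submit()`, so deleting many jobs would not block the interface; the cost is that closing the application early could drop events unless it checks `has_pending()` first.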
Another solution could be to add an event collection to the database, which would act as a FIFO queue. Whenever an event would be triggered, it would be added to this collection, and then Deadline’s housecleaning operation would pop them off.
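The database-queue variant could look roughly like the sketch below. It assumes nothing about Deadline's actual schema or housecleaning code: `EventCollection` is an in-memory stand-in for a database collection, and `housecleaning` and the handler mapping are hypothetical names.

```python
from collections import deque


class EventCollection:
    """In-memory stand-in for a database collection used as a FIFO queue."""

    def __init__(self):
        self._events = deque()

    def push(self, event_name, job_id):
        # Triggering an event only records it; no callback runs here.
        self._events.append({"event": event_name, "job_id": job_id})

    def pop(self):
        # Oldest event first; None when the queue is empty.
        return self._events.popleft() if self._events else None

    def __len__(self):
        return len(self._events)


def housecleaning(collection, handlers):
    """Drain the queue in order, calling the handler registered for each event."""
    processed = []
    while True:
        event = collection.pop()
        if event is None:
            break
        handler = handlers.get(event["event"])
        if handler is not None:
            handler(event["job_id"])
        processed.append(event)
    return processed
```

Because only the machine running the drain loop executes handlers, this design also makes it straightforward to restrict event handling to a chosen set of machines.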
I like this idea a lot. I think it would be good to limit the event handling to certain machines (perhaps slaves only).
As far as OnJobStarted, OnJobFinished, and possibly OnJobError go, are these always executed on the slave that was first to pick up the job, last to finish a task in a given job, and that executed the errored task, respectively? If so, it may be beneficial to leave events of this type the way they are, though I realize this would divide events into two execution patterns.
I think having a split might make sense. I agree that the ones you mentioned should be handled the way they currently are. I think OnJobSubmitted should be handled the way it currently is as well, since we want it to be synchronous in case the event plugin is setting job properties as it is being submitted.
We also want to look at expanding the event system more, including events for when tasks change states. We haven’t done so yet because of the impact this would currently have on the Monitor’s performance, but by having a queue in the database for them, that is no longer an issue.
I was thinking about this some more, and if support is added for working with remote repositories without having the remote server mounted, it will become even more important to confine the pool of machines handling the events to members of the actual farm (slaves, Pulse, etc.), in case the event handler code needs to interact with the local filesystem.
Just out of curiosity, is this the kind of change that might make it into 7.0, or are you guys pretty much locked in on major new features and changes for 6.3/7.0?
It’s possible we could do it for 7.0, but we can’t guarantee at this point. We’ve had some internal discussions on how this might work, and now we’re looking at the development that would be required. We’ll probably know one way or another in a few weeks.
Cool, just curious. In the meantime, I’ll probably end up implementing my own external queue for cleaning up deleted jobs to avoid the blocking nature of the OnJobDeleted event.
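One way such an external queue might work is a spool directory: the blocking callback only writes a small marker file, and a separate process on a farm machine does the real cleanup later. This is a sketch under assumptions. The function names, the JSON layout, and the spool directory are all hypothetical, and processing order is not guaranteed.

```python
import json
import os
import tempfile
import uuid


def enqueue_deleted_job(spool_dir, job_id):
    """Cheap step for the callback: record the job ID and return."""
    os.makedirs(spool_dir, exist_ok=True)
    final_path = os.path.join(spool_dir, uuid.uuid4().hex + ".json")
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump({"job_id": job_id}, f)
    # Rename last so the consumer never sees a half-written file.
    os.replace(tmp_path, final_path)
    return final_path


def drain(spool_dir, cleanup):
    """Run by a separate process: clean up each recorded job, then remove its marker."""
    processed = []
    for name in sorted(os.listdir(spool_dir)):
        if not name.endswith(".json"):
            continue  # skip in-progress .tmp files
        path = os.path.join(spool_dir, name)
        with open(path) as f:
            job_id = json.load(f)["job_id"]
        cleanup(job_id)
        os.remove(path)
        processed.append(job_id)
    return processed


if __name__ == "__main__":
    spool = tempfile.mkdtemp()
    enqueue_deleted_job(spool, "job-1")
    print(drain(spool, lambda job_id: None))
```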
Just an update. It looks like we’ll be able to get the async system into version 7. All job events will be processed asynchronously except in these cases:
OnJobSubmitted: This will still be processed by the submission process (i.e., deadlinecommand).
OnJobStarted: This will still be processed by the slave that starts the job.
OnJobFailed: If a slave fails a job, it will process the event. Manually marking a job as failed will go through the async system.
OnJobFinished: If a slave finishes a job, it will process the event. Manually marking a job as complete will go through the async system.
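The exceptions listed above can be summarized as a small decision function. This only restates the rules in this post; `where_processed` and the returned labels are hypothetical and not part of any Deadline API.

```python
ASYNC = "async system"
SUBMITTER = "submission process"
SLAVE = "slave"


def where_processed(event_name, triggered_by_slave=False):
    """Return which component would process the event under the rules above."""
    if event_name == "OnJobSubmitted":
        return SUBMITTER
    if event_name == "OnJobStarted":
        return SLAVE
    if event_name in ("OnJobFailed", "OnJobFinished") and triggered_by_slave:
        return SLAVE
    # Everything else, including manual fail/complete, goes through the queue.
    return ASYNC
```

For example, `where_processed("OnJobFinished", triggered_by_slave=True)` returns `"slave"`, while the same event caused by manually marking a job complete returns `"async system"`.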