AWS Thinkbox Discussion Forums

delete on complete vs dependencies

We have a peculiar race condition.
Very frequently (hundreds of times a day), semi-automated submissions are fired off, each generating 4-5 daisy-chained jobs. To reduce clutter, the jobs are set to auto-delete themselves on completion. However, we would expect the dependencies to be handled before this auto-deletion occurs.

We have the dependency tree like this:
JobA -> JobB -> JobC -> JobD

Sometimes this happens:
JobA completes, but it gets deleted before it can notify JobB to release. JobB never starts.

We are aware of the flag that releases pending jobs when their dependencies are deleted, but turning it on interferes with our cleanup efforts. There are legitimately failed jobs that we sometimes mass delete, and simply turning that flag on would wake up a bunch of dependents that should not be started.

Ideally, Deadline would handle pending jobs before it handles job deletion (for the same job set), or run a round of dependency checks on the jobs being deleted before actually removing them, thus eliminating the race condition.

Hey Laszlo,

Thanks for reporting this. The problem occurs because dependency checking is part of the pending job scan, and job cleanup is part of the housecleaning scan. Since these scans are done asynchronously of each other, it’s possible for this behavior to occur.
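To make the ordering problem concrete, here is a minimal, deterministic sketch of the two scans. It is not Deadline code: the `Job` class, the status strings, and both scan functions are hypothetical stand-ins, and it assumes the "release pending jobs when a dependency is deleted" flag is off. Running the scans in one order releases JobB; running them in the other order leaves it stuck.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    # Hypothetical job model, not the Deadline API.
    name: str
    status: str = "pending"  # "pending", "active", or "completed"
    depends_on: list = field(default_factory=list)
    delete_on_complete: bool = False


def pending_job_scan(repo):
    """Release pending jobs whose dependencies all exist and are completed."""
    for job in repo.values():
        if job.status != "pending":
            continue
        deps = [repo.get(name) for name in job.depends_on]
        # A deleted dependency (None) never satisfies the check here,
        # which models the "release on deleted dependency" flag being off.
        if all(dep is not None and dep.status == "completed" for dep in deps):
            job.status = "active"


def housecleaning_scan(repo):
    """Delete completed jobs that are flagged for auto-deletion."""
    doomed = [name for name, job in repo.items()
              if job.status == "completed" and job.delete_on_complete]
    for name in doomed:
        del repo[name]


def make_repo():
    job_a = Job("JobA", status="completed", delete_on_complete=True)
    job_b = Job("JobB", depends_on=["JobA"], delete_on_complete=True)
    return {job.name: job for job in (job_a, job_b)}


# Pending job scan wins the race: JobB is released, then JobA is cleaned up.
good = make_repo()
pending_job_scan(good)
housecleaning_scan(good)

# Housecleaning wins the race: JobA is gone before JobB was ever evaluated.
bad = make_repo()
housecleaning_scan(bad)
pending_job_scan(bad)

print(good["JobB"].status, bad["JobB"].status)
```

Since the real scans run independently, either ordering can occur for a given job, which matches the intermittent behavior described above.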

Probably the quickest solution is to move this job cleanup code from the housecleaning thread to the pending job scanning thread, and do it after releasing any pending jobs. However, this would mean large batches of auto-delete or auto-archive could affect the interval at which pending jobs are released (which is why we split these up into separate threads in the first place).

The other option is to respect the “Only cleanup jobs that have not been modified for this many hours” option in Repository Options -> Job Settings -> Cleanup -> Automatic Job Cleanup. If we default this to something like 2 hours (it’s currently 0), that should ensure that pending jobs are released before the jobs are deleted.
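As a rough illustration of what such a threshold does, the sketch below checks whether a job has gone unmodified for long enough to be cleaned up. The function name, its parameters, and the timestamps are all made up for the example; this is not how Deadline implements the option.

```python
from datetime import datetime, timedelta


def eligible_for_cleanup(last_modified, now, min_unmodified_hours):
    """Return True if the job has not been modified for at least the given hours.

    Hypothetical helper: the names and the comparison are assumptions.
    """
    return now - last_modified >= timedelta(hours=min_unmodified_hours)


# Arbitrary example timestamps.
now = datetime(2024, 1, 1, 12, 0)
just_completed = now - timedelta(minutes=5)
completed_long_ago = now - timedelta(hours=3)

# With a threshold of 0, a job that just completed can be deleted immediately.
print(eligible_for_cleanup(just_completed, now, 0))
# With a threshold of 2 hours, it is skipped until it has aged.
print(eligible_for_cleanup(just_completed, now, 2))
print(eligible_for_cleanup(completed_long_ago, now, 2))
```

The delay only helps if it is comfortably longer than the pending job scan interval, so the safe value depends on each farm's settings.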

Thoughts?

Cheers,
Ryan

What if the cleanup did a scan for job dependencies during its deletion process, and released jobs as it cycles through the jobs being deleted?

Tweaking the auto-cleanup hours is still not a robust solution, because each user might have different intervals / settings for the pending job scans.

I think what we’ll do is simply check completed jobs to see if other pending jobs are dependent on them, and only delete the completed jobs if there aren’t any. It would only care about pending dependents, or active dependents that have pending tasks (due to frame dependencies). If the dependents are fully active, complete, failed, suspended, etc., then the completed jobs can be deleted.

This way, we aren’t duplicating the actual dependency checking the pending job scan performs.
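A minimal sketch of that check, under stated assumptions: the `Job` class, the status strings, the `has_pending_tasks` field, and both function names are hypothetical and only illustrate the rule described above. The actual Deadline implementation may differ.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    # Hypothetical job model, not the Deadline API.
    name: str
    status: str  # e.g. "pending", "active", "completed", "failed", "suspended"
    depends_on: list = field(default_factory=list)
    # Only meaningful for active jobs that use frame dependencies.
    has_pending_tasks: bool = False


def blocks_deletion(dependent):
    """A dependent blocks deletion if it still waits on the completed job."""
    if dependent.status == "pending":
        return True
    return dependent.status == "active" and dependent.has_pending_tasks


def safe_to_delete(job, all_jobs):
    """A completed job is deletable only if no dependent still waits on it."""
    if job.status != "completed":
        return False
    dependents = [other for other in all_jobs if job.name in other.depends_on]
    return not any(blocks_deletion(dep) for dep in dependents)


job_a = Job("JobA", "completed")
job_b = Job("JobB", "pending", depends_on=["JobA"])
jobs = [job_a, job_b]

# JobB is still pending, so JobA must be kept for now.
print(safe_to_delete(job_a, jobs))

# Once the pending job scan has released JobB, JobA can be removed.
job_b.status = "active"
print(safe_to_delete(job_a, jobs))
```

Note that the cleanup side never releases anything itself; it only defers deletion until the pending job scan has had its turn.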

Cheers,
Ryan

That’s a great idea, so basically just delay deletion until the dependency handler has done its job. That would work for us!
