AWS Thinkbox Discussion Forums

Dependency Jobs Not Releasing

Hello, I am wondering if someone can help me out with dependent jobs getting stuck in Deadline.

I have batches of render jobs that have a fairly basic render dependency. When the first job in the chain completes, the next job never releases. I have run the dependency test tool and everything is green. The farm also has spare capacity: we have nodes sitting idle while this happens. If we go in and release the jobs manually, they go through without a problem. This does not happen every time, but it happens often enough that we can say it’s a problem.

What else should I be looking at to troubleshoot this? Is there a setting that I am missing in the submission parameters that would help us avoid this issue?

JobA1 > jobA2 (this job is only dependent on the job before it) >>>>>>>>>
JobB1 > jobB2 (this job is only dependent on the job before it) >>>>>>>>> finaljob (clean-up job that runs after all the other jobs have completed)
JobC1 > jobC2 (this job is only dependent on the job before it) >>>>>>>>>
JobD1 > jobD2 (this job is only dependent on the job before it) >>>>>>>>>
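
To make the expected behaviour of the chain above concrete, here is a minimal Python sketch of which pending jobs ought to be released for a given set of completed jobs. It is only a toy model of the layout described in this post: the `DEPENDENCIES` table and the `releasable` function are hypothetical names made up for illustration, and none of this is Deadline's API or its actual pending job scan code.

```python
from typing import Dict, List, Set

# Hypothetical model of the chain described above (not Deadline's API).
# Each job name maps to the list of jobs it depends on.
DEPENDENCIES: Dict[str, List[str]] = {
    "JobA1": [], "jobA2": ["JobA1"],
    "JobB1": [], "jobB2": ["JobB1"],
    "JobC1": [], "jobC2": ["JobC1"],
    "JobD1": [], "jobD2": ["JobD1"],
    # The clean-up job waits for the end of every chain.
    "finaljob": ["jobA2", "jobB2", "jobC2", "jobD2"],
}


def releasable(dependencies: Dict[str, List[str]], completed: Set[str]) -> List[str]:
    """Return dependent jobs that are not completed yet but whose
    dependencies have all completed, i.e. the jobs a scan should release."""
    return sorted(
        job
        for job, deps in dependencies.items()
        if deps and job not in completed and all(d in completed for d in deps)
    )


if __name__ == "__main__":
    # Only JobA1 has finished, so only jobA2 should be released.
    print(releasable(DEPENDENCIES, {"JobA1"}))
```

Comparing a list like this against what is still sitting in the pending state on the farm is one way to confirm that the jobs really are eligible and that the problem is the scan not running, rather than the dependencies themselves.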

I know we had issues back pre-7.0 where the pending job scan would wait for an on-disk lock that would never release. We moved the lock into the database to make it more reasonable.

One thing to try would be running the dependency check in the Monitor to see if it fails. You can find it under ‘Tools’ while in super user mode.

Usually these days it boils down to an event script tying up the repository scan. I usually debug these by running Pulse on some system so it takes over the task of repo repair and pending scans so I can watch the logs and figure it out. Enable ‘verbose logging’ in the Repo config under the ‘Application data’ section of ‘Configure Repository options’.

Feel free to paste log snippets and we’ll dig in here.

Hey Edwin

I did do the dependency check in the Monitor and the jobs that will not release are all showing green.

The dependency check is a different code path which just shows a UI. The “perform pending job scan” actually runs the pending job scan code and outputs everything to the Monitor’s log. It should say why it’s failing.

[attachment=1]2018-05-30 09_41_27-Deadline Monitor.png[/attachment]

[attachment=0]2018-05-30 09_42_33-.png[/attachment]

Hey Edwin

We have a few more jobs that are starting to have this problem; now Tile Assembly jobs are not releasing automatically. I ran the scan as you said and we got no errors, and the jobs released as I would expect (I have copied the result below). Is there a flag that I am missing in the settings to get this to run automatically again?

Pending Job Scan Interval is set to 60 and the only flag that we have on is Allow Slaves to Perform the Pending Job Scan if Pulse is not Running.

We are currently running
Repository Version: 10.0.16.3
Integration Version: 10.0.16.3

2018-06-01 08:54:04: Pending Job Scan - Released 140 pending jobs and 0 pending tasks in 17.943 s
2018-06-01 08:54:04: Pending Job Scan - Done.
2018-06-01 08:54:04: Processing Pending Job Events
2018-06-01 08:54:04: Pending Job Events - Checking for pending job events…
2018-06-01 08:54:04: Pending Job Events - Processing 0 job events
2018-06-01 08:54:04: Pending Job Events - No more job events to process
2018-06-01 08:54:04: Pending Job Events - Done.
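
For anyone collecting these logs over time, a small Python sketch like the one below can pull the numbers out of the scan summary line, which makes it easier to spot when scans stop happening or start taking much longer. The regular expression is based only on the line format pasted above, so treat it as an assumption: other Deadline versions may word the line differently. `parse_scan_summary` is a made-up helper name, not part of any Deadline tooling.

```python
import re
from typing import Optional, Tuple

# Pattern assumed from the log line pasted above; other versions may differ.
SCAN_LINE = re.compile(
    r"Pending Job Scan - Released (\d+) pending jobs "
    r"and (\d+) pending tasks in ([\d.]+) s"
)


def parse_scan_summary(line: str) -> Optional[Tuple[int, int, float]]:
    """Return (jobs_released, tasks_released, seconds) for a scan summary
    line, or None if the line is not a scan summary."""
    match = SCAN_LINE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2)), float(match.group(3))


if __name__ == "__main__":
    sample = (
        "2018-06-01 08:54:04: Pending Job Scan - Released 140 pending jobs "
        "and 0 pending tasks in 17.943 s"
    )
    print(parse_scan_summary(sample))
```

A manual scan releasing 140 jobs at once, as in the paste above, suggests the automatic scan had not completed for some time before it was triggered by hand.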

Hmm. Normally the Slaves perform this operation, and I tested SP16 with that on Friday for a different reason (practicing asset dependencies) and it worked fine there…

Are you running Pulse anywhere? You can check by creating a Pulse panel in the ‘view’ menu of the Monitor. If the Slaves think Pulse is running for some reason, they won’t run the house cleaning operations. If it is running, I would check its logs for what it’s doing with house cleaning. Here’s info on how to find those logs:
docs.thinkboxsoftware.com/produ … html#pulse

If it’s not running, this gets much more difficult, as it could be that one or more Slaves are failing to complete the house cleaning for some reason (they roll a virtual die to see if they should do house cleaning). You can try stopping your current Pulse and running Pulse on one of the render nodes to see if it’s related to the machines somehow.
