Deadline not releasing dependent jobs

Hi,

Deadline seems to have stopped releasing its dependent jobs over the last 24 hours. We’ve not changed anything in our repository settings. We run 60 slaves with a variety of different packages, and we run Pulse. If you open up the dependency window for a job and tick ‘resume on completed’ off and then on again, the job releases.

Has anyone else come across this issue, and do you know how we might resolve it?

Nick Weeden

Hi Nick,

I wonder if Pulse’s cache has become corrupted. Changing that dependency setting for each job is likely triggering a cache refresh, which could explain why it helps. Try restarting the Pulse application to see if that resolves the issue.

Cheers,

  • Ryan

I restarted Pulse earlier this morning and it’s still not happy. We are getting a load of weird errors in Pulse saying something like:

"Minor exception - can’t load from disk, IOException: Error in file S:\DeadlineRepository… (Job has been corrupted, it is recommended that this job be removed from the repository: 000_050_00…) "

but I’m not sure how to resolve this because there’s loads of them!

Nick

Looks like there are some corrupt jobs in the repository. It’s concerning that there are so many of them. I wonder if they’ve been building up over time, or if something happened over the last 24 hours to corrupt a bunch of them. After restarting Pulse, did your dependent jobs start getting released on their own again? It could be that the dependency and corrupt job problems are completely separate, or it could be that the corrupted jobs are causing the dependency problem.

The job id (000_050_00…) is printed out with each corrupt job, so you can use that to find them in S:\DeadlineRepository\jobs. You can either choose to just delete the corrupt job folders, or you could create a temporary directory in the repository root (S:\DeadlineRepository\jobs_temp) and move the corrupt job folders there. The latter is only necessary if you want to salvage the scene files that were submitted with the job. With the corrupt jobs out of the way, I would expect things to return to normal.
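If there are too many folders to move by hand, the move could be scripted. The following is only a rough sketch, not an official tool: it assumes each job lives in a folder named after its job ID directly under the jobs folder, and the function name `quarantine_jobs` is made up for illustration. You would still need to collect the job IDs from the Pulse log yourself.

```python
import shutil
from pathlib import Path


def quarantine_jobs(jobs_dir, temp_dir, job_ids):
    """Move each listed job folder from jobs_dir into temp_dir.

    Assumption: every job is stored as jobs_dir/<job id>.
    Returns the IDs that were actually moved; IDs without a
    matching folder are skipped.
    """
    jobs_dir = Path(jobs_dir)
    temp_dir = Path(temp_dir)
    temp_dir.mkdir(parents=True, exist_ok=True)

    moved = []
    for job_id in job_ids:
        src = jobs_dir / job_id
        if not src.is_dir():
            continue  # already gone, or not a job folder
        shutil.move(str(src), str(temp_dir / job_id))
        moved.append(job_id)
    return moved
```

You would call it with the jobs folder (S:\DeadlineRepository\jobs), the temporary folder (S:\DeadlineRepository\jobs_temp) and the list of IDs taken from the log. Try it on a couple of jobs first before running it on the whole list.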

Cheers,

  • Ryan

After we restarted there was no difference; the jobs still weren’t released by Deadline. We’ve also noticed that jobs set to delete on completion aren’t deleting (this is from about 7.20 this morning). Could this also be linked to the problem?

I’ll try and go through all the corrupt jobs. I was hoping there would be an easier way, as there’s loads of them!

Nick

That could be linked to the problem. There is a “housecleaning” thread in Pulse that releases dependent jobs, deletes jobs marked for deletion on completion, etc. Normally, a corrupt job shouldn’t give this thread any problems, because it just goes through one job at a time and if there is an error, it moves on. On the Pulse machine, go to the Pulse UI and select Help -> Explore Log Folder. Then find the most recent Pulse log and post it. We’ll take a look to see if there are other red flags that stand out.
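To illustrate what “one job at a time, and if there is an error, it moves on” means, here is a toy sketch. It is not Deadline’s actual code; `run_housecleaning` and `process_job` are made-up names, and the only point is that an exception on one job is recorded and the loop carries on.

```python
def run_housecleaning(jobs, process_job):
    """Call process_job on every job; collect failures instead of stopping."""
    failures = []
    for job in jobs:
        try:
            process_job(job)
        except Exception as exc:
            # A job that cannot be processed is noted and skipped,
            # so the jobs after it still get handled.
            failures.append((job, exc))
    return failures
```

With a loop like this, a corrupt job only produces an error message for itself, which is why corrupt jobs alone would not normally explain the other jobs being stuck.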

Cheers,

  • Ryan

I just checked the housecleaning code, and corrupt jobs shouldn’t prevent other jobs from being released for dependencies or auto-deleted. Hopefully the Pulse log will provide some insight…

I shut down Pulse, removed the host name from the repository settings, and then restarted Pulse. It asked if I wanted to turn it on in the repository, so I did. It’s still not deleting jobs on completion and it’s not releasing dependencies, but it doesn’t seem to have printed out a huge list of corrupt files.

I’ve attached the log from this morning with the corrupt files; the list is about 6000 long…! I’ve looked in a couple of the folders and they contain a tasks folder and nothing else.
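For anyone wanting to check how many job folders match that pattern without opening each one, something like the sketch below could list them. It is only a heuristic based on the folders I looked at: it assumes the subfolder is literally named “tasks” (compared case-insensitively), and the function name `find_tasks_only_jobs` is made up.

```python
from pathlib import Path


def find_tasks_only_jobs(jobs_dir):
    """Return sorted names of job folders whose only entry is a 'tasks' folder."""
    suspects = []
    for job in Path(jobs_dir).iterdir():
        if not job.is_dir():
            continue  # ignore loose files in the jobs folder
        entries = list(job.iterdir())
        only_tasks = (
            len(entries) == 1
            and entries[0].is_dir()
            and entries[0].name.lower() == "tasks"
        )
        if only_tasks:
            suspects.append(job.name)
    return sorted(suspects)
```

It only reads the folders and changes nothing, so the result can be compared against the job IDs in the Pulse log before anything is moved or deleted.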

Nick

It’s good that Pulse isn’t complaining about corrupted jobs anymore, but I think it would be best to set up a GoToMeeting with you so we can try and figure out why Pulse isn’t doing its normal housecleaning properly. I’ll respond to the ticket you had sent earlier regarding the subject to try and set something up.

Cheers,

  • Ryan

Just an update for anyone watching this thread.

The problem was that the JobRepositoryScan.lock file in \repository\jobs was locked, so Pulse always thought another process was doing the repository scan (which is responsible for releasing dependent jobs, auto-deleting jobs, etc). After deleting the file manually, the repository scan started working again.
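If you want to check for this condition yourself, a small script along these lines could report how old the lock file is and remove it once it looks stale. This is a sketch under assumptions, not something built into Deadline: it assumes the file’s modification time is a reasonable sign of when the lock was last taken, the age threshold is an arbitrary value you pick, and the function names are made up.

```python
import time
from pathlib import Path

LOCK_NAME = "JobRepositoryScan.lock"


def lock_age_seconds(jobs_dir, lock_name=LOCK_NAME):
    """Return seconds since the lock file was last modified, or None if absent."""
    lock = Path(jobs_dir) / lock_name
    if not lock.exists():
        return None
    return time.time() - lock.stat().st_mtime


def remove_stale_lock(jobs_dir, max_age_seconds, lock_name=LOCK_NAME):
    """Delete the lock file if it is older than max_age_seconds.

    Returns True if the file was deleted, False otherwise.
    """
    age = lock_age_seconds(jobs_dir, lock_name)
    if age is None or age < max_age_seconds:
        return False
    (Path(jobs_dir) / lock_name).unlink()
    return True
```

If another process still has the file open, the delete may fail with a permission error on Windows, in which case that process needs to be closed first.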

Cheers,

  • Ryan