Job suspend issues on high-latency connection

I just ran into a problem while attempting to suspend two large jobs over a high-latency connection (in our other studio).

Two jobs, one with 1045 tasks and one with 778. Each was running with a single slave. I tried to suspend them one at a time. Only about a third of the tasks actually ended up in the Suspended state, the job kept running, and its status counted up from “Rendering (1)” until it eventually hit “Rendering (6)”. The slaves just kept picking up tasks beyond the first third that had been suspended.

I tried suspending the jobs repeatedly, but nothing happened after that. Eventually, I just VNC’ed into the remote location, started a monitor there, and suspended the jobs. Everything worked fine.

I don’t know if this issue has been addressed or mitigated in Deadline 7, but it seemed worth mentioning. It looks like Deadline may be aborting some database operations midway through due to some kind of hard transaction timeout limit (which I believe Laszlo mentioned having issues with in another thread).

We are aware of this issue, and we hope to either address it or mitigate it during the v7 beta. It’s due to the tasks being stored as individual documents that have to be updated one at a time, so performance slows way down over a high-latency connection. Our main DB developer is currently out of the office, but once she gets back, we’ll have her start looking into this.

Cheers,
Ryan
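
To put rough numbers on the explanation above, here is a sketch of per-document task updates versus a batched write against a MongoDB repository. Everything beyond the general idea is an assumption: the host, database, collection, and field names and the status value are illustrative placeholders, not Deadline’s actual schema.

```python
# Rough sketch (not Deadline code; all names below are assumptions) of why
# suspending tasks one document at a time is painful over a high-latency link.
from pymongo import MongoClient

client = MongoClient("mongodb://remote-repo:27017")   # hypothetical host
tasks = client["deadline_db"]["JobTasks"]              # assumed db/collection names

JOB_ID = "54a1c2..."       # hypothetical job id
SUSPENDED = "Suspended"    # assumed representation of the task status

# One round trip per task: at ~150 ms RTT, 1045 tasks costs well over two
# minutes of pure latency, during which slaves can still dequeue the
# not-yet-suspended tasks.
for task in tasks.find({"JobID": JOB_ID, "Status": "Queued"}, {"_id": 1}):
    tasks.update_one({"_id": task["_id"]}, {"$set": {"Status": SUSPENDED}})

# Batched alternative: the same state change in roughly one round trip.
tasks.update_many({"JobID": JOB_ID, "Status": "Queued"},
                  {"$set": {"Status": SUSPENDED}})
```

In practice you would run one approach or the other; the point is that collapsing the per-task writes into a single request removes the N × round-trip term that dominates on a WAN link.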

Cool, thanks Ryan. Any thoughts on the screwy job status message?

It would be because the internal task counts for the job object are out of whack. This is also something we’re looking at trying to address in v7. At the very least, we think we found a way to ensure that available tasks for jobs that are out of whack like this still get picked up.

OK. Is there any way I can clean that up?

That’s also something we want to add to v7 - a way to right-click or something to “fix” a broken job. Currently, the only thing you can do is modify the *Chunks job properties directly in the database.
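
If you want to see what those properties look like before changing anything, here is a minimal sketch of reading them with pymongo. The host, database, and collection names are assumptions; only the *Chunks field names come from this thread.

```python
# Minimal sketch of inspecting a job's *Chunks counters directly in MongoDB.
from pymongo import MongoClient

client = MongoClient("mongodb://remote-repo:27017")   # hypothetical host
jobs = client["deadline_db"]["Jobs"]                   # assumed db/collection names

JOB_ID = "54a1c2..."  # hypothetical job id

counters = jobs.find_one(
    {"_id": JOB_ID},
    {"_id": 0, "CompletedChunks": 1, "QueuedChunks": 1, "SuspendedChunks": 1,
     "RenderingChunks": 1, "FailedChunks": 1, "PendingChunks": 1},
)
print(counters)
```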

Ah, OK, I see how that works. There’s definitely something screwy going on:

"CompletedChunks" : 67,
"QueuedChunks" : 918,
"SuspendedChunks" : -213,
"RenderingChunks" : 6,
"FailedChunks" : 0,
"PendingChunks" : 0,

This job only has 778 frames. The “Completed” count is correct… the Queued and Suspended counts are way off.
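
For anyone who needs to repair a job in this state before an official fix ships, here is a hedged sketch that recounts the chunk totals from the per-task documents and writes them back. The collection names, the task Status field, and the status strings are all assumptions about the schema; verify them against your own repository database before running anything like this.

```python
# Hedged sketch: rebuild a job's *Chunks counters from its task documents.
# Collection names, the "Status" field, and the status strings are assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb://remote-repo:27017")   # hypothetical host
db = client["deadline_db"]                             # assumed database name
jobs, tasks = db["Jobs"], db["JobTasks"]               # assumed collection names

JOB_ID = "54a1c2..."  # hypothetical job id

# Assumed mapping from a task's status to the job counter it feeds.
STATUS_TO_COUNTER = {
    "Queued":    "QueuedChunks",
    "Rendering": "RenderingChunks",
    "Completed": "CompletedChunks",
    "Suspended": "SuspendedChunks",
    "Failed":    "FailedChunks",
    "Pending":   "PendingChunks",
}

counts = {counter: 0 for counter in STATUS_TO_COUNTER.values()}
for task in tasks.find({"JobID": JOB_ID}, {"Status": 1}):
    counter = STATUS_TO_COUNTER.get(task.get("Status"))
    if counter is not None:
        counts[counter] += 1

# Sanity check: the six counters should sum to the job's task count (778 here).
assert sum(counts.values()) == tasks.count_documents({"JobID": JOB_ID})

jobs.update_one({"_id": JOB_ID}, {"$set": counts})
print("Counters reset to:", counts)
```

It is safest to run something like this while nothing is rendering the job, since any task that changes state mid-count will leave the counters stale again.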

This is becoming an issue for us as well. Our German office connects to our database on occasion, and tends to corrupt any job they touch :(
