We’re currently experiencing an issue where our simulation jobs with a machine limit of one eventually end up stuck in the queued state. We believe this could be the same limit issue we encountered back in the Deadline 7 days. What happens is that a machine will be rendering a job and an error comes up on the job. The job then sits queued until we manually change the machine limit from 1 to 2, and even then it will only pick up one machine. Any ideas on what’s causing this?
This is affecting most of our simulation jobs and is really hurting us right now; we have to babysit all of these jobs.
I remember this bug from the Deadline 7 days as well; it seems like there was a regression? We did not have this problem with v8.0.4.
It sounds like the slave that is rendering when the error occurs never releases the limit. That definitely should not be happening. Are you using custom Limit Groups at all, or just the limit that corresponds to the job?
The jobs usually get submitted with 3 different limits that we’ve created. Not sure if that answers your question; if not, could I get some clarification?
When a job is submitted, a limit is created for that job automatically. These limits are not displayed in the Limits panel; the limits you assign are the ones you’ve created yourself. When this error occurs, do you need to raise the machine limit on the job itself, or the count on one of the limits you assigned to the job?
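If it’s easier to check directly in the database, here is a minimal pymongo sketch showing both kinds of limits side by side. It assumes a database named deadline8db and a LimitGroups collection where the auto-created job-level limit is stored under the job’s ID and your own limits under the names you gave them; treat those names (and the placeholders) as assumptions and adjust them to match your repository.

# Minimal sketch: compare the hidden job-level limit with an assigned limit.
# The database, collection, and field names here are assumptions; adjust
# them to match your repository.
from pymongo import MongoClient

client = MongoClient("mongodb://your-db-server:27017")  # adjust host/port
limits = client["deadline8db"]["LimitGroups"]

# The auto-created, hidden limit that enforces the job's machine limit:
job_limit = limits.find_one({"_id": "your_job_id_here"})

# One of the limits you created and assigned at submission time:
assigned_limit = limits.find_one({"_id": "your_custom_limit_name"})

for doc in (job_limit, assigned_limit):
    if doc is not None:
        print(doc["_id"], "Used:", doc.get("Used"), "Holds:", doc.get("Holds"))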
Here is one stalled slave report as an example; after this, the job was stuck until we increased the machine limit:
STALLED SLAVE REPORT
Current House Cleaner Information
Machine Performing Cleanup: deadline01.scanlinevfxla.com
Version: v8.0.10.4 Release (c19fd2cef)
Stalled Slave: LAPRO0677
Slave Version: v8.0.10.4 Release (c19fd2cef)
Last Slave Update: 2016-11-06 00:59:36
Current Time: 2016-11-06 01:10:11
Time Difference: 10.596 m
Maximum Time Allowed Between Updates: 10.000 m
Current Job Name: [GATE] NXN_017_0620_v0039_npf_v39Cache_cache_flowline_Fire_1
Current Job ID: 5817fbadeb608122d8f06b82
Current Job User: nick.pfeiffer
Current Task Names: 1285
Current Task Ids: 335
Searching for job with id "5817fbadeb608122d8f06b82"
Found possible job: [GATE] NXN_017_0620_v0039_npf_v39Cache_cache_flowline_Fire_1
Searching for task with id "335"
Found possible task: 335:[1285-1285]
Task's current slave: LAPRO0677
Slave machine names match, stopping search
Associated Job Found: [GATE] NXN_017_0620_v0039_npf_v39Cache_cache_flowline_Fire_1
Job User: nick.pfeiffer
Submission Machine: LAPRO3145
Submit Time: 10/31/2016 19:19:25
Associated Task Found: 335:[1285-1285]
Task's current slave: LAPRO0677
Task is still rendering, attempting to fix situation.
Requeuing task
Setting slave's status to Stalled.
Setting last update time to now.
Slave state updated.
LAPRO0639 is the machine currently rendering. The one that stalled was LAPRO0677, and you can see that it’s no longer listed under “Holds”. But the document still shows “Used: 2”…
The other jobs’ LimitGroups are similarly corrupted.
I’m randomly picking jobs here from the many that failed: this one did it again after having had its limit increased. I have not yet increased its limit further:
It’s possible; we’ll have to do some digging on our end. From the limit you posted, I did notice that the InOverage value is less than 0. We made a fix a few versions ago that should prevent that from happening, but this is especially suspicious considering job-level limits never use overage, so that count should never be getting decremented. Can you confirm whether the other offending jobs have this issue as well?
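If it helps, here is a rough pymongo sketch that would flag limits with that symptom: a negative InOverage, or a Used count that no longer matches the number of Holds. It assumes the database is named deadline8db and that the LimitGroups collection uses the Used/InOverage/Holds field names from the document you posted, so treat those names as assumptions.

# Rough check for corrupted limits: negative InOverage, or a Used count that
# no longer matches the number of Holds. Database, collection, and field
# names are assumptions based on the documents posted in this thread.
from pymongo import MongoClient

client = MongoClient("mongodb://your-db-server:27017")  # adjust host/port
limits = client["deadline8db"]["LimitGroups"]

for doc in limits.find():
    used = doc.get("Used", 0)
    overage = doc.get("InOverage", 0)
    holds = doc.get("Holds", [])
    if overage < 0 or used != len(holds):
        print(doc["_id"], "Used:", used, "InOverage:", overage, "Holds:", holds)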
I’ve done some tests, and I’ve confirmed the issue: when a slave stalls and then tries to resume the job, it decrements the InOverage count on the previous job’s limit instead of the in-use value. We’re currently working on a fix.
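To illustrate what we think is happening (this is only a sketch with made-up names, not actual Deadline code): when the stalled slave’s hold on the job-level limit is released, the wrong counter is decremented, so the in-use count never drops back down and no other slave can dequeue the job.

# Illustrative sketch only, with hypothetical names (LimitDoc, release_hold);
# it is not Deadline source code.
from dataclasses import dataclass
from typing import List

@dataclass
class LimitDoc:
    used: int         # slaves currently holding the limit (in-use count)
    in_overage: int   # slaves holding the limit via overage
    holds: List[str]  # names of the slaves holding the limit

def release_hold(limit: LimitDoc, slave: str, buggy: bool) -> None:
    # Release a stalled slave's hold on the job-level limit.
    if slave in limit.holds:
        limit.holds.remove(slave)
        if buggy:
            # Reported behaviour: the overage counter is decremented, so
            # InOverage goes negative and Used never drops back down.
            limit.in_overage -= 1
        else:
            # Expected behaviour: release the in-use counter instead.
            limit.used -= 1

# With a machine limit of 1, the buggy path leaves used == 1 with no holds,
# so no other slave can dequeue the job until the limit is raised by hand.
limit = LimitDoc(used=1, in_overage=0, holds=["LAPRO0677"])
release_hold(limit, "LAPRO0677", buggy=True)
print(limit)  # LimitDoc(used=1, in_overage=-1, holds=[])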
So simply rolling back the Pulse version (to get the housekeeping and other maintenance operations to use the old version) would not fix this issue; we would have to roll back the entire farm or wait for a hotfix from you guys. Correct?
Thanks, we have rolled back the Pulse machines for now (there were other fixes in 8.0.10 that were also required for us), and it seems to behave much better now. What is your expected timeline for the next update?