Deadline 7.2 Limits not working

Since upgrading to version 7.2 it seems that the limits are no longer working. Deadline tracks that the stubs are in use, shows the status as “Maxed Out”, and even lists the stub holders. It doesn’t, however, prevent other slaves from picking up jobs that use the limit. Has something changed with how this is set up, or is this just a bug?

Nick

No ideas immediately come to mind for what might be causing this. I’ve not seen this reported by others.

Are you running Pulse? Also, did you auto-upgrade the client machines or manually upgrade them?

We’ve got a similar problem.
The stubs do get limited, but it’s always more than the limit we set up.

In our case we have a limit group for C4D set to 50. When using this limit we consistently see C4D jobs getting 53 or 54 slaves assigned to them.
We have now set the limit to 46, and the jobs usually get 49 slaves assigned, sometimes 50, but at least that stays within the 50-license limit we have.

Dave

We have had the same issue since we moved to 7.2.

We also have the feeling that the job scheduling order (“pool, priority, first-in…”) isn’t being respected either.

Maybe the source of the problem is the same.

Timor

We’re discussing internally how we’re going to start testing this. Since @maxplugins is busy at the moment, @timor, would you be willing to run some tests for us? We still have to decide exactly what those tests will be, though.

I think we’re in a good place, Timor. Is there a time (probably in the afternoon for you) that we could work together? Could you e-mail me at edwinamsler@thinkboxsoftware.com so we can work out a time?

Here’s a screenshot of what keeps happening on our end.
As you can see, there are 50 slaves rendering the job, the stub limit is set to 48, and Deadline thinks there are only 46 working on the job.

Dave

So, best guess at the moment is that the stubs just aren’t populating properly.

Do you want to try the steps I posted over here and see what comes out of it? (You’d need to change ‘job id’ to the name of the limit stub.)

forums.thinkboxsoftware.com/vie … 11&t=14027
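
The gist of it is just a lookup on the LimitGroups collection in the Mongo shell; with the limit’s name swapped in, it should look roughly like this (“your_limit_name” is only a placeholder):

> db.LimitGroups.find({_id: "your_limit_name"}).pretty();

That should dump the limit’s document so we can see what the database actually has stored for it.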

I’ll try this out tomorrow, Edwin.

Just to be clear, do I need to put the job id in there, or the name of the limit?
I’m a bit confused…

Dave

The name of the limit. Secretly behind the scenes, the per-job limit is just a standard limit group whose name matches a job id. We filter those out in the interface so that we don’t clutter things up.
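
So for a per-job limit the lookup would be the same, just with the job’s id as the _id; roughly like this (the id below is only a made-up example):

> db.LimitGroups.find({_id: "566e1f65c6624a2f5cbe9a10"}).pretty();

For a named limit like your C4D one, you just want the limit’s name in there instead.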

Hi Edwin,

Here’s what I get when I use your code snippet:

> db.LimitGroups.find({_id: "c4d"}).pretty();
{
    "_id" : "c4d",
    "LastWriteTime" : ISODate("2015-12-12T22:42:50.455Z"),
    "Props" : {
        "Limit" : 49,
        "RelPer" : 0,
        "Slaves" : [ ],
        "White" : false,
        "SlavesEx" : [ ]
    },
    "Name" : "c4d",
    "Holds" : [ ],
    "Used" : 0,
    "StubLevel" : 0,
    "Type" : 0
}

The output looks pretty useless to me, but maybe it’ll help.

Dave

It’s a bit odd to me too, TBH. I noticed that it’s showing 49 instead of the 48 in the screenshot. Are you scripting a limit change? If so, I’ll probably need to do the same to try to reproduce it…

I’m having Jon look this over too, since it looks like a bunch of information is missing here for that limit and I want to make sure I’m not misinformed about this stuff. Was this taken while C4D was rendering? I expected the ‘Slaves’ list to contain the names of the Slaves currently rendering.

The change from 48 to 49 is down to me changing the limit a couple of days ago.
We actually have 50 licenses, and I really need to use all of them if possible.
Most of the time the limit works as expected, but I didn’t want to risk raising it all the way to 50.

I probably grabbed the info when no Cinema stuff was rendering; the last few days it’s all been Max and Nuke.
As soon as I have a Cinema job rendering I’ll do this again and post the information.

Dave

Edit:
Okay, there was a Cinema job rendering so I output the info again:

> db.LimitGroups.find({_id: "c4d"}).pretty();
{
    "_id" : "c4d",
    "LastWriteTime" : ISODate("2015-12-15T11:46:57.758Z"),
    "Props" : {
        "Limit" : 49,
        "RelPer" : 0,
        "Slaves" : [ ],
        "White" : false,
        "SlavesEx" : [ ]
    },
    "Name" : "c4d",
    "Holds" : [
        "schimpanse72", "schimpanse49", "schimpanse65", "render1", "render2",
        "schimpanse11", "schimpanse06", "schimpanse07", "schimpanse44", "schimpanse26",
        "schimpanse45", "schimpanse34", "schimpanse15", "schimpanse25", "schimpanse47",
        "schimpanse43", "schimpanse29", "schimpanse42", "schimpanse33", "schimpanse37",
        "schimpanse14", "schimpanse18", "schimpanse08", "schimpanse03", "schimpanse02",
        "schimpanse36", "schimpanse32", "schimpanse38", "schimpanse35", "schimpanse41",
        "schimpanse46", "schimpanse40", "schimpanse16", "schimpanse01", "schimpanse39"
    ],
    "Used" : 35,
    "StubLevel" : 0,
    "Type" : 0
}

Unfortunately there weren’t enough machines available to test the limit, but as soon as I see a job using more than allowed, I’ll do it again.

Thanks! These dumps are perfect for diagnosing. Measurements should give us something to go on.

I didn’t chase Jon well enough yesterday, but looking at this dump it’s fairly obvious the Slaves list is for the whitelist/blacklist and not in fact the tokens for the limit.
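
So the fields that matter here are “Holds” (the machines actually holding a stub), “Used”, and “Props.Limit”. Next time it overshoots, a quick check along these lines (using your “c4d” limit from the dump above) would tell us whether the over-subscription shows up in the database itself:

> var lg = db.LimitGroups.findOne({_id: "c4d"});
> print("Limit: " + lg.Props.Limit + ", Used: " + lg.Used + ", Holds: " + lg.Holds.length);

If Holds ever has more entries than Props.Limit in there, we’ll know the extra slaves are actually acquiring stubs rather than just rendering without them.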

Hey David,

Which version are you guys running exactly? I know we’ve fixed a similar bug in the Deadline 7.2 beta, so you might just be experiencing that if you’re on an older beta build. Note that due to the nature of the bug, it might also be caused just by having some machines on that older beta, if they’re performing House Cleaning operations. If you see any Slaves or Pulse machines running a version older than 7.2.0.18, I would upgrade them and see if that fixes the problem.

Cheers,
Jon

Hi Jon,

All machines are running 7.2.0.18.

Dave

Are you running Pulse? If so, could you get us a Pulse log covering the times when you experienced this issue with one of your limits? Pulse might be returning Limit Stubs that shouldn’t be getting returned, which could lead to this issue.

Hi Jon,

As soon as I see it happening again, I’ll grab a Pulse log for you.
Strangely, the problem hasn’t happened at all this past week; the limit hasn’t gone above 49.

Cheers,

Dave

Edit: I just had an idea. We moved Pulse to a different machine last week; maybe that has made the difference. Before that, it was running on our main file server, which was being slowed down by the huge number of TCP requests (I think) coming from our render nodes on the Pulse port. Since moving Pulse, the file server is back up to speed and the problem hasn’t happened again. Maybe that has something to do with it?

Edit2: I’ve just opened a support ticket pointing to this thread with a Pulse log from one of the days where I definitely had this problem.