AWS Thinkbox Discussion Forums

[D8] NUKE - Limits being ignored [UNSOLVED]

Yeah we can give that a shot!

:slight_smile:

I could limit the nuke renders to specific nodes, if the limit works :stuck_out_tongue:

Well, you could technically do that now, I suppose… The issue you’re hitting is in the bookkeeping for who has a limit and who doesn’t, so if you wanted to just isolate the limit to a specific set of machines, that should work just fine.

I want to discuss some testing ideas with the dev guys for limits… Right now it’s a very hard system to troubleshoot.

Thanks!

That would be very helpful.

Regarding the limits, here’s what I found in Deadline 8.0.10.4:

WORKING:

  • Limit to certain whitelisted machines

NOT WORKING:

  • Limiting to a certain number of licenses/stubs
    Not all machines adhere to the limit, and the limit window never shows the actual number of machines currently trying to render that job.

  • For some reason, a negative “stubs in use” count can appear.

We’ve had fixes for limit groups in SP11 and there will be more in SP12. Code is currently being reviewed to catch negative stubs and correct them, but I’m not sure if that’ll land in SP13 or in 8.1 since there is a lot of logic additions there and we may want it tested via the beta stream first.

Cool, will upgrade to SP11 tonight if we can.

Hate to revive this topic and I know we are still on 8, but we are running into this issue (different studio, different hardware, different people) again.

Did we get to any conclusion on what this was? And if so, is this fixed in later versions? (I assume so, a lot has changed going to 10)

Funny you should ask. I just got off a call 20 minutes ago with a client trying to reproduce this.

I’m guessing it must be something sticky between Nuke and its license server, but we couldn’t reproduce it on the call. For now the client has put their limit back up to the number of licenses they have and will call me when they next see it. We only ever see this with Nuke, so either people are going numb to the problem or newer versions of Nuke just don’t have it. I seriously hope it’s not the first of those.

Here’s what I wrote for that client:

I’m expecting the licenses aren’t being given back right away. I’d want to check the license server for which machines have licenses checked out, and compare that to the list of stub holders from the limit (you can copy and paste this). We’ll want to see what’s different; I’m expecting the license server to have some older machines listed that are not currently stub holders. Ideally, we’d also want the list of machines that threw errors.
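For what it’s worth, that comparison is easy to script once both lists are pasted into text. A minimal sketch, assuming one hostname per line from each source; the diff_holders helper and the hostnames are invented for illustration:

```python
# Compare the license server's checkout list against the Deadline
# limit's stub holders. Hostnames here are made up for illustration.

def diff_holders(license_hosts, stub_hosts):
    """Return hosts holding a Nuke license but no Deadline stub."""
    licensed = {h.strip().lower() for h in license_hosts if h.strip()}
    stubs = {h.strip().lower() for h in stub_hosts if h.strip()}
    return sorted(licensed - stubs)

# render03 has a license checked out but holds no stub -- a candidate
# for a stuck or orphaned Nuke process.
checked_out = ["render01", "render02", "render03"]
stub_holders = ["render01", "render02"]
print(diff_holders(checked_out, stub_holders))  # ['render03']
```

Running the same diff the other way (stubs minus licenses) would catch the opposite failure, where Deadline thinks a machine is rendering but the license server disagrees.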

This is all to test whether the license server is being stickier than it should be. I’m also wondering if cancelling the job and restarting it is what’s causing problems with those licenses. Maybe we’re closing Nuke from batch mode in an unkind way? The test for that would be turning off batch mode for a while and submitting tasks with a size bigger than 1. That would offset the overhead of restarting Nuke every task.

Okay so I have a few things I have found out:

  • We have pipeline tasks running as command-line scripts that use Nuke.

I ran a taskkill /f /im nuke10.5.exe on all nodes when no Nuke renders or scripts were running.
Some nodes (the ones the command succeeded on) reported killing an instance of Nuke.

So one part of the issue could well be orphaned Nuke processes on the nodes that did not close and are hogging a license.
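If it helps, that sweep can be turned into a detection pass instead of a blind kill. This is only a sketch: it assumes you can collect tasklist /fo csv output from each node and that you know (e.g. from the Monitor) which hosts currently have an active Nuke task; the find_orphans helper and the hostnames are made up:

```python
import csv
import io

def find_orphans(tasklist_csv_by_host, hosts_rendering_nuke):
    """Flag hosts that report a nuke10.5.exe process while Deadline
    shows no active Nuke task on them.

    tasklist_csv_by_host: {hostname: text of `tasklist /fo csv`}
    hosts_rendering_nuke: set of hostnames with an active Nuke task
    """
    orphans = {}
    for host, output in tasklist_csv_by_host.items():
        rows = csv.reader(io.StringIO(output))
        # Column 0 is the image name, column 1 the PID in tasklist's CSV.
        pids = [row[1] for row in rows
                if row and row[0].lower() == "nuke10.5.exe"]
        if pids and host not in hosts_rendering_nuke:
            orphans[host] = pids
    return orphans

sample = '"nuke10.5.exe","4242","Console","1","512,000 K"'
print(find_orphans({"render07": sample}, set()))  # {'render07': ['4242']}
```

If the same host keeps showing up with the same PID across repeated scans, that’s a stronger sign of an orphan than a render that’s just slow to exit.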

  • We have 20 nuke_r licenses. I have set the limit to 18 to account for some stickiness, and the problem now seems less apparent.
    Perhaps there is indeed some latency between a node stopping its task and the license server refreshing the license status.

It seems that setting the limit to 18 (with us having 20 licenses) has mitigated the issue.
Still not a perfect solution, though.

No, we should find out why these processes aren’t exiting.

If I remember correctly, Deadline will ask the process to exit via the API, then do something more drastic. I’m not sure, however, whether we send a SIGTERM followed by a SIGKILL. If you’d be willing to check what state those lingering processes are in, I’d appreciate it: mainly what flags were passed in and whether the process is zombied.

I think you can just do something like this on all the machines to gather this info:

for nuke in $(ps -C nuke -o pid=); do
        hostname
        cat "/proc/$nuke/cmdline"
        echo ""
        grep "State:" "/proc/$nuke/status"
        echo ""
done

That should list what commands were fed into Nuke, so we can see what Deadline ran and what it didn’t, along with what state each process was in (running, stopped, zombied). Remove all the newlines if you want to make it a one-liner you can push out through the same process you used to kill Nuke before.

Here’s what I’m expecting to see (used vi as my test):

MyMachine
vi test
State:  T (stopped)

MyMachine
vi test2
State:  T (stopped)

I would love to, but Linux commands won’t work on Windows :wink:

LOL. Whoops. :smiley:

You know, taskkill and all those forward slashes really should have given it away.

Okay, that really puts a kink in my thought process. I guess the next time you see lingering Nukes on a machine, could you grab Process Explorer from Sysinternals? It should say when the app was launched, and we can cross-reference that with the Slave log to see when it started and when it failed to close.
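Once Process Explorer gives you the launch time, the cross-reference itself is just an interval check against the task start/end times read out of the Slave log. A rough sketch, assuming timestamps in a single known format; the owning_task helper, the format string, and the times are all invented:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format

def owning_task(proc_start, task_windows):
    """Return the task whose start/end window contains proc_start,
    or None if no task matches (a likely orphan)."""
    started = datetime.strptime(proc_start, FMT)
    for name, begin, end in task_windows:
        if datetime.strptime(begin, FMT) <= started <= datetime.strptime(end, FMT):
            return name
    return None

# Made-up example: the second process started before any task ran.
windows = [("task_12", "2018-06-01 10:00:00", "2018-06-01 10:20:00")]
print(owning_task("2018-06-01 10:05:00", windows))  # task_12
print(owning_task("2018-06-01 09:00:00", windows))  # None
```

A process whose launch time falls outside every task window is one the Slave has presumably lost track of.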

I will, this will be somewhere next week :slight_smile:

  1. Does your Nuke license server use Flexlm or RLM? What exact version of the binary are you using of either Flexlm or RLM?

  2. What “Usage Type” is your Deadline “Nuke” limit? I.e., go to the limit properties and check the top-right corner setting: Slave, Machine, or Task?

I regularly see sysadmins think a machine is checking out multiple Nuke licenses when they inspect the RLM license server log. In fact, RLM internally de-dupes multiple license requests from the same machine if the vendor in question uses a machine-centric licensing schema for their customers.

  1. RLM

  2. I have tried with both “slave” and “Machine”.
    Both have the same issue.

We mainly use concurrent tasks to render Nuke multi-threaded, depending on the complexity of the script.

Was this ever resolved? We are running Deadline 10.0.10.4 and seeing this intermittently. I haven’t been able to pin down what’s causing this.

Actually, I think the problem may have dried up but I’ll check with the rest of the support folks to be sure. I think there was a problem where Nuke could become orphaned and sit in the background secretly consuming a license. This seemed to be most common with a certain version of Nuke.

Which version of Nuke are you running at the moment? The problem became less pronounced pretty early in Deadline 10’s dev cycle, so you should be okay these days. If you’re on the latest service pack of Nuke and still seeing it, then we should go and dig back in.

We’re currently a bit behind on Nuke 10.0v1, but should be jumping up to latest 11 soon. I think the orphaned background process is likely the problem. I’ll see if I can get some kind of scan going to detect it.

Okay. I do think that 11 is going to help a lot here, so it may not be worth the code. Others can chime in here if they’ve seen improvements or not.

Is anyone still seeing this with Nuke 11? I’ve got a really weird Slave report from someone and it looks like on Linux, Nuke just detached and was floating around. The Slave seems to have forgotten it ever started it.
