Limit Groups don't work

im_thatoneguy · September 11, 2013, 12:35am

We have a limit group for our Nuke license, set to one license.

But if we have two Nuke jobs, I'll see 4 slaves trying to render Nuke, then erroring and failing because of licensing. It seems to be completely ignoring the Nuke limit group. Can I see which slave has the stub checked out?

rrussell · September 11, 2013, 8:35pm

We have fixed some limit group issues during the 6.1 beta, but we’re not sure if they are related to this specific problem. Are you running the 6.1 beta yet? If not, are you willing to upgrade to beta 4 (just released today) to see if you still have issues with your limit groups?

Cheers,

Ryan

Dave_Wortley · September 12, 2013, 9:42am

We've had similar issues. I'm not sure it's Deadline's problem; it might be Nuke not releasing the license. If you've got 4 licenses and 10 blades with Nuke installed, then after one finishes, another blade could pick up the task before the license has been released by the first blade.
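To make the suspected race concrete, here is a toy Python model. It assumes the limit group frees its stub the moment a task ends, while the license server keeps the license checked out for a few more seconds. `LINGER`, the timings, and both classes are made up for illustration; they are not Deadline or Flexlm internals.

```python
# Toy model of the race: the limit group frees its stub as soon as a task
# ends, but the license server keeps the license checked out a bit longer.
# LINGER and all timings are made-up values, not Deadline or Flexlm defaults.
INF = float("inf")
LINGER = 5  # assumed seconds a license stays checked out after the process exits


class LicenseServer:
    def __init__(self, seats):
        self.seats = seats
        self.release_at = []  # when each held license becomes free again

    def checkout(self, now):
        # Forget licenses whose linger period has elapsed.
        self.release_at = [t for t in self.release_at if t > now]
        if len(self.release_at) < self.seats:
            self.release_at.append(INF)  # held until the task ends
            return True
        return False

    def checkin(self, now):
        # The license is not freed immediately; it lingers for LINGER seconds.
        self.release_at.remove(INF)
        self.release_at.append(now + LINGER)


class LimitGroup:
    def __init__(self, limit):
        self.limit = limit
        self.in_use = 0

    def acquire(self):
        if self.in_use < self.limit:
            self.in_use += 1
            return True
        return False

    def release(self):
        self.in_use -= 1  # the stub is reusable at once


def second_task(pickup_time):
    """Slave A renders from t=0 to t=10; slave B picks up the next task later."""
    server, group = LicenseServer(seats=1), LimitGroup(limit=1)
    group.acquire()
    server.checkout(0)
    group.release()
    server.checkin(10)
    return group.acquire(), server.checkout(pickup_time)


print(second_task(11))  # (True, False): stub granted, license refused
print(second_task(16))  # (True, True): the license has been released
```

In this model the limit group is behaving correctly. The failure comes from the gap between the stub being returned and the license actually being freed.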

dwallbridge · September 12, 2013, 3:58pm

Definitely sounds like something to test and investigate I think.

im_thatoneguy · September 12, 2013, 4:39pm

We’re in the middle of a big project right now so I can’t upgrade the farm (it’s pretty much just chugging away 24/7) and I don’t want to be blamed for anything going wrong if it introduces new bugs.

dwallbridge · September 18, 2013, 7:08pm

Let us know how things go when you can.

NewJohnny · September 27, 2013, 1:20am

I was going to post this same thing today. The Foundry Flexlm floating license server is showing licenses in use even though nothing is rendering. Sometimes they're released, sometimes they're not. This was not an issue in Deadline 5, so something has changed. We do not have issues with other floating licenses or limit groups, just Nuke.

MikeOwen · September 27, 2013, 7:18am

Hi,
Are you on 6.1 beta 3 or later? (The limit group fixes were in beta builds 2 and 3.) Could you also check something for me: are you sure you are running Flexlm and not RLM as your Nuke license server? I believe The Foundry is slowly transitioning clients over to the newer RLM license server, especially those moving to Nuke 7.0x. That might explain why you are seeing the errors now and not when you were on Deadline v5 (i.e. not necessarily a difference between Deadline v5 and v6, but rather between Nuke 6 and Nuke 7, which run on different licensing systems). Could you also confirm which version of Flexlm or RLM you are running, together with the version of the "Foundry License Utility" (FLU) installed on your floating license server?

First, I would upgrade to the beta to test whether it solves the issue, and then take it from there…

If you are still on Flexlm, see here for another ongoing discussion about this issue:
viewtopic.php?f=11&t=10189&hilit=+flexlm#p44169

[i]Couple of thoughts here…

Ensure the end client is running the latest version of the Flexlm lmgrd license software. Users tend to install license server software and forget about it. Even downloading the latest FLT from The Foundry would be a good idea, especially as they have built a nice wrapper utility into the newer versions. Does this stop the licenses being released too slowly? The lmgrd software can be updated independently of the software vendor's binary, so maybe a newer version of Flexlm will resolve the issue. Worth a try…

OR --> ask the client to contact The Foundry and ask to be moved over to the RLM licensing system? Perhaps the combination of Nuke 7 and Flexlm doesn't work well anymore, and it is better to be on RLM?

You could have a RegEx handler for this exact error that runs a sleep command, BUT the error occurs right at the beginning of the job, as the ManagedProcess starts up. In this case, Nuke.exe hasn't really wasted any time on the task at hand, but it is adding 1 error to the overall error count. However, introducing a sleep() function could get you into a bad situation. The slave hits this error and sleeps, retries after 10 seconds or so, hits the error again, goes to sleep again… forever, whilst other jobs in the queue are waiting to be picked up? And if there is a license issue (misconfiguration) on this one slave, it rightly should fail rather than circle around between 10-second sleeps.
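The RegEx-and-sleep idea can be sketched in plain Python with a cap on attempts, which avoids the "sleep forever" loop described above. This is not Deadline's plugin API. The error pattern, `run_with_license_retries`, and its parameters are hypothetical, and the thread does not quote Nuke's actual license error text.

```python
import re
import time

# Hypothetical pattern: the real Nuke/Flexlm error text is not quoted in the thread.
LICENSE_ERROR = re.compile(r"(?i)license.*(not available|checkout failed)")


def run_with_license_retries(render, max_attempts=3, delay=10.0, sleep=time.sleep):
    """Call render() (which returns its stdout text), retrying only on a license error.

    Unlike an unbounded sleep-and-retry loop, this gives up after
    max_attempts, so a misconfigured slave still fails the task.
    """
    for attempt in range(1, max_attempts + 1):
        output = render()
        if not LICENSE_ERROR.search(output):
            return output
        if attempt < max_attempts:
            sleep(delay)
    raise RuntimeError(f"license error persisted after {max_attempts} attempts")


# Usage: a fake render that hits the license error twice, then succeeds.
outputs = iter(["Error: license not available",
                "Error: license not available",
                "Rendering frame 1"])
print(run_with_license_retries(lambda: next(outputs), sleep=lambda s: None))
# Rendering frame 1
```

Even with the cap, each retry still ties up the slave for `delay` seconds. That is the trade-off raised in the post.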
Flexlm has an "lmremove" command in its lmutil utility, which could be executed at "EndJob" of ALL Nuke jobs on ALL slaves (it needs to be like this to ensure 100% reliability), so that each slave cleans up the particular license it pulled from the license server. However, again, there are 2 major issues: (a) what happens if the license server is unavailable for a split second (network blip, restart, blue screen), and (b) if you Google it, you will see reports that repeated "lmremove" executions in quick succession over the network to your license server are prone to locking up the Flexlm process = BAD!
Enhance the "LimitGroup" functionality with a user-customisable sleep (variable) for each limit group. If a job gets this limit group's stub, it does a "hold" just before the ManagedProcess starts up. But you're slowing your farm down, and how long is long enough?
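The per-limit-group hold could behave roughly as follows: a stub released by one slave is not handed out again until a cooldown has elapsed. `CooldownLimitGroup` and its `cooldown` parameter are hypothetical and are not an existing Deadline setting. The injectable clock is only there so the example is deterministic.

```python
import time


class CooldownLimitGroup:
    """Hypothetical limit group whose stubs become reusable `cooldown` seconds after release."""

    def __init__(self, limit, cooldown, clock=time.monotonic):
        self.limit = limit
        self.cooldown = cooldown
        self.clock = clock
        self.in_use = 0
        self.cooling = []  # release times of stubs still cooling down

    def acquire(self):
        now = self.clock()
        # Stubs whose cooldown has elapsed become available again.
        self.cooling = [t for t in self.cooling if now - t < self.cooldown]
        if self.in_use + len(self.cooling) < self.limit:
            self.in_use += 1
            return True
        return False

    def release(self):
        self.in_use -= 1
        self.cooling.append(self.clock())


# Usage with a fake clock: one stub, 6-second cooldown.
now = [0.0]
group = CooldownLimitGroup(limit=1, cooldown=6, clock=lambda: now[0])
group.acquire()                      # slave A takes the stub at t=0
now[0] = 10; group.release()         # slave A finishes at t=10
now[0] = 11; print(group.acquire())  # False: stub is still cooling down
now[0] = 16; print(group.acquire())  # True: cooldown has elapsed
```

This only helps if the cooldown is longer than the license server takes to free a license. Choosing that value is the "how long is long enough?" question, and every second of cooldown leaves the stub idle.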
Another possible option:

Use the new ManagedProcess.AbortRender method to re-queue the task without adding to the error count, but only if this exact error has been identified (edit the Nuke plugin's stdout handler):

FranticX.Processes.ManagedProcess.AbortRender( string "message", AbortLevel Minor )

What Deadline doesn't currently have is a "re-queue" counter as part of "Job Settings" > "Failure Detection". My concern is that I want to know, via the normal job notification framework, if a job is receiving too many re-queues (the Nuke job could be going round in circles forever). I see this as yet more variables in the "Failure Detection" repo job settings: "Fail task after this many re-queues", "Fail job after this many re-queues" and, of course, "Send a Warning to the Job's User after it has generated this many re-queues".
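The proposed re-queue limits could be tracked per task and per job along these lines. The thresholds mirror the three settings suggested above, but `RequeuePolicy`, `RequeueTracker`, and the action strings are hypothetical, not Deadline options. The assumption is that a plugin's stdout handler would detect the license error, re-queue the task (e.g. via the AbortRender call mentioned in the post), and record it here.

```python
from dataclasses import dataclass, field


@dataclass
class RequeuePolicy:
    """Hypothetical "Failure Detection" thresholds for license re-queues."""
    warn_after: int = 5        # warn the job's user after this many job re-queues
    fail_task_after: int = 10  # fail a task after this many re-queues of that task
    fail_job_after: int = 50   # fail the job after this many re-queues overall


@dataclass
class RequeueTracker:
    policy: RequeuePolicy
    task_requeues: dict = field(default_factory=dict)
    job_requeues: int = 0
    warned: bool = False

    def on_requeue(self, task_id):
        """Record one license-related re-queue and return the actions to take."""
        self.task_requeues[task_id] = self.task_requeues.get(task_id, 0) + 1
        self.job_requeues += 1
        actions = []
        if not self.warned and self.job_requeues >= self.policy.warn_after:
            self.warned = True  # warn only once per job
            actions.append("warn_user")
        if self.task_requeues[task_id] >= self.policy.fail_task_after:
            actions.append("fail_task")
        if self.job_requeues >= self.policy.fail_job_after:
            actions.append("fail_job")
        return actions


tracker = RequeueTracker(RequeuePolicy(warn_after=2, fail_task_after=3, fail_job_after=4))
for task in [0, 0, 0, 1]:
    print(task, tracker.on_requeue(task))
# 0 []
# 0 ['warn_user']
# 0 ['fail_task']
# 1 ['fail_job']
```

Keeping the counters separate from the error count preserves the point of AbortRender: a license re-queue does not count as a render error, but it still cannot loop forever unnoticed.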

Option #5 at least introduces a framework which can be re-used to handle license issues across all plugins. I don’t like option #4 as it’s not as flexible as #5 for the future.[/i]