
Zombie Nuke processes left behind by Deadline

Hi all,

We had an issue with Deadline giving license errors on Nuke jobs. In the Limits panel for the Nuke license I could see only 20 licenses actively in use (we have 36 available), and yet we still had license errors. After checking our RLM server I can see multiple Nuke processes running on several of the machines, each using a license (I think). These jobs aren't actually processing; they appear to be zombie processes left behind after Deadline marks the job as completed. Something is preventing Deadline from killing the Nuke batch process, and it's messing with our license count. Any thoughts on why this might be happening?

Also, is there any way to get Deadline to restart machines on a schedule? As a workaround we are restarting the machines every evening to clear out the residual processes and licenses, but this is a pain to do manually. Does Deadline have a schedulable restart feature?
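In the meantime, the sort of thing I'm tempted to script instead of a full nightly reboot is a scheduled cleanup pass that kills orphaned Nuke processes. This is only a rough sketch using the third-party psutil module; the name match and the "parent is gone" test are assumptions about our setup, not anything official from Deadline:

import psutil

def is_nuke(proc):
    # Match anything whose executable name looks like Nuke.
    try:
        return "nuke" in proc.name().lower()
    except (psutil.NoSuchProcess, psutil.AccessDenied):
        return False

for proc in psutil.process_iter(attrs=["pid", "name"]):
    if not is_nuke(proc):
        continue
    try:
        parent = proc.parent()
    except psutil.NoSuchProcess:
        continue
    # Treat a missing parent (or re-parenting to PID 1 on Linux) as
    # "orphaned by the render process" and terminate it, escalating
    # to a hard kill if it doesn't exit within ten seconds.
    if parent is None or parent.pid == 1:
        print("Terminating orphaned Nuke process %d" % proc.pid)
        try:
            proc.terminate()
            proc.wait(timeout=10)
        except psutil.TimeoutExpired:
            proc.kill()
        except psutil.NoSuchProcess:
            pass

Run via cron or Task Scheduler; note it would also catch any interactive Nuke session whose parent has exited, so adjust the checks for your farm.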

Thanks, Tom

Actually, what we've found is that upgrading both Nuke and the RLM license server resolves this in all cases (so far). We're not quite sure how Nuke is surviving Deadline's request for it to die, or why Deadline isn't detecting that it isn't dead, but if the upgrade works, I think it's worth a try.

We're seeing a similar issue. Nuke 11.1v3, DL 10.0.6.3. Restarting the DL slave process (not the whole box) gets the Nuke process to stop and thus the license to drop.
The problem doesn't always happen. Anecdotally, it seems to happen more when running concurrent instances of Nuke (for speed).

What OS is it running on? I’m waiting on 10.0.21 which will make some changes to how processes are cleaned up on Windows.

We’ve seen it happen on Windows 8, RHEL 6 and CentOS 7.

Been having the same issue on Windows 10, Deadline 10.0.18.1, with Nuke 10.5, Nuke 11, and Nuke 11.2v2. It's been hard to diagnose what causes the issue; it seems to happen more often when there are many Nuke jobs or a lot of concurrent task usage on the farm.
We're using FLU 7.3v1 64-bit on Windows Server 2016 (I think this is the latest FLU release?).
I believe the FLU's RLM build is v12.2 (build 2).

Update: I did notice that the Foundry has an entry about how to upgrade to a later version of RLM:

Latest RLM for Windows Server is rlm.v12.4BL2 - Do you recommend trying this?
Thanks,
-Jake

It's helped some folks in the past, especially those upgrading from Nuke 10 and… a version of RLM I can't remember offhand, and particularly those hitting it on Linux.

After upgrading to RLM 12.4BL2, we are still experiencing this issue.

Can you check which builds of Nuke are being orphaned? The easy way to check is the "Open File Location" right-click option in the "Details" tab of Task Manager.
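If you'd rather script that check across render nodes than click through Task Manager on each one, a rough equivalent (using the third-party psutil module; the name match is an assumption about your executable names) would be:

import psutil

# Print the PID and executable path of every running Nuke process,
# which tells you exactly which build each leftover process came from.
for proc in psutil.process_iter(attrs=["pid", "name"]):
    if "nuke" in (proc.info["name"] or "").lower():
        try:
            print(proc.pid, proc.exe())
        except (psutil.AccessDenied, psutil.NoSuchProcess):
            pass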

I haven’t yet had a chance to monitor the problem as it seems to have gone away!

We came across an issue that may (or may not) have been contributing to this.
From Nuke Log:
10/09 11:57 (rlm) Web server starting on port 4102
10/09 11:57 (rlm) Using TCP/IP port 4101
10/09 11:57 (rlm) … error binding UDP port 5053, port in use
10/09 11:57 (rlm) This is probably due to another copy of RLM running
10/09 11:57 (rlm) While not fatal, this instance of RLM won’t respond
10/09 11:57 (rlm) to broadcast requests.
10/09 11:57 (rlm) Starting ISV server foundry on port 54400

We removed the conflicting RLM process that was using port 5053. Since then, things seem to be working.
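If anyone else needs to work out which process is sitting on that port before RLM starts, here's a quick sketch with the third-party psutil module (the port number is just the one from our log; recent psutil versions expose laddr.port, and listing connections may need elevated privileges):

import psutil

# Report whatever is already bound to UDP 5053, the RLM broadcast port
# mentioned in the log above.
for conn in psutil.net_connections(kind="udp"):
    if conn.laddr and conn.laddr.port == 5053:
        name = psutil.Process(conn.pid).name() if conn.pid else "unknown"
        print("UDP 5053 is held by PID %s (%s)" % (conn.pid, name))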

Recently had a new rash of incidents with this. Currently running 10.0.25.2 on CentOS 7 and Windows 8/10. The shutdown procedure for plugins isn't very clear to me; I've tried adding print statements where I expect the code to go at the end of a job, but I'm not seeing them in the logs.
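For context, this is roughly where I've been putting the prints: a stripped-down sketch of the usual DeadlinePlugin callback layout, with illustrative names rather than the shipped Nuke.py (check the copy in your plugin repository for the real hooks):

from Deadline.Plugins import *

def GetDeadlinePlugin():
    return ExamplePlugin()

def CleanupDeadlinePlugin(deadlinePlugin):
    deadlinePlugin.Cleanup()

class ExamplePlugin(DeadlinePlugin):
    def __init__(self):
        # Hook the plugin lifecycle callbacks.
        self.StartJobCallback += self.StartJob
        self.RenderTasksCallback += self.RenderTasks
        self.EndJobCallback += self.EndJob

    def Cleanup(self):
        del self.StartJobCallback
        del self.RenderTasksCallback
        del self.EndJobCallback

    def StartJob(self):
        self.LogInfo("StartJob: starting up")

    def RenderTasks(self):
        self.LogInfo("RenderTasks: task running")

    def EndJob(self):
        # End-of-job shutdown should land here; LogInfo output goes to the
        # slave/job log, so this is where I'd expect my messages to show up.
        self.LogInfo("EndJob: shutting down")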

Happening with Nuke 11 and 12. Restarting the DL slave process clears it up.
