machine limits problem

anon5658894 · March 6, 2015, 11:49am

Hi,

Running 7.0.2.3R on Ubuntu 12.04 & OSX

The problem I’m having is twofold I think.

We’re running a couple of macs for QT generation & I’m running 3 slaves on each. Firstly, when the slaves start up, they don’t necessarily start up with the same machine name. Some have the domain & some have .local appended to the hostname.

I know I can add in the “Host Name Override” in the slave properties, but it’d be handy to be able to set this in a config file somewhere, or to know if there’s another way of avoiding this problem.

This obviously isn’t great for the machine limits that we have set up, but even so they don’t appear to be behaving as expected. We have a machine limit of 3 set up even though there are only 2 machines in the pool, and I still seem to have to reset the limit because it maxes out even when there are only two machines listed as Stub Holders - screenshots below.

rrussell · March 6, 2015, 3:33pm

That’s really strange that holly-3 would be behaving differently than holly or holly-2. It’s also not showing the Last Status Update data like the rest of them. Normally, we would think there is a version mismatch here, but since it’s running on the same machine as the other two…

Do they always start up with the same machine name (ie: holly and holly-2 always have holly.local, and holly-3 always has holly.nvizible.com)? Or is it random?

This issue might be related to the limit problem, so let’s focus here first and try and figure out what’s going on.

Thanks!
Ryan

anon5658894 · March 6, 2015, 7:00pm

I’m not really sure if they’re coming up the same each time. I just rebooted them to see & they all came up as .local, but one slave on each of the machines hasn’t come up - they’re showing as stalled.

I tried starting the “stalled” slaves from the monitor, but they won’t start, despite having this in the logs:

2015-03-06 18:35:50: Launcher Thread - Received command: LaunchSlave holly-2
2015-03-06 18:35:50: Local version file: /Applications/Thinkbox/Deadline7/Resources/Version
2015-03-06 18:35:50: Network version file: /mount/apps/noarch/thinkbox/DeadlineRepository7/bin/Mac/Version
2015-03-06 18:35:50: Comparing version files…
2015-03-06 18:35:50: Version files match
2015-03-06 18:35:50: Launcher Thread - Responded with: Success|

If I look in here on the macs - /Users/Shared/Thinkbox/Deadline7/slaves/ I see these ini files -

-rwxrwxrwx 1 root wheel 14B 8 Nov 11:38 holly.ini
drwxrwxrwx 8 root wheel 272B 1 Dec 15:43 …
-rw-r--r-- 1 root wheel 58B 2 Mar 16:25 2.ini
-rw-r--r-- 1 root wheel 14B 6 Mar 13:03 holly-2.ini
-rw-r--r-- 1 root wheel 58B 6 Mar 18:28 .ini
-rw-r--r-- 1 root wheel 58B 6 Mar 18:28 3.ini

Previously when I’ve had these issues I delete the ini file & restart the slaves, & that seems to fix it, but once I did that I got this:

holly & holly-2 are actually both .nvizible.com, but I swear holly changed to holly.local before my eyes, as I took the screenshot…

rrussell · March 6, 2015, 7:18pm

On this machine, close ALL slave applications that are currently running (use the Activity Monitor to make sure you got them all). Then delete holly.ini and holly-2.ini and reboot.

I think the problem is that Deadline is treating holly.ini and .ini as the same slave, but starts up a separate instance for each of them. Because they have the same slave name, they update the same row in the slave list in the Monitor, and that’s why you’re seeing the host name change randomly in the Monitor.

For the setup you want, you should only have the following ini files:
.ini
2.ini
3.ini

If holly.ini and holly-2.ini come back by themselves, then we need to figure out why.

Perhaps going forward, if the .ini file name starts with the machine name, we should just ignore it…
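A minimal sketch of that check, for anyone wanting to audit a slaves folder by hand. It assumes the naming convention implied in this thread (".ini" is the default slave named after the machine, and "2.ini" is the slave "holly-2"); the function name is hypothetical and this is not how Deadline itself reads these files.

```python
def classify_slave_ini_files(filenames, machine_name):
    """Split slave .ini files into expected ones and likely duplicates.

    Assumption taken from this thread, not from Deadline's code:
    ".ini" maps to the default slave (named after the machine) and
    "<suffix>.ini" maps to the slave "<machine>-<suffix>". Files whose
    name is the machine name, or starts with "<machine>-", are suspect.
    """
    expected = {}
    suspect = []
    for filename in sorted(filenames):
        if not filename.endswith(".ini"):
            continue  # not a slave settings file
        stem = filename[: -len(".ini")]
        if stem == machine_name or stem.startswith(machine_name + "-"):
            suspect.append(filename)
        elif stem == "":
            expected[filename] = machine_name
        else:
            expected[filename] = f"{machine_name}-{stem}"
    return expected, suspect
```

Passing it the file names from the listing above with "holly" as the machine name would flag holly.ini and holly-2.ini as suspect and map .ini, 2.ini and 3.ini to holly, holly-2 and holly-3.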

Cheers,
Ryan

anon5658894 · March 12, 2015, 11:31am

I did as you described for holly and got these files after restarting -

-rw-r--r-- 1 root wheel 58 7 Mar 21:10 .ini
-rw-r--r-- 1 root wheel 58 7 Mar 21:10 1.ini
-rw-r--r-- 1 root wheel 58 7 Mar 21:10 2.ini
-rw-r--r-- 1 root wheel 58 7 Mar 21:10 3.ini

That ended up giving me these slaves:
holly - holly.local
holly-1 - holly.local
holly-2 - holly.local
holly-3 - holly.nvizible.com

The holly-1 slave was a new slave, in that it wasn’t in the right pool & wasn’t set up as the other slaves.

I’ve just tried it again with holly & it seems to have come up OK this time - with just .ini, 2.ini & 3.ini. They still come up with the machine name with a .local after it. This means I have to set the hostname/ip address override in the slave properties in order to be able to connect to it from the monitor. It’d be nice not to have to do that as well.

So, it doesn’t seem to be consistent.

We currently have 6 slave processes running on the 2 machines (no extraneous ones running at the moment) and we’re still getting the machine limit being max’d out.

rrussell · March 12, 2015, 12:53pm

I wonder if this host name issue only comes up after a reboot. What happens if you manually close the three slaves and then start them up again (without rebooting the machine)? Perhaps the machine hasn’t acquired its DNS host name yet when some slaves start up after a reboot?
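One rough way to test that theory is to record what the operating system reports as its host name shortly after boot and again a few minutes later. The sketch below uses only Python's standard socket module; the helper names are hypothetical, and it says nothing about how Deadline itself resolves the machine name.

```python
import socket


def classify_host_name(name):
    """Label a host name as 'short', 'mdns' (ends in .local) or 'domain'."""
    name = name.rstrip(".").lower()
    if "." not in name:
        return "short"
    if name.endswith(".local"):
        return "mdns"
    return "domain"


def snapshot_host_names():
    """Return what the OS currently reports, with a label for each value."""
    names = {
        "gethostname": socket.gethostname(),
        "getfqdn": socket.getfqdn(),
    }
    return {source: (value, classify_host_name(value))
            for source, value in names.items()}
```

Calling snapshot_host_names() right after login and comparing it with a later call would show whether the name moves from a .local form to the DNS domain form once the network is fully up.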

We’ll run some more tests with the Machine level Limits as well to see if we can reproduce this problem on our end.

Cheers,
Ryan

rrussell · March 12, 2015, 4:16pm

Another question: In the Slave Settings panel on the Repository Options, do you have the “Use Fully Qualified Domain Name” option enabled or disabled?
docs.thinkboxsoftware.com/produc … e-settings

Thanks!

rrussell · March 12, 2015, 4:37pm

One more thing: In super user mode in the Monitor, right-click on the Maxed out “nuke_cnv” limit and select Reset Limit Usage Count. After doing that, does the problem happen again?

Just wondering if maybe things got in a bad state somehow, and if a reset will help…

anon5658894 · March 12, 2015, 4:57pm

So, I tried to stop the slaves from the monitor & it wouldn’t work. I had to kill the slave processes. I started them again from the monitor & up came 3 processes as below:

0 14194    62   0  4:43pm ??         0:12.42 DeadlineSlave7 /Applications/Thinkbox/Deadline7/Resources/deadlineslave.exe -nogui -name holly
0 14196    62   0  4:43pm ??         0:14.17 DeadlineSlave7 /Applications/Thinkbox/Deadline7/Resources/deadlineslave.exe -nogui -name holly-2
0 14198    62   0  4:43pm ??         0:14.11 DeadlineSlave7 /Applications/Thinkbox/Deadline7/Resources/deadlineslave.exe -nogui -name holly-3
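For checking this on several machines, here is a small sketch that pulls the slave names out of ps output like the lines above. It assumes the command line always contains "deadlineslave.exe" followed by a "-name" argument, as it does in this listing; the function names are hypothetical.

```python
import re
from collections import Counter

# Matches the "-name <slave>" argument on a deadlineslave.exe command line.
SLAVE_NAME_PATTERN = re.compile(r"deadlineslave\.exe\b.*?\s-name\s+(\S+)")


def slave_names_from_ps(ps_output):
    """Return the slave names found in ps output, in the order they appear."""
    names = []
    for line in ps_output.splitlines():
        match = SLAVE_NAME_PATTERN.search(line)
        if match:
            names.append(match.group(1))
    return names


def duplicated_slave_names(ps_output):
    """Return names that appear on more than one running process."""
    counts = Counter(slave_names_from_ps(ps_output))
    return sorted(name for name, count in counts.items() if count > 1)
```

A non-empty result from duplicated_slave_names would point at two processes sharing one slave name, which is the situation suspected earlier in the thread.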

And we now have these ini files:

-rw-r--r-- 1 root wheel 60 12 Mar 16:43 .ini
-rw-r--r-- 1 root wheel 60 12 Mar 16:43 2.ini
-rw-r--r-- 1 root wheel 60 12 Mar 16:43 3.ini
-rw-r--r-- 1 root wheel 14 12 Mar 16:15 holly-2.ini
-rw-r--r-- 1 root wheel 14 12 Mar 16:15 holly-3.ini
-rw-r--r-- 1 root wheel 14 12 Mar 16:15 holly.ini

Their machine name in monitor is holly.nvizible.com

The “Use Fully Qualified Domain Name” option is disabled.

Resetting the limit usage count usually seems to work for a short while, and we get some more tasks picked up, but the problem returns eventually.

rrussell · March 12, 2015, 5:09pm

That’s really strange how those .ini files come back. I’ve tested manually starting the slaves through the Monitor and I can’t reproduce this (I have 3 slaves configured on this machine).

Can you double check the Deadline version on these Macs? The slave list in the Monitor should show the version number.

Also, have these Macs ever had an older version of Deadline 7 installed (ie: a beta version)? If so, I wonder if an automatic upgrade went bad or something. Maybe try stopping the slaves, deleting those ini files, and then running the Client Installer to see if a reinstall fixes the issue.

anon5658894 · March 13, 2015, 4:48pm

The version listed in the monitor for all of the slaves is v7.0.2.3. These machines have had beta versions on them previously.

I’ll perform a re-install when the machines are less busy than they are now - so probably by Monday.

Cheers
Andrew

anon5658894 · April 2, 2015, 9:37am

Hi

I eventually got around to this. It looks like the re-install did the trick - 1 set of ini files on each machine - .ini, 2.ini & 3.ini. And, so far, it seems like the machine limits are behaving themselves.

Cheers
Andrew

anon5658894 · May 13, 2015, 8:00am

Hi,

I’m afraid I’m still getting this issue.

It seems a lot more stable & usually the slaves come up on boot as I’d expect them to. I am still getting issues with the limits though - I’ll attach some screen grabs of weird limit behaviour below.

So, we still have to reset the limit usage count every now and again because these jobs back up on the farm.

I am contemplating upgrading to 7.1 in the very near future. I checked through the release notes, but couldn’t spot anything specific to this issue, though there is some stuff in there that looks like it may solve it… I’ll post again if the issue persists after the upgrade.

Cheers
Andrew

rrussell · May 13, 2015, 1:02pm

We also believe that some things that changed in 7.1 might help you here. Definitely let us know if you still have this problem after upgrading and we can continue to investigate further.

Cheers,
Ryan

anon5658894 · June 3, 2015, 2:10pm

Here’s an update after upgrading to 7.1 (7.1.0.35). The slaves are starting up consistently & the machine name is coming up as expected, but I’m still maxing out on the machine limit even though the number of machines is 2 & the limit is set at 4.

jgaudet · June 3, 2015, 5:14pm

After looking at the code a bit, I think I’ve found a couple of issues that could lead to this state with machine-wide limits. I just wanted to confirm a couple of things with you before I go about fixing them.

Could you confirm that running a “Repository Repair” (Tools -> Run Repository Repair in the Monitor) once the Limits get out of whack does not fix it? I suspect it won’t, but I wanted to make sure.

Can you keep an eye out (or maybe look in some of the older Slave logs around the time where this has happened) for errors that say something like “Failed to check out Stub for Limit with name […]”? If you find any of those, and post them here it’d be super helpful!
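If it helps, a search like that can be scripted. The sketch below is only an illustration: the log folder is passed in because its location is not given here, the "*.log" pattern is an assumption, and the exact wording of the error may differ from the text quoted above.

```python
from pathlib import Path

# Text quoted in this thread; the real message may be worded differently.
STUB_ERROR_TEXT = "Failed to check out Stub for Limit with name"


def find_stub_errors(log_dir, pattern="*.log"):
    """Return (file name, line number, line) for each matching log line.

    log_dir and pattern are assumptions; point them at wherever the
    Slave logs are kept on your machines.
    """
    hits = []
    for log_path in sorted(Path(log_dir).glob(pattern)):
        with log_path.open(encoding="utf-8", errors="replace") as handle:
            for line_number, line in enumerate(handle, start=1):
                if STUB_ERROR_TEXT in line:
                    hits.append((log_path.name, line_number, line.rstrip("\n")))
    return hits
```

Each hit includes the file name and line number, so the surrounding log lines can be copied into a reply here.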

Cheers,
Jon

anon5658894 · June 4, 2015, 2:27pm

Hi,

Running the repository repair doesn’t fix the issue - not sure whether it might do straight away, but I reset the limit usage.

We have the issue now anyway. There are none of those messages in the logs at the moment. They have happened previously, but not now while the issue is occurring.

Cheers
Andrew

jgaudet · June 4, 2015, 4:22pm

Alright, it sounds like it is the issue I was able to reproduce on my end.

I’ve made a fix for this internally, which will be in the 7.2 beta when that starts up. As part of the same change I’ve also made sure that Repository Repair will at least fix this if it comes up again, so that at the very least you won’t need to reset the counts manually anymore.

Cheers,
Jon

anon5658894 · June 4, 2015, 4:29pm

Great, thanks for that! We’ll keep an eye out for the beta.

rrussell · June 8, 2015, 3:21pm

Just an update that this limit issue should be fixed in the 7.1.2 release candidate that we uploaded to the beta forums this morning:
forums.thinkboxsoftware.com/vie … 04&t=13464

There were a couple other bugs that we wanted to fix for this 7.1 patch release, and we figured it would be a good idea to backport this limit fix as well. Give it a try and let us know if you still run into any issues.
