Remove Pools and Groups - Bug Report

Hi,

It seems that changing the pools or groups for machines is a bit buggy in newer versions of Deadline - currently, if you try to remove a pool (it appears to work in the ‘Manage Pools…’ window) and hit OK, the change doesn’t get pushed to the machines, and the same happens when trying to remove groups.

The only workaround I can see is to remove all the pools from the machines you want to change, add a new one, hit OK, then add the original pools back in manually from scratch while removing the new pool/group. It’s all very fiddly and frustrating, and obviously much more tedious across a number of machines that are all configured differently. Adding new pools/groups to machines works just fine, as does promoting/demoting; it’s just removal that is the issue.

It seems this bug has carried over from Deadline 6 to 6.1 - does anyone have any ideas?

Thanks in advance!

We will have to do some internal testing to see if we can replicate this issue. Can you confirm that you are running the release version? Thanks.

Thanks for your reply - yes, we’re currently running the release version.

Hello,

Ok, so I just did a cursory check, and using Manage Pools and Manage Groups, I was able to add and remove machines from the respective classifications. The wording of your post seems to be a bit confusing, though, as it sounds like you used a different method to remove them. Can you clarify? Also, you say that the changes don’t get pushed to the machines. Do you mean that when changes are made, machines continue pulling from pools they are no longer assigned to? Even after the current task/job is done? If you could provide a bit more info on what is happening we can look deeper. Thanks.

Hi,

So I just installed a fresh database locally and tried to recreate the problem, and found that it works as it should - so I guess it’s something to do with our heavily populated database. Is there anything we could check that might cause behaviour like fields not updating as they should? Or anything you’ve found that helps alleviate these issues on large databases?

Thanks

How many slaves do you guys have? We’ve recently uncovered a potential issue with the way the Monitor loads in the slave data. It first loads the slave states, and then loads the slave settings (the latter contains the pools and groups). Between loading the slave states and loading the slave settings, the slaves have a “default” slave settings object, which can cause unexpected behavior if things like pools or groups are set before the actual settings are loaded in.

Because the Monitor loads slave data in batches of 200, this can only affect farms that have more than 200 slaves. We’re addressing it in 6.2 by loading the slave settings objects first, and the slave states afterwards. 6.2 should be going into beta very soon, so it would be interesting to see if this helps your problem.

Also, when you guys witness this behavior, does anything show up in the Console panel in the Monitor? You can open it from the View -> New Panel menu. Maybe there are errors that are occurring, and it’s not actually related to the issue in the paragraph above.

Another thing we could do is have you export your slave states, settings, and pools/groups from your database and send it to us. We can then import it into a clean database here and try to reproduce the problem that way. To do this, open a command prompt or terminal on your database machine, change directories to your mongodb bin folder, and run these commands:

mongoexport -d deadlinedb -c SlaveInfo -o slaveinfo.txt
mongoexport -d deadlinedb -c SlaveSettings -o slavesettings.txt
mongoexport -d deadlinedb -c DeadlineSettings -o deadlinesettings.txt

Note that these commands assume your database name is ‘deadlinedb’, so change that if necessary. You can then zip up these files and email them to support:
thinkboxsoftware.com/support/
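
(For reference, on our end we would then pull the exports into a clean test database with the matching mongoimport commands, roughly along these lines, assuming the same database and file names as above:)

mongoimport -d deadlinedb -c SlaveInfo --file slaveinfo.txt
mongoimport -d deadlinedb -c SlaveSettings --file slavesettings.txt
mongoimport -d deadlinedb -c DeadlineSettings --file deadlinesettings.txt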

Cheers,
Ryan

Hey Ryan,

We currently have just under 300 machines, so that could definitely be the issue. Once the beta is released we can test to see if the issue resolves itself. Just to rule it out, I checked the Console and no errors come up when I try to remove the pools. I’ll need to double-check that there are no issues with exporting our settings from the database, but once I get the go-ahead I will send everything over for proper analysis.

Thank you so much for your help!

Looking to piggyback off this thread and hopefully get some help here. Last week our database just decided to freak out, and now the Monitor no longer sees any of our slaves. If we submit a job it will still make its way to the slaves, but I can’t make any changes to pools or groups and I can’t manually move jobs if necessary. We’ve only got about 70 slaves, though. Wondering if you’ve ever seen anything like this or have any ideas for a solution. We’re still running 6.0.

Thanks

Forgot to include this. We’re getting this error in the console:

2014-03-18 15:15:32: Error occurred while updating slave cache: QueryFailure flag was BSONObj size: -286331154 (0xEEEEEEEE) is invalid. Size must be between 0 and 16793600(16MB) First element: Arch: "x64" (response was { "$err" : "BSONObj size: -286331154 (0xEEEEEEEE) is invalid. Size must be between 0 and 16793600(16MB) First element: Arch: "x64"", "code" : 10334 }). (MongoDB.Driver.MongoQueryException)

Thanks

Hello ModernLeper,

Not sure how your issue is connected with the OP’s issue, but it sounds like you might need to run a repair on your MongoDB. If you want to give us a call, we can do a remote session to help with the repair, or if you know Mongo well enough, you can run it yourself. Basically, it means running your current mongod command with --repair appended.
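
For reference, a rough sketch of that (with the Deadline services and the mongod service stopped first, and assuming a hypothetical data directory of /path/to/your/dbpath - adjust to your install):

mongod --dbpath /path/to/your/dbpath --repair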

Sorry, I guess I was just in too much of a hurry skimming the thread and thought the OP had the same disappearing-slaves issue as me, rather than an issue with group and pool management. Anyway, your tip to repair the database did the trick and the slaves all returned to the Monitor.

Thanks

Hello,

I have the same problem, but for ‘Groups’ only. (Strangely, removing a Pool works from the Monitor, but removing a Group does not.)
I read the thread, but I couldn’t work out the right way to fix the problem.

So how do I remove a group from a slave machine?

I have this idea:

mongoexport -d deadlinedb -c SlaveInfo -o slaveinfo.txt

Fix the JSON

mongoimport -d deadlinedb -c SlaveInfo --upsert < slaveinfo.txt
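
At minimum, I would keep an untouched copy of the export as a backup before re-importing (same export command as above, just a different output file):

mongoexport -d deadlinedb -c SlaveInfo -o slaveinfo_backup.txt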

That seems straightforward, but maybe risky.
Would that work?

Thanks

Hello,

The only issue I am aware of with pools and groups was a bug, fixed in 6.1 I believe, that prevented machines from being added to or removed from pools or groups while the ‘exclude none’ setting was set for the machine in its slave settings. Can you give us some more information on your issue? Unless you are very familiar with using mongo at the command line, I would advise against that approach.