Balancer not recognizing active instances

sipickles · February 23, 2016, 4:25pm

Hi,

We’re using 8.0.0.50-beta on CentOS.

I’ve been working on this bug for a few days. We have a custom ‘cloud’ plugin which worked in 7.1.2.1, but doesn’t in 8 (tried in 8.0.0.50+).

During the balancer cycle, a single host is requested. This is returned as a list from CreateInstances. Then, on the next cycle, GetActiveInstances finds this new host, and returns it.

However, for some reason, Balancer does not see the active instance as valid, and tries to start another one. This repeats endlessly each cycle.

I wrote the simplest possible cloud plugin, which just mocks up the host creation (i.e., no external systems involved). It is still broken!

Here’s the balancer output:

Worker Thread (554): Executing the Algorithm: DefaultAlgorithm...
Creating Balancer State Struct
Computing Demand
Group (nukex): TaskCount: (10.0) Weight: (500.0)
Determining the Available Resources
Computing Targets
Populating Targets
Group: nukex Target: 1
Group: modo Target: 0
Algorithm (554): Targets:
Algorithm (554): Region: forge
Algorithm (554): Group: nukex: 1
Algorithm (554): Group: modo: 0
Worker Thread (554): Equalizing targets...
Equalizer (554): Region: forge : Enabled
PYTHON: *** GetActiveInstances ***
PYTHON: *** populate_CloudInstance ***
PYTHON: ---- ID 123456
PYTHON: ---- Name ForgeTestInstance
PYTHON: ---- Hostname ForgeTestHost
PYTHON: ---- InstanceState 2
PYTHON: ---- ImageID ForgeTestImage
PYTHON: ---- HardwareID ForgeTestHardware
PYTHON: ---- PublicIP
PYTHON: ---- Provider Forge
PYTHON: ---- GroupName nukex
PYTHON: ---- RegionName forge
PYTHON: ---- Zone forge
PYTHON: ------ Appending 1793703675
PYTHON: ------ Active 1
Equalizer (554): Group: modo: Enabled 0 / 0
Equalizer (554): Group: nukex: Enabled 0 / 1
Equalizer (554): Requesting 1 new instances be started. # <<<<<<<<<< WRONG!
PYTHON: *** CreateInstances ***
PYTHON: *** populate_CloudInstance ***
PYTHON: ---- ID 123456
PYTHON: ---- Name ForgeTestInstance
PYTHON: ---- Hostname ForgeTestHost
PYTHON: ---- InstanceState 2
PYTHON: ---- ImageID ForgeTestImage
PYTHON: ---- HardwareID ForgeTestHardware
PYTHON: ---- PublicIP
PYTHON: ---- Provider Forge
PYTHON: ---- GroupName nukex
PYTHON: ---- RegionName forge
PYTHON: ---- Zone forge
PYTHON: Starting 1793703675

As you can see, the hash from the instance which is started is IDENTICAL to the hash of the instance which is returned by GetActiveInstances.

Is this really a bug?

simple.py plugin attached as text file
simple.py.txt (3.27 KB)

sipickles · February 23, 2016, 4:49pm

Ok,

I’ve found that CloudInstance no longer has an attribute ‘InstanceState’. It’s now ‘Status’. I missed that in the release notes.

Still same problem. Updated plugin below. Perhaps my test case is too contrived.
simple2.py.txt (3.25 KB)

eosiowy · February 23, 2016, 5:04pm

Hey Simon,

I’ve taken a look at your simple plugin and what you’re experiencing is actually expected behavior. Let me explain.

The first Balancer cycle starts. The Balancer checks all the running instances and compares them to an internal list of instances it’s created. We do this to make sure that we have an accurate picture of the instances we’ve started. In our simple plugin case, we haven’t started any instances yet but we see 1 instance out there. That’s fine. The Balancer doesn’t touch instances it hasn’t created so it’s ignored. Then we determine our targets and we see that we need to start 1 instance. Which is true, the Balancer hasn’t started any instances yet and there’s work to be done so we need to start an instance. Now the Balancer records the info for our newly created instance to our internal list.

Then the second Balancer cycle happens. This time the result is different. We check our active instances and compare them to our internal list and, what do you know, there’s a match. Looks like the instance the Balancer started is still there. Then we determine our targets; the target is still 1 because there’s still work to do. Finally we move on to the Equalizing step, where the Balancer sees that it has already started 1 instance. That fits our target, so we don’t need to start any more.
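The two cycles described above can be modelled with a small toy simulation. This is only a sketch of the bookkeeping as explained in this post, not Deadline's actual code: `ToyBalancer`, `get_active_instances` and `create_instance` are hypothetical names, and the mock plugin is assumed to always report the same instance ID, as simple.py does.

```python
from dataclasses import dataclass, field


@dataclass
class ToyBalancer:
    """Toy model of the Balancer bookkeeping (hypothetical, not Deadline code)."""
    target: int
    started_ids: set = field(default_factory=set)  # instances this balancer created

    def cycle(self, active_ids, create_instance):
        # Only active instances that this balancer started count towards the
        # target; instances it did not create are ignored.
        owned = self.started_ids & set(active_ids)
        missing = max(self.target - len(owned), 0)
        for _ in range(missing):
            # Record every newly created instance in the internal list.
            self.started_ids.add(create_instance())
        return missing


def get_active_instances():
    # Mock plugin: always reports the same instance, even before it was "created".
    return ["123456"]


def create_instance():
    # Mock plugin: "creating" an instance just returns the same fixed ID.
    return "123456"


balancer = ToyBalancer(target=1)
requests = [balancer.cycle(get_active_instances(), create_instance) for _ in range(3)]
print(requests)  # [1, 0, 0]
```

In this model a new instance is requested on the first cycle only. A log that shows a request on every cycle would correspond to `started_ids` never retaining the created instance between cycles.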

Here you can see my Balancer after multiple cycles. It only tries to create an instance the first time.

Hopefully that explains what you’re seeing. If you have any more questions please let me know.

Thanks,
Eric Osiowy

sipickles · February 23, 2016, 5:25pm

Hi Eric

Thanks for the explanation.

Interestingly, when you run that plugin it works! When I run it, I get very different output. It keeps trying to start a new slave; it never matches the internal state. Here is the log output from a few balancer cycles:

Worker Thread (665): Updating the State...
State Cache (665): Setup...
State Cache (665): Populating Cloud Regions...
State Cache (665): Finished populating 1 Cloud Regions.
State Cache (665): Populating Slave data...
State Cache (665): Finished populating 1 Slaves.
State Cache (665): Populating Limits data...
State Cache (665): Finished populating 1 Limit groups.
State Cache (665): Populating Job data...
State Cache (665): Finished populating 1 Jobs.
State Cache (665): Populating Groups data...
State Cache (665): Finished populating 2 Groups.
Worker Thread (665): Executing the Algorithm: DefaultAlgorithm...
Creating Balancer State Struct
Computing Demand
Group (nukex): TaskCount: (10.0) Weight: (500.0)
Determining the Available Resources
Computing Targets
Populating Targets
Group: nukex Target: 1
Group: modo Target: 0
Algorithm (665): Targets:
Algorithm (665): Region: forge
Algorithm (665): Group: nukex: 1
Algorithm (665): Group: modo: 0
Worker Thread (665): Equalizing targets...
Equalizer (665): Region: forge : Enabled
PYTHON: *** GetActiveInstances ***
PYTHON: *** populate_CloudInstance ***
PYTHON: ---- ID 123456
PYTHON: ---- Name ForgeTestInstance
PYTHON: ---- Hostname ForgeTestHost
PYTHON: ---- InstanceStatus 2
PYTHON: ---- ImageID ForgeTestImage
PYTHON: ---- HardwareID ForgeTestHardware
PYTHON: ---- PublicIP
PYTHON: ---- Provider Forge
PYTHON: ---- GroupName nukex
PYTHON: ---- RegionName forge
PYTHON: ---- Zone forge
PYTHON: ------ Appending 1793703675
PYTHON: ------ Active 1
PYTHON: returning [<Deadline.Cloud.CloudInstance object at 0x7fb432f56518>]
Equalizer (665): Group: modo: Enabled 0 / 0
Equalizer (665): Group: nukex: Enabled 0 / 1
Equalizer (665): Requesting 1 new instances be started.
PYTHON: *** CreateInstances ***
PYTHON: *** populate_CloudInstance ***
PYTHON: ---- ID 123456
PYTHON: ---- Name ForgeTestInstance
PYTHON: ---- Hostname ForgeTestHost
PYTHON: ---- InstanceStatus 2
PYTHON: ---- ImageID ForgeTestImage
PYTHON: ---- HardwareID ForgeTestHardware
PYTHON: ---- PublicIP
PYTHON: ---- Provider Forge
PYTHON: ---- GroupName nukex
PYTHON: ---- RegionName forge
PYTHON: ---- Zone forge
PYTHON: Starting 1793703675
PYTHON: returning [<Deadline.Cloud.CloudInstance object at 0x7fb432f56518>]
Worker Thread (12): Waiting 20 seconds before next cycle.
Worker Thread (666): House Keeping...
HouseKeeper (666): Beginning house keeping...
HouseKeeper (666): House keeping completed.
Worker Thread (666): Updating the State...
State Cache (666): Setup...
State Cache (666): Populating Cloud Regions...
State Cache (666): Finished populating 1 Cloud Regions.
State Cache (666): Populating Slave data...
State Cache (666): Finished populating 1 Slaves.
State Cache (666): Populating Limits data...
State Cache (666): Finished populating 1 Limit groups.
State Cache (666): Populating Job data...
State Cache (666): Finished populating 1 Jobs.
State Cache (666): Populating Groups data...
State Cache (666): Finished populating 2 Groups.
Worker Thread (666): Executing the Algorithm: DefaultAlgorithm...
Creating Balancer State Struct
Computing Demand
Group (nukex): TaskCount: (10.0) Weight: (500.0)
Determining the Available Resources
Computing Targets
Populating Targets
Group: nukex Target: 1
Group: modo Target: 0
Algorithm (666): Targets:
Algorithm (666): Region: forge
Algorithm (666): Group: nukex: 1
Algorithm (666): Group: modo: 0
Worker Thread (666): Equalizing targets...
Equalizer (666): Region: forge : Enabled
PYTHON: *** GetActiveInstances ***
PYTHON: *** populate_CloudInstance ***
PYTHON: ---- ID 123456
PYTHON: ---- Name ForgeTestInstance
PYTHON: ---- Hostname ForgeTestHost
PYTHON: ---- InstanceStatus 2
PYTHON: ---- ImageID ForgeTestImage
PYTHON: ---- HardwareID ForgeTestHardware
PYTHON: ---- PublicIP
PYTHON: ---- Provider Forge
PYTHON: ---- GroupName nukex
PYTHON: ---- RegionName forge
PYTHON: ---- Zone forge
PYTHON: ------ Appending 1793703675
PYTHON: ------ Active 1
PYTHON: returning [<Deadline.Cloud.CloudInstance object at 0x7fb432f469e0>]
Equalizer (666): Group: modo: Enabled 0 / 0
Equalizer (666): Group: nukex: Enabled 0 / 1
Equalizer (666): Requesting 1 new instances be started.
PYTHON: *** CreateInstances ***
PYTHON: *** populate_CloudInstance ***
PYTHON: ---- ID 123456
PYTHON: ---- Name ForgeTestInstance
PYTHON: ---- Hostname ForgeTestHost
PYTHON: ---- InstanceStatus 2
PYTHON: ---- ImageID ForgeTestImage
PYTHON: ---- HardwareID ForgeTestHardware
PYTHON: ---- PublicIP
PYTHON: ---- Provider Forge
PYTHON: ---- GroupName nukex
PYTHON: ---- RegionName forge
PYTHON: ---- Zone forge
PYTHON: Starting 1793703675
PYTHON: returning [<Deadline.Cloud.CloudInstance object at 0x7fb432f469e0>]

eosiowy · February 23, 2016, 5:32pm

So I did have to add a .param file with the py file. I was getting some weird behavior if I left it blank, so I added a control to it. Did you create a param file? Did you have to put anything in it?

sipickles · February 23, 2016, 5:34pm

Param file just contains:

[Enabled]
Type=boolean
Category=General
CategoryOrder=0
Index=0
Label=Enabled
Default=false
Description=Whether or not this Cloud Region is enabled.
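One way to check that a .param file really defines the control quoted above is to parse it. The sketch below assumes the file is INI-compatible, which the layout suggests but this thread does not confirm; `load_param_controls` is a hypothetical helper built on Python's standard `configparser`, not part of Deadline.

```python
import configparser

# The param contents quoted in this post, one key per line.
PARAM_TEXT = """\
[Enabled]
Type=boolean
Category=General
CategoryOrder=0
Index=0
Label=Enabled
Default=false
Description=Whether or not this Cloud Region is enabled.
"""


def load_param_controls(text):
    """Return {control_name: {key: value}} for INI-style param text (assumed format)."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key names case-sensitive instead of lowercasing
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}


controls = load_param_controls(PARAM_TEXT)
print(sorted(controls))             # names of the controls that were found
print(controls["Enabled"]["Type"])  # boolean
```

Reading the text of the file on disk and passing it to `load_param_controls` shows whether any controls are present; empty or whitespace-only text yields an empty dictionary.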

sipickles · February 23, 2016, 6:46pm

I found this in the balancer log, a lot earlier. Is it relevant?

Worker Thread (12): Waiting 20 seconds before next cycle.
Worker Thread (665): House Keeping...
HouseKeeper (665): Beginning house keeping...
System.NullReferenceException: Object reference not set to an instance of an object
  at Deadline.StorageDB.MongoDB.MongoCloudStorage.GetCloudRegion (System.String cloudRegion, Boolean invalidateCache) <0x41961ee0 + 0x00041> in <filename unknown>:0
  at Deadline.Balancer.BalancerHouseKeeper.HouseKeep (System.String& errorMessage) <0x41921720 + 0x0031e> in <filename unknown>:0

eosiowy · February 23, 2016, 6:52pm

I’m not seeing that on my end. What are your permissions on the plugin directory and the balancer directory?

sipickles · February 23, 2016, 7:26pm

[root@deadline-wwpu8 DeadlineRepository]# ls -alh
total 56K
drwxrwxr-x 14 forge_user forge_user 4.0K Feb 23 19:23 .
drwxr-xr-x 7 forge_user forge_user 4.0K Feb 23 16:04 ..
drwxrwxr-x 3 forge_user forge_user 4.0K Feb 23 13:27 api
drwxrwxr-x 3 forge_user forge_user 4.0K Feb 23 13:27 balancer
drwxrwxr-x 3 forge_user forge_user 4.0K Feb 23 13:27 bin
drwxrwxr-x 3 forge_user forge_user 4.0K Feb 23 13:27 cloud
drwxrwxr-x 9 forge_user forge_user 4.0K Feb 23 13:27 events
drwxrwxr-x 3 forge_user forge_user 4.0K Feb 23 16:04 jobs
drwxrwxr-x 11 forge_user forge_user 4.0K Feb 23 13:27 plugins
drwxrwxr-x 2 forge_user forge_user 4.0K Feb 23 13:27 pythonsync
drwxrwxr-x 12 forge_user forge_user 4.0K Feb 23 13:27 scripts
drwxrwxr-x 2 forge_user forge_user 4.0K Feb 23 18:38 settings
drwxrwxr-x 39 forge_user forge_user 4.0K Feb 23 13:27 submission
drwxrwxr-x 2 forge_user forge_user 4.0K Feb 23 13:27 vmx
[root@deadline-wwpu8 DeadlineRepository]# ls balancer/ -alh
total 12K
drwxrwxr-x 3 forge_user forge_user 4.0K Feb 23 13:27 .
drwxrwxr-x 14 forge_user forge_user 4.0K Feb 23 19:23 ..
drwxrwxr-x 2 forge_user forge_user 4.0K Feb 23 13:27 DefaultAlgorithm
[root@deadline-wwpu8 DeadlineRepository]# ls cloud/ -alh
total 12K
drwxrwxr-x 3 forge_user forge_user 4.0K Feb 23 13:27 .
drwxrwxr-x 14 forge_user forge_user 4.0K Feb 23 19:23 ..
drwxrwxr-x 2 forge_user forge_user 4.0K Feb 23 17:37 Forge
[root@deadline-wwpu8 DeadlineRepository]# ls cloud/Forge/ -alh
total 14K
drwxrwxr-x 2 forge_user forge_user 4.0K Feb 23 17:37 .
drwxrwxr-x 3 forge_user forge_user 4.0K Feb 23 13:27 ..
-rwxrwxr-x 1 forge_user forge_user 2 Feb 23 17:37 Forge.param
-rwxrwxr-x 1 forge_user forge_user 5.5K Feb 23 13:04 Forge.py

Isn’t it complaining about a MongoDB error?

eosiowy · February 23, 2016, 8:56pm

I was thinking that it could be a permissions thing because it doesn’t know the filename. If you remove the region and recreate it, does that housekeeping error still happen?