I’ve been working on this bug for a few days. We have a custom ‘cloud’ plugin which worked in 7.1.2.1, but doesnt in 8 (tried in 8.0.0.50+).
During the balancer cycle, a single host is requested. This is returned as a list from CreateInstances. Then, on the next cycle, GetActiveInstances finds this new host, and returns it.
However, for some reason, Balancer does not see the active instance as valid, and tries to start another one. This repeats endlessly each cycle.
I wrote the simplest possible cloud plugin which just mocks up the host creation (ie, no external systems involved). It is still broken!
I’ve taken a look at your simple plugin and what you’re experiencing is actually expected behavior. Let me explain.
The first Balancer cycle starts. The Balancer checks all the running instances and compares them to an internal list of instances it’s created. We do this to make sure that we have an accurate picture of the instances we’ve started. In our simple plugin case, we haven’t started any instances yet but we see 1 instance out there. That’s fine. The Balancer doesn’t touch instances it hasn’t created so it’s ignored. Then we determine our targets and we see that we need to start 1 instance. Which is true, the Balancer hasn’t started any instances yet and there’s work to be done so we need to start an instance. Now the Balancer records the info for our newly created instance to our internal list.
Then the second Balancer cycle happens. This time the result is different. We check our active instances and compare it to our internal list and what do you know, there’s a match. Looks like the instance the Balancer started is still there. Then we determine our targets, it’s still 1 because there’s still work to do and then we move on to the Equalizing step and the Balancer sees that it’s already started 1 instance and that fits our target so we don’t need to start any more.
Here you can see my Balancer after multiple cycles. It only tries to create an instance the first time.
Interestingly, when you run that plugin it works! When I run it, I get very different output. It keeps trying to start a new slave - it never matches the internal state. Here is the log output from a few balancer cycles:
Worker Thread (665): Updating the State...
State Cache (665): Setup...
State Cache (665): Populating Cloud Regions...
State Cache (665): Finished populating 1 Cloud Regions.
State Cache (665): Populating Slave data...
State Cache (665): Finished populating 1 Slaves.
State Cache (665): Populating Limits data...
State Cache (665): Finished populating 1 Limit groups.
State Cache (665): Populating Job data...
State Cache (665): Finished populating 1 Jobs.
State Cache (665): Populating Groups data...
State Cache (665): Finished populating 2 Groups.
Worker Thread (665): Executing the Algorithm: DefaultAlgorithm...
Creating Balancer State Struct
Computing Demand
Group (nukex): TaskCount: (10.0) Weight: (500.0)
Determining the Available Resources
Computing Targets
Populating Targets
Group: nukex Target: 1
Group: modo Target: 0
Algorithm (665): Targets:
Algorithm (665): Region: forge
Algorithm (665): Group: nukex: 1
Algorithm (665): Group: modo: 0
Worker Thread (665): Equalizing targets...
Equalizer (665): Region: forge : Enabled
PYTHON: *** GetActiveInstances ***
PYTHON: *** populate_CloudInstance ***
PYTHON: ---- ID 123456
PYTHON: ---- Name ForgeTestInstance
PYTHON: ---- Hostname ForgeTestHost
PYTHON: ---- InstanceStatus 2
PYTHON: ---- ImageID ForgeTestImage
PYTHON: ---- HardwareID ForgeTestHardware
PYTHON: ---- PublicIP
PYTHON: ---- Provider Forge
PYTHON: ---- GroupName nukex
PYTHON: ---- RegionName forge
PYTHON: ---- Zone forge
PYTHON: ------ Appending 1793703675
PYTHON: ------ Active 1
PYTHON: returning Ä<Deadline.Cloud.CloudInstance object at 0x7fb432f56518>Å
Equalizer (665): Group: modo: Enabled 0 / 0
Equalizer (665): Group: nukex: Enabled 0 / 1
Equalizer (665): Requesting 1 new instances be started.
PYTHON: *** CreateInstances ***
PYTHON: *** populate_CloudInstance ***
PYTHON: ---- ID 123456
PYTHON: ---- Name ForgeTestInstance
PYTHON: ---- Hostname ForgeTestHost
PYTHON: ---- InstanceStatus 2
PYTHON: ---- ImageID ForgeTestImage
PYTHON: ---- HardwareID ForgeTestHardware
PYTHON: ---- PublicIP
PYTHON: ---- Provider Forge
PYTHON: ---- GroupName nukex
PYTHON: ---- RegionName forge
PYTHON: ---- Zone forge
PYTHON: Starting 1793703675
PYTHON: returning Ä<Deadline.Cloud.CloudInstance object at 0x7fb432f56518>Å
Worker Thread (12): Waiting 20 seconds before next cycle.
Worker Thread (666): House Keeping...
HouseKeeper (666): Beginning house keeping...
HouseKeeper (666): House keeping completed.
Worker Thread (666): Updating the State...
State Cache (666): Setup...
State Cache (666): Populating Cloud Regions...
State Cache (666): Finished populating 1 Cloud Regions.
State Cache (666): Populating Slave data...
State Cache (666): Finished populating 1 Slaves.
State Cache (666): Populating Limits data...
State Cache (666): Finished populating 1 Limit groups.
State Cache (666): Populating Job data...
State Cache (666): Finished populating 1 Jobs.
State Cache (666): Populating Groups data...
State Cache (666): Finished populating 2 Groups.
Worker Thread (666): Executing the Algorithm: DefaultAlgorithm...
Creating Balancer State Struct
Computing Demand
Group (nukex): TaskCount: (10.0) Weight: (500.0)
Determining the Available Resources
Computing Targets
Populating Targets
Group: nukex Target: 1
Group: modo Target: 0
Algorithm (666): Targets:
Algorithm (666): Region: forge
Algorithm (666): Group: nukex: 1
Algorithm (666): Group: modo: 0
Worker Thread (666): Equalizing targets...
Equalizer (666): Region: forge : Enabled
PYTHON: *** GetActiveInstances ***
PYTHON: *** populate_CloudInstance ***
PYTHON: ---- ID 123456
PYTHON: ---- Name ForgeTestInstance
PYTHON: ---- Hostname ForgeTestHost
PYTHON: ---- InstanceStatus 2
PYTHON: ---- ImageID ForgeTestImage
PYTHON: ---- HardwareID ForgeTestHardware
PYTHON: ---- PublicIP
PYTHON: ---- Provider Forge
PYTHON: ---- GroupName nukex
PYTHON: ---- RegionName forge
PYTHON: ---- Zone forge
PYTHON: ------ Appending 1793703675
PYTHON: ------ Active 1
PYTHON: returning Ä<Deadline.Cloud.CloudInstance object at 0x7fb432f469e0>Å
Equalizer (666): Group: modo: Enabled 0 / 0
Equalizer (666): Group: nukex: Enabled 0 / 1
Equalizer (666): Requesting 1 new instances be started.
PYTHON: *** CreateInstances ***
PYTHON: *** populate_CloudInstance ***
PYTHON: ---- ID 123456
PYTHON: ---- Name ForgeTestInstance
PYTHON: ---- Hostname ForgeTestHost
PYTHON: ---- InstanceStatus 2
PYTHON: ---- ImageID ForgeTestImage
PYTHON: ---- HardwareID ForgeTestHardware
PYTHON: ---- PublicIP
PYTHON: ---- Provider Forge
PYTHON: ---- GroupName nukex
PYTHON: ---- RegionName forge
PYTHON: ---- Zone forge
PYTHON: Starting 1793703675
PYTHON: returning Ä<Deadline.Cloud.CloudInstance object at 0x7fb432f469e0>Å
So I did have to add a .param file with the py file. I was getting some weird behavior if I left it blank, so I added a control to it. Did you create a param file? Did you have to put anything in it?
I found this in the balancer log, a lot earlier. Is it relevant?
Worker Thread (12): Waiting 20 seconds before next cycle.
Worker Thread (665): House Keeping...
HouseKeeper (665): Beginning house keeping...
System.NullReferenceException: Object reference not set to an instance of an object
at Deadline.StorageDB.MongoDB.MongoCloudStorage.GetCloudRegion (System.String cloudRegion, Boolean invalidateCache) <0x41961ee0 + 0x00041> in <filename unknown>:0
at Deadline.Balancer.BalancerHouseKeeper.HouseKeep (System.String& errorMessage) <0x41921720 + 0x0031e> in <filename unknown>:0
I was thinking that it could be a permissions thing because it doesn’t know the filename. If you remove the region and recreate it does that housekeeping error still happen?