When I started a render node manually (the balancer seems to have taken a dirt nap in this update… I’ll update you after a hard restart), my zfs file server suddenly disappeared from the cloud panel, causing a bit of a heart attack, but a quick list from the gcloud command line showed it still running. My new render VM didn’t show up in the cloud panel either, but it did show up in the ‘slaves’ panel.
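(For anyone following along, that check was just the stock instance listing, e.g. `gcloud compute instances list`, which still showed the file server with a RUNNING status.)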
No ideas immediately come to mind. I suggest checking the Console panel for errors. Similarly, any error messages regarding the Balancer would be helpful.
Found the reason the balancer wasn’t starting. The update changed our cloud prefs for the balancer to ‘false’. I’ve attached the balancer log and another image of what I’m seeing.
starting from scratch: stop the balancer, delete all running VMs. We are at ground zero.
start the balancer with a budget of 10.00 and VMs at 0.5, which should equal 20 VMs (see the sketch after these steps)
the balancer spins up 20 VMs assigned to cloud pool and cloud group
our zfs server disappears from the cloud panel as soon as the renders start
while the render VMs are coming up, they are blue in the slaves panel, and they have the correct pool and group assignment.
the slave VMs do not show up in the cloud panel
when the slaves finish coming up and are ready to render, they go into ‘idle’ state, and the pool and group designations are gone.
now I have 20 idle VMs without pool or group designations.
another 20 VMs are spun up, with correct pool and group assignments
repeat…
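For clarity, the math behind the second step above (a minimal sketch, with our own per-VM cost plugged in):

```python
# minimal sketch of the budget math from the steps above
budget = 10.00        # balancer budget
cost_per_vm = 0.50    # the per-VM cost we set for these instances
target_vms = int(budget // cost_per_vm)
print(target_vms)     # 20 -- the most VMs the balancer should ever have running
```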
Eventually I go to our Google Cloud web console, delete the 80+ VMs now idle on our farm, and turn the balancer off.
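(The CLI equivalent, for what it’s worth, would be something like `gcloud compute instances delete <instance-names> --zone=<zone>`; the placeholders are ours to fill in.)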
The dead VMs are still listed in my slaves panel, some shown as stalled, the rest offline.
balancerlog02.txt (20.2 KB)
Looks like the Balancer is erroring when it tries to get all the instances and compare them to the ones that are tracked.
I’ll look into the slaves not replacing their placeholders. It’s probably an issue with hostnames. Are you using Windows instances? Has the image been Sysprepped?
We downgraded to 7.2.0.13 this morning and cloud is working as expected. We may try 15R tomorrow morning, but we’re a little gun-shy after the balancer spun up 120 x 32-core instances (instead of the 20 the budget allowed for) before I shut the balancer down and removed the instances via the Google dashboard. It was not a cheap oops.
Thanks for the update, Chris. This is very bad. Balancer takes great measures to prevent excess VMs from being launched, but you found an edge case where a status call to the provider returned an error that was subsequently interpreted as a count of zero VMs. We are looking at adding a check so that if a cloud API call errors out, the Balancer will not request more VMs to be launched.
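To make the intended behaviour concrete, the guard will look roughly like this (a minimal sketch only; the provider object and names here are illustrative, not the actual Balancer source):

```python
import logging

log = logging.getLogger("balancer")

def desired_new_vm_count(provider, target_count):
    """Decide how many new VMs to request in this balancing cycle.

    Sketch of the guard described above: if the provider status call fails,
    treat the fleet size as unknown rather than zero and launch nothing.
    """
    try:
        running = provider.list_instances()   # hypothetical cloud status call
    except Exception as err:
        # Previously an error here was effectively treated as "0 VMs running",
        # so the Balancer requested the full target again on every cycle.
        log.warning("Status call failed (%s); not launching this cycle", err)
        return 0

    # Only ask for the difference between the budgeted target and what exists.
    return max(0, target_count - len(running))
```

In other words, a failed status call would be treated as “fleet size unknown” for that cycle rather than “zero VMs running”.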