When I started a render node manually (the balancer seems to have taken a dirt nap in this update… I’ll update you after a hard restart), my zfs file server suddenly disappeared from the cloud panel, causing a bit of a heart attack, but a quick list from the gcloud command line showed it still running. My new render VM didn’t show up in the cloud panel either, but it did show up in the ‘slaves’ panel.
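(For anyone following along, that check was just the stock instance listing, e.g. `gcloud compute instances list`, which still showed the file server with a RUNNING status.)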
No ideas immediately come to mind. I suggest checking the Console panel for errors. Similarly, any error messages regarding the Balancer would be helpful.
Found the reason the balancer wasn’t starting. The update changed our cloud prefs for the balancer to ‘false’. I’ve attached the balancer log and another image of what I’m seeing.
starting from scratch: stop the balancer, delete all running VMs. We are at ground zero.
start the balancer with a budget of 10.00 and VMs at 0.5, which should equal 20 VMs (see the sketch after these steps)
the balancer spins up 20 VMs assigned to cloud pool and cloud group
our zfs server disappears from the cloud panel as soon as the renders start
while the render VMs are coming up, they are blue in the slaves panel, and they have the correct pool and group assignment.
the slave VMs do not show up in the cloud panel
when the slaves finish coming up and are ready to render, they go into ‘idle’ state, and the pool and group designations are gone.
now I have 20 idle VMs without pool or group designations.
another 20 VMs are spun up, with correct pool and group assignments
repeat…
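For clarity, the math behind the second step above (a minimal sketch, with our own per-VM cost plugged in):

```python
# minimal sketch of the budget math from the steps above
budget = 10.00        # balancer budget
cost_per_vm = 0.50    # the per-VM cost we set for these instances
target_vms = int(budget // cost_per_vm)
print(target_vms)     # 20 -- the most VMs the balancer should ever have running
```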
Eventually I go to our Google Cloud web console, delete the 80+ VMs now idle on our farm, and turn the balancer off.
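(The CLI equivalent, for what it’s worth, would be something like `gcloud compute instances delete <instance-names> --zone=<zone>`; the placeholders are ours to fill in.)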
The dead VMs are still listed in my slaves panel, some shown as stalled, the rest offline.
balancerlog02.txt (20.2 KB)
Looks like the Balancer is erroring when it tries to get all the instances and compare them to the ones that are tracked.
I’ll look into the slaves not replacing their placeholders. It’s probably an issue with hostnames. Are you using Windows instances? Has the image been Sysprepped?
We downgraded to 7.2.0.13 this morning and cloud is working as expected. We may try 15R tomorrow morning, but we’re a little gun-shy after the balancer spun up 120 x 32-core instances (instead of the 20 the budget allowed for) before I shut the balancer down and removed the instances via the Google dashboard. It was not a cheap oops.
Thanks for the update, Chris. This is very bad. Balancer takes great measures to prevent excess VMs from being launched, but you found an edge case where a status call to the provider returned an error that was subsequently interpreted as a count of zero VMs. We are looking at adding a check so that if a cloud API call errors out, the Balancer will not request more VMs to be launched.
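To make the intended behaviour concrete, the guard will look roughly like this (a minimal sketch only; the provider object and names here are illustrative, not the actual Balancer source):

```python
import logging

log = logging.getLogger("balancer")

def desired_new_vm_count(provider, target_count):
    """Decide how many new VMs to request in this balancing cycle.

    Sketch of the guard described above: if the provider status call fails,
    treat the fleet size as unknown rather than zero and launch nothing.
    """
    try:
        running = provider.list_instances()   # hypothetical cloud status call
    except Exception as err:
        # Previously an error here was effectively treated as "0 VMs running",
        # so the Balancer requested the full target again on every cycle.
        log.warning("Status call failed (%s); not launching this cycle", err)
        return 0

    # Only ask for the difference between the budgeted target and what exists.
    return max(0, target_count - len(running))
```

In other words, a failed status call would be treated as “fleet size unknown” for that cycle rather than “zero VMs running”.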