[7.1.0.30R] cloud zombies

provider: google
DL7 version: 7.1.0.30R
DL and balancer on win7
render VMs centOS 6.6

When a job finishes, we see the nodes going offline in the Google console, but in Monitor they initially disappear, then pop back without any group or pool assignment, and they no longer have a region.

Ahh! This sounds like what I thought was happening. My theory is:

  1. The Balancer tells Google to shut down the slave instances.
  2. The Balancer gets an “all good” message from Google and removes the slaves from the Slave List.
  3. Meanwhile, Google is spinning down the slave instances, but they aren’t actually fully turned off yet.
  4. The Slaves on those instances are still reporting their statuses to the repository, causing new entries to be created.
  5. The slave instances shut down for real, but Google does not shut them down gracefully, so the Slaves never report to the repository that they have shut down.

Resulting in what you see here.

That’s definitely a bug. We’re going to have to change how the Balancer knows that a slave is shut down. Unfortunately, there isn’t a great way of fixing this right now. There’s a setting in Repository Options -> Slave Settings to delete Offline/Stalled Slaves after X days. That’ll make it so the zombies don’t stay around for more than a day. You could write a script that cleans them up, or you could do it manually, neither of which sounds ideal.

Thanks for the screenshot and your patience guys.

Eric
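
To make the ordering concrete, here’s a purely illustrative sketch (not Deadline’s or Google’s actual code; every call on the cloud and repository objects below is a placeholder) of the fix Eric describes: terminate the instance, wait until the slave itself reports Offline/Stalled or stops reporting, and only then delete its repository entry.

```python
# Purely illustrative -- placeholder objects, not Deadline's Balancer internals.
import time

def terminate_and_clean_up(slave_name, instance_id, cloud, repository,
                           timeout=600, poll=15):
    """Shut down a cloud slave without leaving a zombie repository entry."""
    # Step 1: ask the cloud provider to stop the instance (placeholder call).
    cloud.terminate_instance(instance_id)

    # Step 2: instead of trusting the provider's "all good" reply, wait until
    # the slave itself reports Offline/Stalled (or the wait times out).
    give_up_at = time.time() + timeout
    while time.time() < give_up_at:
        status = repository.get_slave_status(slave_name)  # placeholder call
        if status in ("Offline", "Stalled"):
            break
        time.sleep(poll)  # still reporting -- deleting now would recreate the entry

    # Step 3: only delete the entry once the slave can no longer write itself back.
    repository.delete_slave(slave_name)  # placeholder call
```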

Thanks for the response, Eric. I understand you folks are under the gun to get the release out, so features are locked. That said, when do you think we’d see an update to cover this issue? In the meantime, what sort of script would I write to tell the Balancer to drop these deadbeats? A cron job that looks for nodes without a group or pool and removes them?

Thanks for the help.

-ctj

Yeah, 7.1 is feature locked, so this fix won’t be in there. I don’t know if we’ll do a 7.1.1 or something for this. We usually only do those for major fixes. The first beta for 7.2 will definitely have this fixed, and I wouldn’t be surprised to see that come out a week or two after 7.1 is released.

Yeah, the cron job is a good way to go about it. I’d probably look for stalled slaves with no group/pool and remove those.

Thanks,
Eric
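
For anyone wanting to script this, here’s a rough sketch in Python of that cron cleanup. It shells out to deadlinecommand; the option names (-GetSlaveNames, -GetSlaveInfo, -DeleteSlave) and the keys it parses (SlaveStatus, Pools, Groups) are assumptions here, so verify them against the deadlinecommand help output for your version before running anything like this.

```python
#!/usr/bin/env python
# zombie_cleanup.py -- rough sketch: delete offline/stalled slave entries that
# have no pool and no group assignment (the "zombies" described above).
#
# The deadlinecommand options and info keys used here are assumptions; check
# them against the CLI help for your Deadline version before trusting this.
import subprocess

DEADLINECOMMAND = "/usr/local/Thinkbox/Deadline7/bin/deadlinecommand"  # adjust for your install

def run(*args):
    """Run deadlinecommand with the given arguments and return its text output."""
    return subprocess.check_output([DEADLINECOMMAND] + list(args),
                                   universal_newlines=True)

def slave_names():
    """All slave names currently listed in the repository."""
    return [line.strip() for line in run("-GetSlaveNames").splitlines() if line.strip()]

def slave_info(name):
    """Parse the 'Key=Value' lines of a slave's info dump into a dict."""
    info = {}
    for line in run("-GetSlaveInfo", name).splitlines():
        key, sep, value = line.partition("=")
        if sep:
            info[key.strip()] = value.strip()
    return info

def looks_like_zombie(info):
    """Offline or stalled, and not assigned to any pool or group."""
    status = info.get("SlaveStatus", "").lower()
    return status in ("offline", "stalled") and not info.get("Pools") and not info.get("Groups")

if __name__ == "__main__":
    for name in slave_names():
        if looks_like_zombie(slave_info(name)):
            print("Deleting zombie slave entry: %s" % name)
            run("-DeleteSlave", name)
```

Run it from cron as often as you like (hourly, say) and the zombie entries shouldn’t pile up between the repository’s own purge passes.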

FYI. We ended up back-porting this fix into a final…final build of v7.1 :wink:

sweet!
