
network outage -> dead farm

Hey guys,

We had some problems with Sohonet today, which caused intermittent connectivity issues. Everything came back within minutes, but it seems the MongoDB could not handle whatever went down. After a couple of hours the primary member gave up and started swapping roles with the secondary, while the machines all hung with this message:

---- 2014/07/07 20:57 ----
An error occurred when updating the last write time in the database: Attempted to read past the end of the stream. (System.IO.EndOfStreamException)
Slave - An error occurred while updating the slave's info: An unexpected error occurred while interacting with the database (deadline.scanlinevfxla.com:27017,deadline01.scanlinevfxla.com:27017,deadline03.scanlinevfxla.com:27017):
A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 172.18.4.58:27017 (FranticX.Database.DatabaseConnectionException)
---- 2014/07/07 21:04 ----
Slave - An error occurred while updating the slave's info: An error occurred while trying to connect to the Database (deadline.scanlinevfxla.com:27017,deadline01.scanlinevfxla.com:27017,deadline03.scanlinevfxla.com:27017). It is possible that the Mongo Database server is incorrectly configured, currently offline, blocked by a firewall, or experiencing network issues.
Full error: Unable to connect to a member of the replica set matching the read preference Primary (FranticX.Database.DatabaseConnectionException)
Slave - An error occurred while updating the slave's info: An error occurred while trying to connect to the Database (deadline.scanlinevfxla.com:27017,deadline01.scanlinevfxla.com:27017,deadline03.scanlinevfxla.com:27017). It is possible that the Mongo Database server is incorrectly configured, currently offline, blocked by a firewall, or experiencing network issues.
Full error: Unable to connect to a member of the replica set matching the read preference Primary (FranticX.Database.DatabaseConnectionException)
---- 2014/07/07 21:05 ----
Slave - An error occurred while updating the slave's info: An error occurred while trying to connect to the Database (deadline.scanlinevfxla.com:27017,deadline01.scanlinevfxla.com:27017,deadline03.scanlinevfxla.com:27017). It is possible that the Mongo Database server is incorrectly configured, currently offline, blocked by a firewall, or experiencing network issues.
Full error: Unable to connect to a member of the replica set matching the read preference Primary (FranticX.Database.DatabaseConnectionException)
---- 2014/07/07 21:11 ----
Slave - An error occurred while updating the slave's info: An error occurred while trying to connect to the Database (deadline.scanlinevfxla.com:27017,deadline01.scanlinevfxla.com:27017,deadline03.scanlinevfxla.com:27017). It is possible that the Mongo Database server is incorrectly configured, currently offline, blocked by a firewall, or experiencing network issues.
Full error: Unable to connect to a member of the replica set matching the read preference Primary (FranticX.Database.DatabaseConnectionException)
Slave - An error occurred while updating the slave's info: An unexpected error occurred while interacting with the database (deadline.scanlinevfxla.com:27017,deadline01.scanlinevfxla.com:27017,deadline03.scanlinevfxla.com:27017):
Unable to write data to the transport connection: An established connection was aborted by the software in your host machine. (FranticX.Database.DatabaseConnectionException)
---- 2014/07/07 21:13 ----
Slave - An error occurred while updating the slave's info: An error occurred while trying to connect to the Database (deadline.scanlinevfxla.com:27017,deadline01.scanlinevfxla.com:27017,deadline03.scanlinevfxla.com:27017). It is possible that the Mongo Database server is incorrectly configured, currently offline, blocked by a firewall, or experiencing network issues.
Full error: Unable to connect to a member of the replica set matching the read preference Primary (FranticX.Database.DatabaseConnectionException)
---- 2014/07/07 21:16 ----
Slave - An error occurred while updating the slave's info: An unexpected error occurred while interacting with the database (deadline.scanlinevfxla.com:27017,deadline01.scanlinevfxla.com:27017,deadline03.scanlinevfxla.com:27017):
Unable to write data to the transport connection: An established connection was aborted by the software in your host machine. (FranticX.Database.DatabaseConnectionException)
---- 2014/07/07 21:17 ----
Slave - An error occurred while updating the slave's info: An error occurred while trying to connect to the Database (deadline.scanlinevfxla.com:27017,deadline01.scanlinevfxla.com:27017,deadline03.scanlinevfxla.com:27017). It is possible that the Mongo Database server is incorrectly configured, currently offline, blocked by a firewall, or experiencing network issues.
Full error: Unable to connect to a member of the replica set matching the read preference Primary (FranticX.Database.DatabaseConnectionException)
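
For reference, a quick way to sanity-check the replica set independently of Deadline while this is happening is to ask Mongo for replSetGetStatus directly. A minimal sketch with pymongo (assuming pymongo 3.x; the replica set name "rs0" is a placeholder, substitute your own):

from pymongo import MongoClient
from pymongo.errors import PyMongoError

# Seed list copied from the errors above; "rs0" is an assumed replica set name.
SEEDS = ("deadline.scanlinevfxla.com:27017,"
         "deadline01.scanlinevfxla.com:27017,"
         "deadline03.scanlinevfxla.com:27017")

try:
    # Short timeouts so a flaky link fails fast instead of hanging.
    client = MongoClient(SEEDS, replicaSet="rs0",
                         serverSelectionTimeoutMS=5000,
                         connectTimeoutMS=5000)
    status = client.admin.command("replSetGetStatus")
    for member in status["members"]:
        print(member["name"], member["stateStr"], "health:", member["health"])
except PyMongoError as exc:
    print("Replica set unreachable or unhealthy:", exc)

If no member reports PRIMARY there, the "read preference Primary" errors above are expected, and the problem is on the Mongo/network side rather than in the Slaves.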

Even after the replica set was healthy again, these machines never managed to reconnect. This box, for example, has been hanging at a 100% complete frame like that for hours. I'm writing this message at 21:38, and as you can see it hasn't tried to reconnect or finish the frame since 21:17 (which is probably when the DB came back). Yet it's still hanging there, not finishing…

And we probably have the whole farm like this (farm utilization is at 1-2%… while all Slaves are shown as ‘rendering’).

The Max process is still there, but it seems like the Slave, even after reconnecting, just sits there doing nothing.

We are still experiencing issues, and large chunks of the farm keep losing connectivity to the Deadline database… We will try to roll back to the previous version of Deadline (0.24). We only rolled out the release version 12 hours ago, and my feeling is that these problems are somehow related (and that the Sohonet outage is potentially just an unrelated coincidence).

I get errors like this in the monitor:

2014-07-07 22:01:47: Error occurred while updating slave cache: Server instance deadline01.scanlinevfxla.com:27017 is no longer connected. (System.InvalidOperationException)
2014-07-07 22:09:07: Error occurred while updating Cloud Instances: An unexpected error occurred while interacting with the database (deadline01.scanlinevfxla.com:27017,deadline.scanlinevfxla.com:27017,deadline03.scanlinevfxla.com:27017):
2014-07-07 22:09:07: Timeout waiting for a MongoConnection. (FranticX.Database.DatabaseConnectionException)
2014-07-07 22:09:07: at Deadline.StorageDB.MongoDB.MongoDBUtils.HandleException(MongoServer server, Exception ex)
2014-07-07 22:09:07: at Deadline.StorageDB.MongoDB.MongoCloudStorage.GetCloudRegions(Boolean invalidateCache)
2014-07-07 22:09:07: at Deadline.StorageDB.CloudStorage.UpdateData()
2014-07-07 22:09:07: Error occurred while updating pulse cache: Timeout waiting for a MongoConnection. (System.TimeoutException)
2014-07-07 22:09:07: Error occurred while updating limit group cache: Timeout waiting for a MongoConnection. (System.TimeoutException)
2014-07-07 22:09:07: Error occurred while updating job cache: Timeout waiting for a MongoConnection. (System.TimeoutException)
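
When the Monitor reports "Timeout waiting for a MongoConnection", a raw TCP check against each replica member can help tell a network or firewall problem apart from mongod itself being down. A minimal sketch (host names taken from the errors above):

import socket

# Replica members from the errors above.
MEMBERS = [
    ("deadline.scanlinevfxla.com", 27017),
    ("deadline01.scanlinevfxla.com", 27017),
    ("deadline03.scanlinevfxla.com", 27017),
]

for host, port in MEMBERS:
    try:
        # Three-second timeout so a dropped route fails quickly.
        sock = socket.create_connection((host, port), timeout=3)
        sock.close()
        print("%s:%d reachable" % (host, port))
    except socket.error as exc:
        print("%s:%d NOT reachable: %s" % (host, port, exc))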

We have rolled back 2/3 of the farm, and ended up rebooting the primary Mongo server. So far, no connection errors since (it's been 25 minutes).

So fingers crossed… we will do a gradual rollout again over the next few days to see whether it's related to the issues.

Hmmm, that definitely shouldn’t happen. We’ll have to do some testing here to replicate that behaviour.

In terms of your replica set setup, where are all the machines in relation to each other? The network that went down clearly separated the primary from the Slaves, but did it separate the primary from the secondary/arbiter? Would the Slaves have been able to connect to the secondary in the meantime? I'm mainly asking so we can best replicate this here by knowing which network cables to unplug, to hopefully get the same behaviour.
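
For replicating this without literally pulling cables, one option is to emulate the flaky link in software. A rough sketch, assuming a Linux test node with root access and iptables, that repeatedly drops and restores outbound traffic to the Mongo port so connections die mid-flight rather than failing cleanly:

import subprocess
import time

# DROP rule for outbound traffic to the Mongo port (27017).
RULE = ["OUTPUT", "-p", "tcp", "--dport", "27017", "-j", "DROP"]

def set_drop(enabled):
    # -A appends the DROP rule, -D removes it again.
    action = "-A" if enabled else "-D"
    subprocess.check_call(["iptables", action] + RULE)

for _ in range(5):
    set_drop(True)    # simulate the outage
    time.sleep(30)    # long enough for connections to die mid-flight
    set_drop(False)   # link comes back
    time.sleep(120)   # let the Slave try to recover before the next blip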

I’ll also be sure to check what we changed since rev 24. I don’t think it would be anything that affects DB connections, but you never know.

Frankly, there was some magical voodoo stuff going on for hours, so it’s really hard to tell. It seems that maybe the network interface on the primary server went into some weird state, because only a full reboot of that machine seemed to resolve the situation. Since the outage took down the majority of the machines, it’s hard to isolate, for example, which ‘load drop’ was related to Deadline going crazy and which was simply the Slaves going away…

It seemed like the machines could connect to the primary and secondary, and all replica members could contact each other, but they would randomly lose the connection every now and then. And if that happened at a bad time, the replica set would go nuts (of course). I think that when I “thought” everything should be OK on the DB side, the situation was actually still not resolved. The problem is, by the time we rebooted the main DB server as a last resort, most Slaves had already been restarted and thus ‘bumped’ out of their hung state.

Gotcha, at least the reboot was able to resolve the issue. I wonder if it was some kinda weird DNS, or lingering connection issue server-side… Either way, it certainly won’t hurt us to do some more testing here to see if we can’t make things a bit more robust. These failures tend to be super unpredictable and kinda different each time, but there might be something we can improve when the Slave is trying to re-connect to a downed server.
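
For what it's worth, the symptom looks like a client blocked on a half-open socket. The general shape that avoids hanging forever is a bounded retry loop with explicit connect/socket timeouts and a forced round trip before declaring the connection good. A rough sketch with pymongo (not Deadline's actual code, just the pattern):

import time
from pymongo import MongoClient
from pymongo.errors import PyMongoError

def connect_with_retry(uri, attempts=10, delay=30):
    for attempt in range(1, attempts + 1):
        try:
            client = MongoClient(uri,
                                 connectTimeoutMS=5000,   # fail fast on dead routes
                                 socketTimeoutMS=10000)   # don't block on half-open sockets
            client.admin.command("ping")                  # force a real round trip
            return client
        except PyMongoError as exc:
            print("Attempt %d/%d failed: %s" % (attempt, attempts, exc))
            time.sleep(delay)
    raise RuntimeError("Database still unreachable after %d attempts" % attempts)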

Either way, here’s to hoping it doesn’t happen to you guys again anytime soon!
