Hey guys,
We had some problems with Sohonet today, which led to some random connectivity issues. Everything came back within minutes, but it seems like the Mongo database could not handle whatever went down. After a couple of hours, the primary member gave up and started swapping roles with the secondary, while the machines all hung with this message:
---- 2014/07/07 20:57 ----
An error occurred when updating the last write time in the database: Attempted to read past the end of the stream. (System.IO.EndOfStreamException)
Slave - An error occurred while updating the slave's info: An unexpected error occurred while interacting with the database (deadline.scanlinevfxla.com:27017,deadline01.scanlinevfxla.com:27017,deadline03.scanlinevfxla.com:27017):
A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 172.18.4.58:27017 (FranticX.Database.DatabaseConnectionException)
---- 2014/07/07 21:04 ----
Slave - An error occurred while updating the slave's info: An error occurred while trying to connect to the Database (deadline.scanlinevfxla.com:27017,deadline01.scanlinevfxla.com:27017,deadline03.scanlinevfxla.com:27017). It is possible that the Mongo Database server is incorrectly configured, currently offline, blocked by a firewall, or experiencing network issues.
Full error: Unable to connect to a member of the replica set matching the read preference Primary (FranticX.Database.DatabaseConnectionException)
Slave - An error occurred while updating the slave's info: An error occurred while trying to connect to the Database (deadline.scanlinevfxla.com:27017,deadline01.scanlinevfxla.com:27017,deadline03.scanlinevfxla.com:27017). It is possible that the Mongo Database server is incorrectly configured, currently offline, blocked by a firewall, or experiencing network issues.
Full error: Unable to connect to a member of the replica set matching the read preference Primary (FranticX.Database.DatabaseConnectionException)
---- 2014/07/07 21:05 ----
Slave - An error occurred while updating the slave's info: An error occurred while trying to connect to the Database (deadline.scanlinevfxla.com:27017,deadline01.scanlinevfxla.com:27017,deadline03.scanlinevfxla.com:27017). It is possible that the Mongo Database server is incorrectly configured, currently offline, blocked by a firewall, or experiencing network issues.
Full error: Unable to connect to a member of the replica set matching the read preference Primary (FranticX.Database.DatabaseConnectionException)
---- 2014/07/07 21:11 ----
Slave - An error occurred while updating the slave's info: An error occurred while trying to connect to the Database (deadline.scanlinevfxla.com:27017,deadline01.scanlinevfxla.com:27017,deadline03.scanlinevfxla.com:27017). It is possible that the Mongo Database server is incorrectly configured, currently offline, blocked by a firewall, or experiencing network issues.
Full error: Unable to connect to a member of the replica set matching the read preference Primary (FranticX.Database.DatabaseConnectionException)
Slave - An error occurred while updating the slave's info: An unexpected error occurred while interacting with the database (deadline.scanlinevfxla.com:27017,deadline01.scanlinevfxla.com:27017,deadline03.scanlinevfxla.com:27017):
Unable to write data to the transport connection: An established connection was aborted by the software in your host machine. (FranticX.Database.DatabaseConnectionException)
---- 2014/07/07 21:13 ----
Slave - An error occurred while updating the slave's info: An error occurred while trying to connect to the Database (deadline.scanlinevfxla.com:27017,deadline01.scanlinevfxla.com:27017,deadline03.scanlinevfxla.com:27017). It is possible that the Mongo Database server is incorrectly configured, currently offline, blocked by a firewall, or experiencing network issues.
Full error: Unable to connect to a member of the replica set matching the read preference Primary (FranticX.Database.DatabaseConnectionException)
---- 2014/07/07 21:16 ----
Slave - An error occurred while updating the slave's info: An unexpected error occurred while interacting with the database (deadline.scanlinevfxla.com:27017,deadline01.scanlinevfxla.com:27017,deadline03.scanlinevfxla.com:27017):
Unable to write data to the transport connection: An established connection was aborted by the software in your host machine. (FranticX.Database.DatabaseConnectionException)
---- 2014/07/07 21:17 ----
Slave - An error occurred while updating the slave's info: An error occurred while trying to connect to the Database (deadline.scanlinevfxla.com:27017,deadline01.scanlinevfxla.com:27017,deadline03.scanlinevfxla.com:27017). It is possible that the Mongo Database server is incorrectly configured, currently offline, blocked by a firewall, or experiencing network issues.
Full error: Unable to connect to a member of the replica set matching the read preference Primary (FranticX.Database.DatabaseConnectionException)
Even after the replica set was healthy again, these machines never managed to reconnect. This box, for example, has been hanging on a 100% complete frame like that for hours. I'm writing this message at 21:38, and as you can see it hasn't tried to reconnect or finish the frame since 21:17 (which is probably when the db came back). Yet it's still hanging there, not finishing…
And we probably have the whole farm like this (farm utilization is at 1-2%, while all slaves are shown as ‘rendering’).