AWS Thinkbox Discussion Forums

MongoDB replica sets

Hi there,

Do we need to do any special configuration for Deadline to use MongoDB replica sets?

Also, the MongoDB documentation mentions that replica sets can be used to increase read capacity:
“In some cases, you can use replication to increase read capacity. Clients have the ability to send read and write operations to different servers.”

Is this something that would work with Deadline?

cheers,
laszlo

No, nothing special needs to be done to use replica sets.

You can’t, however, use them to increase read capacity. The main reason is that reads from secondaries are eventually consistent, meaning they won’t always match the primary server. That’s where we would like to see sharding come into play (to spread the load). Unfortunately, because Deadline uses some server-side JavaScript, it can’t work with sharding yet. This is something we want to address in Deadline 7.

Cheers,
Ryan

Could the monitors be pointed to the secondaries? They are always a bit out of date anyway.

If replica sets can’t be used to reduce load, and sharding does not work, we can’t really make Deadline scale better without beefing up the primary machine, correct?

Sidenote: can replica sets be used to live migrate the database to another machine? Something like:

  • Set up a secondary machine
  • declare the 2 machines as a replica set
  • remove the original primary (essentially making the previous secondary now the primary box)
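[Editor's sketch, not part of the original post] The steps above can be sketched as the configuration documents that MongoDB's `replSetInitiate` and `replSetReconfig` admin commands take. This is only an illustration under assumptions: the host names and the set name `rs0` are hypothetical, the helper functions are made up for this example, and on a real system you would read the live configuration from the server rather than rebuild it. Nothing here has been verified against Deadline.

```python
import copy


def initial_config(rs_name, hosts):
    """Build the document passed to the replSetInitiate admin command."""
    return {
        "_id": rs_name,
        "members": [{"_id": i, "host": h} for i, h in enumerate(hosts)],
    }


def without_member(config, host):
    """Return a copy of config with `host` removed and the version bumped.

    replSetReconfig expects a version higher than the current one.
    """
    new_config = copy.deepcopy(config)
    new_config["members"] = [
        m for m in new_config["members"] if m["host"] != host
    ]
    new_config["version"] = config.get("version", 1) + 1
    return new_config


# Hypothetical host names, for illustration only.
old_box = "old-db.example.com:27017"
new_box = "new-db.example.com:27017"

# Steps 1 and 2: declare both machines as one replica set.
# With pymongo this would be sent as:
#   client.admin.command("replSetInitiate", cfg)
cfg = initial_config("rs0", [old_box, new_box])

# Step 3: once the new box has synced and has been elected primary
# (for example after replSetStepDown on the old one), drop the old member:
#   client.admin.command("replSetReconfig", migrated)
migrated = without_member(cfg, old_box)
print(migrated)
```

Note that both mongod instances would have to be started with the same replica set name for the set to form, and that the old machine should only be removed after the new one has finished its initial sync.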

Sharding will come; it’s on our roadmap, just not for 6.1.

The interim approach would be vertical scaling [faster machine].

cb

We made a small test setup for the replica sets, and did a simulated failure.

The secondary mongo box was successfully elected primary within seconds; however, Deadline never picked up on that. It’s been timing out ever since we took down the original primary.

Does Deadline need to be configured specially to make this work?

When doing the simulated failure, the monitor would throw these errors:

2013-12-10 15:38:43:  Error occurred while updating job cache: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host. (System.IO.IOException)
2013-12-10 15:38:44:  Error occurred while updating slave cache: Unable to connect to server deadline03.scanlinevfxla.com:27017: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host.. (MongoDB.Driver.MongoConnectionException)
2013-12-10 15:38:47:  Error occurred while updating slave reports: An error occurred while trying to connect to the Database (deadline03.scanlinevfxla.com:27017). It is possible that the Mongo Database server is incorrectly configured, currently offline, blocked by a firewall, or experiencing network issues.
2013-12-10 15:38:47:  Full error: Unable to connect to server deadline03.scanlinevfxla.com:27017: No connection could be made because the target machine actively refused it 172.18.1.107:27017. (FranticX.Database.DatabaseConnectionException)
2013-12-10 15:38:48:  Error occurred while updating task cache: An error occurred while trying to connect to the Database (deadline03.scanlinevfxla.com:27017). It is possible that the Mongo Database server is incorrectly configured, currently offline, blocked by a firewall, or experiencing network issues.
2013-12-10 15:38:48:  Full error: Unable to connect to server deadline03.scanlinevfxla.com:27017: No connection could be made because the target machine actively refused it 172.18.1.107:27017. (FranticX.Database.DatabaseConnectionException)
2013-12-10 15:38:49:  Error occurred while updating job reports: An error occurred while trying to connect to the Database (deadline03.scanlinevfxla.com:27017). It is possible that the Mongo Database server is incorrectly configured, currently offline, blocked by a firewall, or experiencing network issues.
2013-12-10 15:38:49:  Full error: Unable to connect to server deadline03.scanlinevfxla.com:27017: No connection could be made because the target machine actively refused it 172.18.1.107:27017. (FranticX.Database.DatabaseConnectionException)
2013-12-10 15:38:50:  Error occurred while updating limit group cache: Unable to connect to server deadline03.scanlinevfxla.com:27017: No connection could be made because the target machine actively refused it 172.18.1.107:27017. (MongoDB.Driver.MongoConnectionException)

However, once the other machine was elected primary, it’s still failing with similar errors:

2013-12-10 15:39:45:  Error occurred while updating pulse cache: Unable to connect to server deadline03.scanlinevfxla.com:27017: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 172.18.1.107:27017. (MongoDB.Driver.MongoConnectionException)
2013-12-10 15:40:27:  Error occurred while updating Cloud Instances: An error occurred while trying to connect to the Database (deadline03.scanlinevfxla.com:27017). It is possible that the Mongo Database server is incorrectly configured, currently offline, blocked by a firewall, or experiencing network issues.
2013-12-10 15:40:27:  Full error: Unable to connect to server deadline03.scanlinevfxla.com:27017: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 172.18.1.107:27017. (FranticX.Database.DatabaseConnectionException)
2013-12-10 15:40:27:     at Deadline.StorageDB.MongoDB.MongoDBUtils.HandleException(MongoServer server, Exception ex)
2013-12-10 15:40:27:     at Deadline.StorageDB.MongoDB.MongoCloudStorage.GetCloudRegions(Boolean invalidateCache)
2013-12-10 15:40:27:     at Deadline.StorageDB.CloudStorage.UpdateData()

I tried it from pymongo, and that worked just fine:

>import pymongo
>client = pymongo.MongoClient("deadline03.scanlinevfxla.com", replicaSet="deadlineTestRS0")
>client.nodes
set([(u'deadline01.scanlinevfxla.com', 27017),
     (u'deadline03.scanlinevfxla.com', 27017)])
>client.test.test.find_one()

Then I took down deadline03:

>client.test.test.find_one()
Exception: AutoReconnect: could not connect to deadline03.scanlinevfxla.com:27017: [Errno 10061] No connection could be made because the target machine actively refused it

10 seconds later, when deadline01 became the primary, I tried again:

>client.test.test.find_one()

And it was fine.
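[Editor's sketch, not part of the original post] The manual "try, fail, wait, try again" sequence above can be wrapped in a small retry helper. This is a minimal sketch: the `retry` function and the fake `flaky_find_one` are hypothetical stand-ins so the example runs without a database. With pymongo you would presumably pass something like `client.test.test.find_one` as the operation and `pymongo.errors.AutoReconnect` as the exception to retry on.

```python
import time


def retry(operation, retry_on, attempts=5, delay=2.0, sleep=time.sleep):
    """Call operation(); if it raises retry_on, wait and try again.

    The last failure is re-raised once all attempts are used up.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise
            sleep(delay)


# Stand-in for a query that fails while the replica set elects
# a new primary and succeeds afterwards.
calls = {"n": 0}


def flaky_find_one():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("primary is down, election in progress")
    return {"_id": 1}


result = retry(flaky_find_one, retry_on=ConnectionError, delay=0)
print(result, "after", calls["n"], "calls")
```

The delay and attempt count are arbitrary here; in the test above the election took roughly 10 seconds, so real values would need to cover at least that window.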

Is there a replica set name I should be using to make Deadline recognize it as such?

Hey Laszlo,

You should just need to specify both servers, separated by a semicolon, in the dbConnect.xml file in the settings folder in the Repository. For example:

<Hostname>deadline01.scanlinevfxla.com; deadline03.scanlinevfxla.com</Hostname>

We should add support for the Replica Set Name property though, which will allow you to set up replica sets without needing to specify every single host in the dbConnect.xml file:
docs.mongodb.org/manual/referenc … on-string/

Cheers,
Ryan
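[Editor's sketch, not part of the original post] For comparison, this is roughly how the same host list would look as a standard MongoDB connection string with a seed list and the `replicaSet` option. The `replica_set_uri` helper is hypothetical, and this only illustrates what MongoDB drivers generally accept; it is not a description of how Deadline parses dbConnect.xml internally.

```python
from urllib.parse import urlencode


def replica_set_uri(hosts, replica_set=None, default_port=27017):
    """Build a mongodb:// URI from a list of seed hosts."""
    seeds = []
    for host in hosts:
        host = host.strip()
        # Append the default port when the host does not carry one.
        seeds.append(host if ":" in host else f"{host}:{default_port}")
    uri = "mongodb://" + ",".join(seeds) + "/"
    if replica_set:
        uri += "?" + urlencode({"replicaSet": replica_set})
    return uri


# The semicolon-separated value from the <Hostname> element above.
hostname_value = "deadline01.scanlinevfxla.com; deadline03.scanlinevfxla.com"

uri = replica_set_uri(hostname_value.split(";"), replica_set="deadlineTestRS0")
print(uri)
```

Listing more than one seed host means a client can still discover the current primary when the first host in the list is down, which matches the failure seen in the test above.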

Cool, I’ll test this today! The XML answers a question I forgot to ask: what happens if a machine starts a new process (deadlinecommand), which can’t query the node list from the main primary defined… But if they are all in the XML, that’s not a problem!

thanks
l

Does the xml get re-read every now and then? Or only on startup?

Only on startup.

Works beautifully!
