Hi there,
We are looking into sharding the mongo db, and wanted to ask you guys if you have any experience with it yet…
Im curious:
- if you have multiple shards on a single machine, does ram usage increase linearly? Say, non sharded db instance uses 40gigs of ram. Would 2 sharded instances use 2x20gigs, or 2x40gigs?
- same question for disk space usage… with high performance SSD raids, you get speed but not abundant space, so it becomes a factor in planning
- can a non-sharded db easily be sharded live, or you have to start a new db?
- how does deadline behave in a sharded environment? Any experience on that field?
Hey Laszlo,
Please see my comments below.
- Each shard runs its own mongod process which uses RAM based on the working set of that Mongod instance. So every mongod doesn’t use same RAM and it depends upon the working set and data usage. But it is highly recommended to have each shard on separate server.
- The purpose of sharding is to add disk space dynamically by adding additional servers (shard) as and when required. There is an internal process in MongoDB which manages the space and move the chunks of data across shards. by default chunk size is set to 64 MB but you can always change that. Also you can set the max size of the shard, which will stop balancer to move chunks to the shard when its max size is reached.
- Yes, existing DB can be sharded but for that some downtime is required to configure sharding.
- We have tested the sharding on Deadline and it has improved the performance. We would recommend to use Jobs, JobTasks and LimitGroups collection to be sharded with “_id” as the shard key.
Also you need to have multiple Mongos instance to run which will route the query from deadline application to MongoDB. Mongos is a lightweight process so you can run that on any application server or VM instance and then use the address of these mongos to connect to db through repository.
Thanks!
Vibhu
Thanks for the responses Vibhu!
Our primary goal with sharding would be to reduce connections to the individual servers, thus decrease locking and increase performance. Drive space is a factor, but not a major one. Using the _id field would randomly distribute the jobs across shards, meaning that most operations (like slaves dequeuing jobs) would have to query all shards. Which would result an equal number of connections as we currently have on each server.
Is that correct?
1: ok cool, we will investigate the separate server. We scaled our central server vertically, so it has a lot of unused resources right now that make it an easy choice for multiple sharding
re: 4: shard keys:
JobTasks:
What do you think about using the jobId field? That way all tasks of one job would be on the same shard. If the _id field is used, the tasks of a job would be randomly distributed among different shards, essentially forcing every client to connect to all shards when querying all the tasks for a single job (say, monitor update).
However, using the jobID field, all tasks of a single job being on one shard, most typical deadline task related operations would result in reduced connection count (dequeue next task of the same job, clicking on a job in deadline monitor to get its task listing etc).
Hey Laszlo,
Yes, it will create connections to different server for getting the jobs. since each server has its own db and there will be locking on that specific db only and mongos will be able to query jobs from other server. It is basically extension of splitting dbs. To decrease the number of connections per server, MongoDB recommends to have separate shard on separate server.
You cannot use JobID as shard key for JobTasks collection as deadline uses “_id” for most of the read operations and MongoDB requires shard key to be passed with the query during write operation. In deadline7 we have changed the _id to have JobID prefixed to it so that way it will have all the tasks for particular job on same server.
-Vibhu
Thanks Vibhu, when we are getting closer to a shard rollout, i might again post some questions