I'm wondering if the behavior of idle slaves could be tweaked to improve DB performance? Currently, if we have a mostly idle farm, the monitor essentially becomes unusable. Everything takes minutes to execute…
Our effective lock hovers around 50-60% constantly, queue count is around 700… normally it should be <100…
This Monday morning I had about 5-6 panic emails from all levels of production that “something is wrong with Deadline”, because they can’t navigate the monitor. Looking at it, it’s an idle farm, and the DB is having a really hard time coping with that.
What if slaves would throttle down their requests if they have been idle a while?
Another strong argument for a central dispatcher?
There’s already a repo setting for how often Idle Slaves check for jobs… It’s under Repo Options -> Slave Settings -> “Number of Seconds between queries for new Jobs while the Slave is Idle”
So a single idle slave in a busy environment that - due to its group/pool settings - doesn’t have anything to do for a couple minutes, currently uses the same value that 1000+ idle slaves would use when there are no jobs to render at all…
We have this set at 3x the default, at 90s already…
I’m thinking along the lines of: “Been ‘warm’ idle for 15 minutes, let’s go to ‘cold’ idle where I only ping the DB once every 5 mins”.
Almost like a sleep mode from which the slave has to wake up to get back to a warm state (if there were a central dispatcher, it could say “time to wake up, stuff is coming your way”, with absolutely NO in-between traffic while there is nothing to do…).
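To make the idea concrete, here's a minimal sketch of what that warm/cold tiering could look like. The names and thresholds are purely illustrative, not actual Deadline settings:

```python
# Hypothetical sketch of the "warm" vs "cold" idle polling idea.
# All constants and names are illustrative assumptions, not Deadline's API.

WARM_INTERVAL = 90        # seconds between job queries while "warm" idle
COLD_INTERVAL = 300       # seconds between queries once "cold" (5 mins)
WARM_TO_COLD_AFTER = 900  # 15 minutes of continuous idling before throttling

def next_poll_interval(idle_seconds):
    """Return how long an idle slave should wait before its next DB query."""
    if idle_seconds < WARM_TO_COLD_AFTER:
        return WARM_INTERVAL
    return COLD_INTERVAL

# A slave idle for 5 minutes still polls every 90s; after 15+ minutes it
# backs off to one query every 5 minutes.
```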
There’s still nothing stopping every machine (or even half or a third of them) from hitting the server at the same time, or within a few ms of each other. Worst-case result: all it does is make the farm less responsive to new jobs, without changing the load patterns. More likely result at a large scale? It converts a constant moderate-to-heavy DB load into periodic (semi-)crippling load spikes.
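The usual mitigation for this kind of synchronized-poll spike is random jitter on each interval, so slaves drift apart instead of clumping. A minimal sketch (again illustrative, not Deadline's actual dequeue code):

```python
import random

def jittered_interval(base_seconds, jitter_fraction=0.25):
    """Spread a slave's next poll uniformly around the base interval,
    e.g. 300s +/- 25%, so a fleet of idle slaves de-synchronizes over time."""
    low = base_seconds * (1.0 - jitter_fraction)
    high = base_seconds * (1.0 + jitter_fraction)
    return random.uniform(low, high)
```

With a 5-minute cold interval, each slave would poll somewhere between 225s and 375s, which keeps aggregate DB load roughly flat instead of spiky.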
The way we’ve set up the Slave queries, and the way Mongo works, should lead to these requests dispersing over time rather than clumping into spikes. You should be seeing a fairly even load distribution when increasing this setting – and this should be more true with higher slave counts, not less so.
If this isn’t actually the case in practice, though, we’ll definitely have to revisit the way we’ve structured the Slave dequeue timing.