I would say that was implied by the word “faster,” but it’s a fair qualification to be sure.
Agreed. There are definitely challenges to having every Slave participate in this process once the scale reaches a certain point. As part of our design process, we’ll be looking at ways to improve this signal-to-noise ratio. The most obvious approach is to not have all Slaves participate in the process.
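To make that concrete, here is a minimal sketch of one way participation could be thinned out. Everything here is illustrative (the `PARTICIPATION_RATE` knob and `should_participate` name are assumptions, not part of any existing design):

```python
import random

# Hypothetical knob: the fraction of Slaves that take part in dispatching.
# Dialing this down cuts the request noise without changing the protocol.
PARTICIPATION_RATE = 0.10

def should_participate(rate: float = PARTICIPATION_RATE) -> bool:
    """Each Slave independently decides whether to join this dispatch round."""
    return random.random() < rate
```

Each Slave makes the decision locally, so no coordination is needed to shrink the participating set.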
Whether clients (e.g. Slaves) are talking to a database or talking to a dispatcher, they’re still doing so over a potentially faulty network connection. Compensating for state corruption is a ceaseless effort.
Just to play with numbers: if 2,000 Slaves are each running 2 tasks in parallel, and the average task duration is 300 seconds, then a request for a new task would arrive at the dispatcher every 75 milliseconds on average. Can the dispatcher logic be evaluated every 75 milliseconds? Yes, that might be possible. What if this dispatcher is being used in a financial or industrial setting where the task volume is 1,000 times greater? Can it still keep up? Doubtful. True, multithreaded code would be faster, since the locking can happen between objects in memory rather than in the database. But the dispatcher still needs to read and update the state, and there is still contention with other components that need to interact with that state. Another question: does it make sense for thousands of nodes to all talk to one dispatcher? How is that any better than thousands of nodes all talking to one database? It’s still a network bottleneck, and it’s still solved by horizontal scaling.
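For the curious, the 75-millisecond figure falls out of a simple back-of-the-envelope calculation. Here it is as a small Python sketch, using the numbers above:

```python
def mean_request_interval(slaves: int, tasks_per_slave: int,
                          task_duration_s: float) -> float:
    """Average time between task requests arriving at the dispatcher, in seconds."""
    concurrent_tasks = slaves * tasks_per_slave               # 2,000 * 2 = 4,000
    completions_per_second = concurrent_tasks / task_duration_s  # ~13.3 per second
    return 1.0 / completions_per_second

print(mean_request_interval(2_000, 2, 300))      # 0.075 s, i.e. 75 ms
print(mean_request_interval(2_000_000, 2, 300))  # 7.5e-05 s at 1,000x the volume
```

At 1,000 times the volume, the dispatcher would need to field a request every 75 microseconds, which is where “doubtful” comes from.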
So, what are the facts on the ground? The dispatcher will always need to deal with some degree of state contention. We can imagine cases where a single dispatcher process might not be able to keep up with the volume of task requests. A lone dispatcher would be a network bottleneck the same way that a lone database is a bottleneck. There is a robustness case for running multiple dispatchers. My conclusion: Any future design needs to allow for dispatchers running in parallel. Of course, if someone wanted to dial the spinner down to “1”, they could have a truly centralized dispatcher.
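As a rough illustration of what “dispatchers running in parallel” might look like, here is a sketch of the familiar atomic task-claim pattern against a shared database. The `tasks` table, its columns, and SQLite itself are all assumptions chosen for illustration, not a proposed design:

```python
import sqlite3

def claim_task(conn: sqlite3.Connection, dispatcher_id: str) -> bool:
    """Atomically claim one pending task.

    Safe with several dispatchers running in parallel: the UPDATE only
    succeeds for whichever dispatcher gets to the row first, so losers
    simply retry. (Hypothetical schema: tasks(id, state, owner).)
    """
    cur = conn.execute(
        """
        UPDATE tasks
           SET state = 'claimed', owner = ?
         WHERE id = (SELECT id FROM tasks WHERE state = 'pending' LIMIT 1)
           AND state = 'pending'
        """,
        (dispatcher_id,),
    )
    conn.commit()
    return cur.rowcount == 1  # False means another dispatcher got there first
```

The state contention doesn’t go away, but it is pushed down to a single atomic operation, which is exactly what lets the number of dispatchers be a tunable rather than a fixed choice.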
I think there are two parts to the “central-ness” being discussed. One part is the signal-to-noise issue that you mentioned, which I think is largely resolved by tuning the number of machines participating in dispatching. The other part is the perspective of the logic: slave-centric vs. job-centric vs. whatever-centric. Since the goal is simply to pair tasks with Slaves, if the logic is customizable, then it can be written from whatever perspective best fits the use case.
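To illustrate that last point, here is a minimal sketch of customizable dispatch logic as a pluggable policy function. All of the names and the two example policies are hypothetical:

```python
from typing import Callable, Optional

# Hypothetical stand-ins: a Task and a Slave are whatever the deployment defines.
Task = dict
Slave = dict

# A dispatch policy is just a function that pairs one task with one slave.
DispatchPolicy = Callable[[list[Task], list[Slave]], Optional[tuple[Task, Slave]]]

def slave_centric(tasks: list[Task], slaves: list[Slave]):
    """Pick the idlest Slave first, then find it a task."""
    if not tasks or not slaves:
        return None
    slave = min(slaves, key=lambda s: s.get("load", 0))
    return tasks[0], slave

def job_centric(tasks: list[Task], slaves: list[Slave]):
    """Pick the highest-priority task first, then find it a Slave."""
    if not tasks or not slaves:
        return None
    task = max(tasks, key=lambda t: t.get("priority", 0))
    return task, slaves[0]

def dispatch(policy: DispatchPolicy, tasks: list[Task], slaves: list[Slave]):
    """The dispatcher core stays the same; only the policy changes."""
    return policy(tasks, slaves)
```

Swapping `slave_centric` for `job_centric` changes the perspective of the logic without touching the machinery around it.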
I prefer to move away from the black-and-white fallacy of central vs. distributed and instead look at the problem as a spectrum along which an optimal setting can be tuned for each use case.