AWS Thinkbox Discussion Forums

central dispatcher

I would say that that was implied by the word faster, but it’s a fair qualification to be sure.

Agreed. There are definitely challenges to having every Slave participate in this process once the scale reaches a certain point. As part of our design process, we’ll be looking at ways to improve this signal-to-noise ratio. The most obvious approach is to not have all Slaves participate in the process.

Whether clients (e.g. Slaves) are talking to a database or talking to a dispatcher, they’re still doing so over a potentially faulty network connection. Compensating for state corruption is a ceaseless effort.

Just to play with numbers, if 2,000 slaves are each running 2 tasks in parallel, and the average task duration is 300 seconds, then a request for a new task would be coming into the dispatcher every 75 milliseconds on average. Can the dispatcher logic be evaluated every 75 milliseconds? Yes, that might be possible. What if this dispatcher is being used in a financial or industrial setting where the task volume is 1,000 times greater? Can it still keep up? Doubtful. True, multithreaded code would be faster since the locking can happen between objects in memory rather than in the database. But the dispatcher still needs to read and update the state, and there is still state contention with other components that need to interact with that state. Another question is, does it make sense for thousands of nodes to all talk to one dispatcher? How is that any better than thousands of nodes all talking to one database? It’s still a network bottleneck, and it’s still solved by horizontal scaling.
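
As a sanity check on those numbers, here is the arithmetic spelled out (a minimal sketch in Python; the slave count, concurrency, and task duration are just the figures quoted above):

    # Back-of-the-envelope arithmetic behind the "request every 75 ms" figure.
    slaves = 2000            # machines on the farm
    tasks_per_slave = 2      # concurrent tasks per machine
    avg_task_seconds = 300   # average task duration

    concurrent_tasks = slaves * tasks_per_slave                # 4,000 tasks in flight
    requests_per_second = concurrent_tasks / avg_task_seconds  # ~13.3 task completions/s
    ms_between_requests = 1000.0 / requests_per_second         # time between dispatch requests

    print(ms_between_requests)  # 75.0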

So, what are the facts on the ground? The dispatcher will always need to deal with some degree of state contention. We can imagine cases where a single dispatcher process might not be able to keep up with the volume of task requests. A lone dispatcher would be a network bottleneck the same way that a lone database is a bottleneck. There is a robustness case for running multiple dispatchers. My conclusion: Any future design needs to allow for dispatchers running in parallel. Of course, if someone wanted to dial the spinner down to “1”, they could have a truly centralized dispatcher.

I think there are two parts to the “central-ness” being spoken of. One part is the signal-to-noise issue you mentioned, which I think is likely resolved by tuning the number of machines participating in dispatching. The other part is the perspective of the logic as being slave-centric vs. job-centric vs. whatever-centric. Since the goal is simply to pair tasks with Slaves, if the logic is customizable, then it can be written from whatever perspective is best for the use case.

I prefer to move away from the black-and-white fallacy of central vs distributed and instead look at the problem as a spectrum from which an optimal setting can be tuned for each use case.

That’s a fair point, but in terms of state corruption, a centralized system (with the dispatching server and the data store residing on the same machine) should theoretically be able to validate incoming state updates for completeness before they are committed.

That’s a good question, but I don’t think the best approach would be to re-evaluate the entire state atomically for each request.

To play with some more numbers, Tractor’s documentation claims their engine can dispatch over 500 tasks per second. Even if that figure is inflated to a ridiculous degree, and the reality is that it can only do 1/3 of that (say, 165 tasks per second), that’s still an average of only 6 ms per allocation. With that in mind, you have to imagine they’re doing something clever, and I have a feeling the Postgres backend plays into that.
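
Working those numbers out explicitly (only the figures already quoted above, nothing measured on my end):

    # Tractor's claimed dispatch rate vs. a heavily deflated one.
    claimed_rate = 500             # tasks/second, per their documentation
    deflated_rate = 165            # roughly a third of the claim

    print(1000.0 / claimed_rate)   # 2.0 ms per allocation at the claimed rate
    print(1000.0 / deflated_rate)  # ~6.06 ms per allocation even when deflated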

Again, my answer to that would be the potential for reducing instances of state corruption and, in the case of a Mongo backend, a reduction in traffic contention (state updates could be queued in memory and committed in order by the dispatcher). It may be the case that Tractor blades actually write updates directly to the Postgres backend, and that it makes use of DB triggers to provide a form of bidirectional communication with the dispatch server. The blade process is written in pure Python, so that would be easy enough to check.

Yes, if you throw enough connections at any system, it will choke. However, a system with multiple dispatchers suffers a performance hit just for showing up, and even if you can turn that number down to 1, unless you optimize (heavily) for that case in the dispatcher code, you probably wouldn’t be able to outdo a purpose-built solo dispatching engine. The obvious counter-argument would be that a distributed model should be able to scale better, which I can agree with on paper, but so far that hasn’t been the case.

I can’t agree with you on the perceived fallacy. From my perspective, this conversation is black-and-white. You’re either using a (purpose-built) single dispatcher, or you aren’t. The logic of the dispatching engine itself (e.g. slave- vs. job- vs. whatever-centric) can be changed independently of the distribution model.

Thanks for keeping this discussion going.

I think this calculation is wrong. If it were right, you couldn’t have database servers serving large client bases either.
The question is not “can it handle a request every 75 milliseconds?” but rather, can it keep the farm busy with minimal downtime?
You might need to pool requests (even letting them wait 1-2 seconds), then process 2,000 requests all at once in 10-20 ms, and then hand the tasks back. In essence, that results in the super-low ~6 ms averages that Nrusch referenced about Tractor. Are slaves sometimes waiting a couple of seconds? Yes. Do you ultimately get excellent task-serve averages? Yes.
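
A minimal sketch of that pooling idea, purely to illustrate the shape (the function names and the 1-second window are my own placeholders, not anything Deadline or Tractor actually does):

    import queue
    import time

    REQUESTS = queue.Queue()  # slaves drop their "give me a task" requests here

    def dispatch_loop(fetch_ready_tasks, send_task, window_seconds=1.0):
        # Batch requests for up to window_seconds, then serve them in one pass.
        # fetch_ready_tasks(n) and send_task(slave, task) stand in for the real
        # queue query and reply channel; only the batching shape matters here.
        while True:
            time.sleep(window_seconds)             # let requests pool up (slaves wait 1-2 s)
            batch = []
            while not REQUESTS.empty():
                batch.append(REQUESTS.get_nowait())
            if not batch:
                continue
            tasks = fetch_ready_tasks(len(batch))  # ONE pass over the queue for the whole batch
            for slave, task in zip(batch, tasks):  # pair pooled requests with tasks
                send_task(slave, task)

The per-request cost then averages out to (one batched pass) / (batch size), which is how the serve averages can end up so low even though individual slaves occasionally wait a second or two.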

In the current approach, you have “2,000 dispatchers” requiring ALL of the db data to make educated decisions about what they should do. This generates an enormous amount of data, traffic, noise, and processing overhead. Our current dequeuing numbers vary between 30-240 seconds PER MACHINE per job. This “downtime” is on par with most of our Nuke render frame times. That means big $$ is sitting idle, sorting tasks on every machine, while also creating a huge load on the db server. I’m afraid to do the maths on how much money this costs in lost utilization.

We only have ~2,000 machines, with 10-20k jobs in the queue. By any standard, these are not high numbers for a database.

But with such chatty clients, it’s basically breaking MongoDB. And without a central dispatcher overlord, you have chatty clients.

The answer to a software architecture problem can’t simply be “buy faster servers, buy more servers.” We can’t plug more than 2x10-gigabit lines or faster SSDs into our Mongo server. Money can’t buy better machines than the ones we already have.

Multiple dispatchers vs single dispatcher IS a deal breaker.

With a single dispatcher, you open up the ability to make actually educated decisions and farm-level optimizations of your task distribution. With multiple dispatchers, you never will, since they again operate independently, unaware of each other’s decisions.

Just look at a simple example with limit groups. Machine X is blacklisted for 30 jobs and is part of 10 limit groups.
EVERY time it looks for jobs, it has to query ALL the jobs, cross-reference them with an updated list of limit groups, and then filter that down. Multiplied across every machine, that’s 2,000x the queries against all the jobs, limit groups, etc.

Were it a central dispatcher, it would query the limits ONCE, then query the jobs (1/2000th the query count), and sort them out to the appropriate slaves waiting to be served. It would be 1/2000th the db load. Add another dispatcher, and now you need to synchronize everything: for each query you need to make sure nothing was changed by the other dispatcher, adding back a lot of db re-queries to tables that, with a central dispatcher, would NEVER be changing (since the dispatcher alone is responsible for the distribution).
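
To put the two shapes side by side, here is a rough sketch (the query functions and pick_task are placeholders of my own, not the actual Deadline schema or API):

    def pick_task(slave, jobs, limits):
        # Placeholder for the blacklist / limit-group filtering and assignment.
        pass

    # Per-slave dequeuing: every machine re-reads the world for itself.
    def per_slave_round(slaves, query_all_jobs, query_limit_groups):
        for slave in slaves:                 # ~2,000 machines...
            jobs = query_all_jobs()          # ...each pulls ALL the jobs
            limits = query_limit_groups()    # ...and ALL the limit groups
            pick_task(slave, jobs, limits)   # then filters locally
        # db load: 2 * len(slaves) heavy queries per dequeue round

    # Central dispatcher: read once, fan the results out.
    def central_round(slaves, query_all_jobs, query_limit_groups):
        jobs = query_all_jobs()              # ONE query
        limits = query_limit_groups()        # ONE query
        for slave in slaves:
            pick_task(slave, jobs, limits)   # same filtering, shared data
        # db load: 2 heavy queries per round, roughly 1/2000th of the above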

Indeed, whenever components are more tightly coupled, the handling of state consistency is easier. I would stop short of saying that a centralized approach is the only way to get this tighter coupling and better state consistency. It’s more about the data architecture in general.

Right, there are obviously techniques that are more intelligent than a full dispatch logic run per request, such as running request fulfillment in batches and so forth. But these techniques are a form of vertical scaling (going faster rather than wider) and have an upper bound that can only be addressed through parallelization. Granted, sometimes vertical scaling is sufficient.

Perhaps it hasn’t been the case for 6.2 at a certain scale because of how chatty the Slave app is. This is improved in Deadline 7, and we’ll continue to improve our data architecture in Deadline 8. But in general, parallelization (or distribution) is the answer to scalability. Look across modern data processing: be they big data systems like Hadoop or big compute systems like high-performance computing clusters, they all achieve their massive performance through a distributed model. This is because, for the vast majority of problems in the data processing space, horizontal scaling (parallelization) is far, far less limited than vertical scaling (making the processor or algorithm faster). Even sharding in MongoDB is a form of parallelization. The key, of course, is building a data architecture that is congruent with the distribution model. This is easier said than done, of course.

There is the state and there is the logic that operates on the state.

Let’s talk about the state first. The degree to which operations on the state can be parallelized depends on how interconnected (and therefore contended) the state is. In the best-case scenario, all objects in the state set are independent and can be independently operated upon. In the worst-case scenario, the objects in the state set are super-connected and indivisible, such that the state set must be treated almost like a single object, forcing operations on it to be serial. But even in this worst case, there are usually independent subsets within the state that allow for pre-computation and thus parallelization, even if the final state-change operations must be serial. In practice, it’s rare for a state set to be super-connected and indivisible, especially as the dataset gets larger, which opens the door to distributed operations.
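
As a toy illustration of that point (nothing to do with Deadline’s actual state model; the objects and edges are made up), independent subsets of a state graph can be identified and then operated on in parallel, while only truly inseparable components force serial handling:

    from concurrent.futures import ProcessPoolExecutor

    def connected_components(objects, edges):
        # Group state objects into independent subsets via a tiny union-find.
        parent = {o: o for o in objects}
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]
                x = parent[x]
            return x
        for a, b in edges:
            parent[find(a)] = find(b)
        groups = {}
        for o in objects:
            groups.setdefault(find(o), []).append(o)
        return list(groups.values())

    def operate(component):
        # Placeholder for work on one independent slice of the state.
        return len(component)

    if __name__ == "__main__":
        objects = ["a", "b", "c", "d", "e"]
        edges = [("a", "b"), ("c", "d")]  # "e" is fully independent
        with ProcessPoolExecutor() as pool:
            results = list(pool.map(operate, connected_components(objects, edges)))
        print(results)  # each component was processed in parallel: [2, 2, 1]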

Then there is the logic that operates on the state set. Assuming state operations are properly protected, it’s very rare that an algorithm cannot be parallelized. In general, a properly constructed algorithm running in parallel with itself will scale, up to some point of diminishing returns. If this were not the case, then the cutting-edge data processing solutions of our day would not be using distributed models.

Absolutely. We are all interested in making Deadline the best it can be. And we all wish we could get there faster!

I think there’s an important distinction that needs to be made here, because I feel like you’re at least partially blurring (or ignoring) the line between what Deadline can be used for and the way it works. Yes, Deadline can be used to perform heavy-duty distributed data processing, but its scheduler does not fall under that umbrella. While I don’t have direct experience with Hadoop myself, I feel strongly that drawing parallels between a scheduler and something like it is highly misleading.

On an infinite continuum, with no constraints, the statement “parallelization is the answer to scalability” fits just about every problem. When you take the context of this discussion into consideration, it must be qualified with “…at the cost of performance.” For me, as an end user, this discussion is not about finding the best solution for unbounded scalability; Deadline is not Amazon Cloud. Rather, what the collective “we,” as users/studios with finite resources, are interested in is the sweet spot between scale and performance, and a 1:1 slave-to-dispatcher ratio is a long way from that. 1000:1 probably still isn’t there.

Why is that? Scheduling is right at the core of any large scale processing solution. If the scheduling logic cannot be parallelized, then there exists some scale at which it will become the bottleneck (yes, of course, unless there is some other unavoidable bottleneck that always gets hit first).

Parallelization always comes with overhead, but not necessarily at the cost of performance; otherwise there would be no point in parallelizing anything. It’s more like a Nash equilibrium, where sacrificing some local performance (such as by introducing locking overhead) permits higher aggregate performance (the ability to get n times as much done). I’m not arguing that there are no cases where n = 1 is the optimal solution; I’m just saying that those cases are in the minority.

Yes, elastic resources like public clouds throw a curveball into the equation. We already have customers asking us to enhance Deadline to scale to 20,000+ Slaves on clouds. So now we’re talking 20,000:1. Some of these customers want to run a single repository so that there are local Slaves and cloud Slaves in possibly multiple cloud regions. What happens when 20,000 cloud Slaves are talking to a local central dispatcher over an internet connection (or even 2,000 local Slaves talking to a central cloud dispatcher)? Most likely that much traffic will need to be routed through a proxy or broker within each cloud region. Wouldn’t it be handy if the regional broker could also manage some of the scheduling separately from the local dispatcher while still operating from a single repository/database? So that’s just another distributed scheduling model.

For me, the takeaway is that distributed scheduling will be the default model for Deadline. But, I also think that there is a need to allow for the possibility of a Slave to not do its own scheduling. If the door is open to a mode where a Slave can get its task allocations from somewhere else, and if custom scheduling logic is possible, I think all the ingredients are in place to allow for a central dispatching model for those that desire it.

I cannot tell you the exact shape of Deadline in the future, but these are the kinds of things we are exploring in our design stages.

Yes, but it’s a fraction of the work you actually want to do with said processing solution. It’s not about whether the scheduling logic can be parallelized. We’ve established that. However, putting an idea out there like “the scheduler should be distributed because Hadoop is distributed” is misleading in drawing lines between the problem and the wrong solution (or the solution and the wrong part of the problem). Sure, you could build a system where you had 1000 scheduling nodes delegating tasks to 10 worker nodes, but that doesn’t make it a good idea.

The problem Deadline is facing right now is essentially one of premature optimization targeted at the wrong problem, except that it can’t handle the case it’s trying to optimize for either.

What you’re talking about with the cloud proxies sounds interesting, but I think you may have missed my point. When I said, “Deadline is not Amazon Cloud,” I was implying that no one is trying to build an infinitely large render farm. Thus, trying to optimize for that case feels like a less-than-ideal allocation of resources.

I like the sound of this, and I’m looking forward to seeing how things progress in the future.

No, no, I’m not saying we should do it because they do it, nor am I trying to obfuscate or mislead anyone. It was more about the general approach to scalability and performance used in modern large-scale data processing apps. The design patterns for scalability are well established (which is why the examples I gave use them). We’re not inventing them; we’re just working to find the best ways to apply them. Building a scheduler that does not parallelize would be short-sighted. With canned scheduling logic, it’s possible to build a scheduler that has some arbitrary performance target for the dispatching rate. But if we allow custom logic, then some customers may have logic that can only dispatch at, say, 10 tasks per second, due perhaps to some special external query that needs to happen. My preference is a design for parallelized scheduling that would also allow for the possibility of a customized central dispatcher.
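
To make the “custom logic” idea a bit more concrete, a plug-in surface could hypothetically look something like this (entirely speculative on my part; none of these class or method names exist in Deadline today):

    from abc import ABC, abstractmethod

    class SchedulerPlugin(ABC):
        # Hypothetical plug-in point: pair pending tasks with idle slaves.
        @abstractmethod
        def assign(self, idle_slaves, pending_tasks):
            # Return a list of (slave, task) pairs to dispatch.
            ...

    class FifoScheduler(SchedulerPlugin):
        # Canned logic: hand tasks out in submission order, as fast as possible.
        def assign(self, idle_slaves, pending_tasks):
            return list(zip(idle_slaves, pending_tasks))

    class ExternalQueryScheduler(SchedulerPlugin):
        # Custom logic that consults an outside system and might only manage
        # ~10 assignments per second, as in the example above.
        def __init__(self, budget_lookup):
            self.budget_lookup = budget_lookup  # slow external call, injected
        def assign(self, idle_slaves, pending_tasks):
            pairs = []
            for slave, task in zip(idle_slaves, pending_tasks):
                if self.budget_lookup(task):    # e.g. a per-show budget check
                    pairs.append((slave, task))
            return pairs

The point is only that the same interface could sit behind a parallelized dispatcher, a batched one, or a single central one.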

Except that we have customers (plural!) who want to do exactly that - build infinitely-scalable render farms - and they prefer Deadline over other solutions. This is just one of the many challenging use-cases we need to encompass in our designs. What I’m probably not successfully conveying is that we cannot optimize Deadline to any one use case. Our go-forward challenge is partly to solve the data pipeline issues we see in DL 6 and partly to allow Deadline to be customized to the increasingly diverse use cases we are seeing. As a designer, my approach to that kind of problem is to push towards an architecture that opens up areas like scheduling and task routing so that we, or our customers, can plug in components that are tailored to the needs of the various use cases. A central dispatcher is a special-case subset of distributed scheduling, meaning we build for the wider case and allow for the narrow.
