
Working with remote queues

One of the biggest unsolved problems for us when looking at integrating Deadline into our workflow is the difficulty of working with remote repositories (like… really remote). I find it hard to believe we would be the only studio dealing with this, so I’m curious how others have approached it. I’m also wondering if anyone else is dealing with a high-latency remote location that doesn’t play nice with a direct filesystem mount.

I know I’ve mentioned this before (probably during one of the previous beta cycles), but it would be really great if the Monitor could be pointed directly at a Mongo server, rather than relying on disk access to the repository just to see the queue. Obviously some operations couldn’t be performed in that mode, but eventually it seems like Pulse would be a logical candidate to act as a “man-in-the-middle” for things like executing scripts remotely, serving task logs, etc. Having to set up a stub repository directory just to tell Deadline where to find the server isn’t really a sustainable solution.

Anyone care to chip in here? Are we really the only ones working with extreme amounts of distance between locations?

Thanks in advance.

Ultimately, we would like to support this as an option, and it’s something we’re in the very early stages of investigating.

For now, using a stub repository essentially gives you the behavior you would see if we supported a direct connection today: the database connection information is read from the local stub, but you can’t run scripts or view task logs. Some clients of ours have used this approach when their server connection was flaky, and for others the connection is stable enough that this isn’t a problem.

What are your issues with using the stub repository for now?

Cheers,
Ryan

Anyone care to chip in here? Are we really the only ones working with extreme amounts of distance between locations?<<

you aren’t the only one, and people have had a number of approaches. you can, for example, run the repository as a synced copy, with each location pointing to its own local repository.

cb

Oops, that’s actually incorrect. As long as your scripts are synced up in the local repository, this wouldn’t be an issue.

Keeping things synced properly is the big one, and something I’m just not interested in dealing with in production.

The rearchitecting of Deadline around the Mongo server is crying out for a server-centric approach for the repository (as opposed to directory-centric). I have to imagine this is a situation that’s only going to become more and more widespread, and there are some solutions out there (Tractor, for instance) that make this kind of remote administration really seamless (including serving task logs).

It’s good to know that this is on the roadmap. The ability to execute scripts remotely is a nice surprise as well, actually.

Also, just out of curiosity, now that Deadline is reliant on a server process (thus eliminating the pre-6.0 possibility for slaves to operate “in the dark”), has there been any thought given to reorganizing the scheduling behavior around a central dispatcher?

Prior to version 6, the slaves never really worked “in the dark” because your file server had to be online for Deadline to operate. The key thing was that our slaves worked independently and didn’t depend on a central scheduler to hand out tasks. We designed it this way because as a VFX studio (which we were at the time), we simply couldn’t afford to worry about the central scheduler going down and bringing the farm to a halt.

The only thing that has changed in version 6 is the additional requirement of having the database online as well. However, replica sets can be used to back up the database and automatically fail over if the primary database goes down. We also no longer need Pulse as a proxy for larger farms, since the database can handle the load. So the slaves still work independently, just like they did in v5, and we really haven’t seen a need to change this.
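(For illustration of the failover behavior described above: a minimal pymongo sketch of connecting to a replica set, where the driver re-routes to a newly elected primary if the current one drops out. The host names, replica set name, and database/collection names are placeholders, not Deadline’s actual connection settings.)

```python
# Illustration only: a client connection that survives primary failover in a
# MongoDB replica set. Host names, the replica set name ("rs0"), and the
# database/collection names are placeholders, not Deadline's real settings.
from pymongo import MongoClient

client = MongoClient(
    "mongodb://db01.example.com:27017,db02.example.com:27017,db03.example.com:27017",
    replicaSet="rs0",
)

# Writes and (by default) reads go to the current primary; if it drops out,
# the driver reconnects to whichever member wins the next election.
print(client.deadlinedb.Jobs.count_documents({}))
```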

Cheers,
Ryan

I wanted to follow up and mention a couple of things related to working with remote queues that are still pretty awkward.

  • Pulse serving of logs is a Monitor option rather than a connection option, which I found surprising. This means that checking on a remote farm becomes a two-step process (not only do I have to redirect the Monitor, but I then have to open the options and enable “Stream Job Logs from Pulse”).
  • Only one instance of the Monitor can be running at a time. This means that we can’t have someone wrangling all of our jobs and slaves at once (for instance, if we’re using Melbourne’s farm while they’re out of the office), and if someone needs to take a look at the remote jobs, they have to go through the process described above.

I guess one way to help ease the pain of transitioning between local and remote repositories would be to leave the option to stream logs through Pulse on all the time, but I don’t know what sorts of performance implications that might have at a large scale (e.g. if everyone in the building had it on).

As far as the single-use Monitor instance goes, is there anything that can be done to help solve this problem? I’m guessing this was put in place to cut down on unnecessary server traffic, but are there other reasons?

Thanks

nathan

i still don’t understand why you have two logical farms rather than one - is that a decision you have made on purpose? why can’t you have one mongo database serving both locations?

cb

I realize that the degree to which we have been yammering for these remote usability features probably seems somewhat silly, but believe me when I say that running two separate farms really is the most sustainable and least risky configuration for the studio setup we’re dealing with.

Here are the main reasons for this decision:

  • Latency. This is why we use separate file systems, separate production database servers (albeit with master-master replication), separate license servers, etc. We’re looking at 200ms consistently between here and there, and Deadline does not handle it gracefully, even with an empty farm.
  • Fault tolerance. If someone drives a submarine through the undersea fiber line, both studios will continue to operate.

In terms of latency, things like write lock juggling between the local and remote locations will just make the situation worse (even for people sitting next to the Mongo server), since the entire Deadline database gets locked for every tiny status update… Imagine 250-300 slaves all trying to update job, task, and slave statuses in the DB in the same second. Replica sets may help alleviate some issues (mainly feedback-related, I’m guessing), but I don’t actually know if there are any gotchas to using Deadline with replica sets, or if the amount of regained performance would be enough to make it worthwhile (I know they are mentioned briefly in the installer documentation).
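(A rough way to watch the contention being described, for anyone who wants numbers: poll serverStatus and print the global lock queue. Sketch only; the host is a placeholder, and the exact field layout varies between MongoDB versions.)

```python
# Sketch: poll MongoDB's serverStatus once a second and print how many
# operations are queued behind the global lock. The host is a placeholder,
# and the globalLock/currentQueue fields reflect the 2.x-era layout under
# discussion; newer versions expose finer-grained lock stats.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://deadline-db.example.com:27017")

while True:
    status = client.admin.command("serverStatus")
    queue = status.get("globalLock", {}).get("currentQueue", {})
    print("queued readers: %s, queued writers: %s"
          % (queue.get("readers"), queue.get("writers")))
    time.sleep(1)
```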

Now, we obviously still want to be able to make use of the remote farm when possible, but we’re trying to do that without hurting usability any more than necessary. We have our own tools for synchronizing assets between locations, and things like the Pulse REST API make programmatic submission, deletion, etc. pretty seamless. But from an end-user standpoint, the farm management situation with Deadline is probably still going to be the weakest link in the remote workflow once we get it into production. We may actually end up needing to set up a VM for our main wrangler so he can monitor both farms simultaneously… It’s just that important.

P.S. Even if we did use a single large farm and the latency wasn’t a problem, how would the remote log serving work? To be more specific, how would client X know which Pulse server to ask for the log to a given task?

You can use the -new command line argument to launch additional Monitors, and you can use it in conjunction with the -repository option to connect to different repos. Maybe it would be an option for you to wrap these in separate shell scripts?
thinkboxsoftware.com/deadlin … le_Options
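(A sketch of the wrapper-script idea, using the -new and -repository options just mentioned. The executable name and stub path are assumptions for illustration, not documented values.)

```python
# Hypothetical "remote monitor" launcher: start an additional Monitor
# instance pointed at the remote stub repository. The executable name
# ("deadlinemonitor") and the stub path are placeholders.
import subprocess

REMOTE_STUB_REPO = "/mnt/deadline/remote_stub"   # hypothetical stub location

subprocess.Popen([
    "deadlinemonitor",               # assumed Monitor executable name
    "-new",                          # allow a second Monitor instance
    "-repository", REMOTE_STUB_REPO, # point it at the remote stub repo
])
```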

It connects to the Pulse that’s configured in the Repository Options for the repository that it’s connected to. So if you’re connected to the remote repo, it will pull logs from the Pulse running against the remote repo.

How many users would need this enabled? We were under the impression that only wranglers would need this, in which case having it enabled all the time shouldn’t be an issue.

Cheers,
Ryan

In terms of latency, things like write lock juggling between the local and remote locations will just make the situation worse (even for people sitting next to the Mongo server), since the entire Deadline database gets locked for every tiny status update… Imagine 250-300 slaves all trying to update job, task, and slave statuses in the DB in the same second. <<

this should be reduced dramatically in 7. :smiley:

We’re looking at 200ms consistently between here and there, and Deadline does not handle it gracefully, even with an empty farm<<

i would like to have the guys see this and try. can we replicate it? can you set up a mongo at your remote location and have the guys in winnipeg test it? i ask because we developed deadline 6.0 while testing on remote repositories on purpose. this might just be a matter of tweaking some settings, or it might demonstrate some weakness we have to solve.
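(If it helps quantify the 200ms figure before anyone sets anything up, a quick ping loop against the remote Mongo server gives a rough round-trip number. Sketch only; the host is a placeholder.)

```python
# Rough round-trip measurement against a remote MongoDB server; the host is
# a placeholder and "ping" is MongoDB's built-in no-op admin command.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://remote-deadline-db.example.com:27017")

samples = []
for _ in range(20):
    start = time.time()
    client.admin.command("ping")
    samples.append((time.time() - start) * 1000.0)

print("min %.1f ms / avg %.1f ms / max %.1f ms"
      % (min(samples), sum(samples) / len(samples), max(samples)))
```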

replication/sharding<<

7 will have support for shardsets. this might be an additional option…

cb

Ah, I had no idea about the -new flag. I’ve perused the command line options for all of the other client applications at one point or another, but hadn’t thought to check the Monitor. Combined with the -repository flag, that should make things much simpler.

Right, but hypothetically, what if I was running one large queue with split filesystems and one Pulse running on each side? Or would the expectation be that all of the logs be synced between locations as they were generated?

We’re probably looking at 10 to 12 users needing it enabled all the time on either side, so worst-case, let’s say 30 people hitting Pulse simultaneously (local + remote + some padding). However, that doesn’t take artists into account, and there may very well be times when some of them need to keep an eye on remote jobs as well. Part of the reason I think it would be nice to make that a connection-level option (or maybe a command-line parameter?) is so that people who need to keep an eye on both farms can enable it only in the Monitor instance viewing the remote side.

Hmm… Is the backend storage being split up across multiple databases, rather than just collections? Or are you talking about application-level optimizations? I understand if you need to keep things vague…

I’ll certainly ask our sysadmins in Mel if they can set up an external-facing Mongo server for you to play with. I know they’re stretched pretty thin over there, but I agree that it might be helpful for you guys to see what we’re dealing with here. Winnipeg to L.A. this is not. :wink:

However, even if the latency issues were mitigated in terms of the Deadline-Mongo interface, we’re still not going to be able to mount the filesystem remotely, and we need to continue to be able to tolerate connection failures, so I don’t want this to start anyone down a path that would ultimately lead to the same place.

As always, I appreciate the ongoing discussion on these issues.

Hmm… Is the backend storage being split up across multiple databases, rather than just collections? Or are you talking about application-level optimizations? I understand if you need to keep things vague…<<

yup, both. i think that there have been some small optimizations, but mongo will be split up.

However, even if the latency issues were mitigated in terms of the Deadline-Mongo interface, we’re still not going to be able to mount the filesystem remotely, and we need to continue to be able to tolerate connection failures, so I don’t want this to start anyone down a path that would ultimately lead to the same place.<<

replica sets/sharding? i’m not the expert on this, but can’t we create a mongo shard or replica local to each location that will keep in sync and if things go down, continue to operate and then get ‘caught up’? or a location aware shard? i’m posting this here as a discussion point with ryan.

cb

Ultimately, the problem boils down to an unreliable file system connection between offices, and even if there were still large gains to be made from tweaking the database, the file system mount would still be a problem. Having local mirrored repositories in each office can work around this issue for everything but logs and job auxiliary files, and the latter isn’t a concern here because you guys have your own syncing mechanisms. So if we can find a good solution for the log problem, that should help you guys run more smoothly, right?

The Pulse log streaming was a quick first step, since it was something we could put into 6.2 without much risk or refactoring, but I’m sure there are ways to improve upon it. There is the idea you had about just making it a command line option, so you could set it for remote connections and leave it disabled for local ones. That’s easy from a development perspective, so if that will work for you guys, we’ll do it.

The reason we didn’t go with a connection-level option was that if you were to point to a remote Pulse, it would give you back the repository path of the repository it’s connected to. That’s not good for pulling scripts, plugins, events, etc. Maybe there is a solution where you run multiple Pulses in a remote location, with one configured to handle local connections and one configured to handle remote connections. However, that’s not something we could explore for 6.2, and some design time would have to be invested in such a system. I know we eventually want to support a redundant Pulse setup as well, so running multiple Pulses could also have the benefit that if one goes down, another can step in for things like housecleaning and pending job scans.

Neither of these is really an option in this case. Sharding alone won’t help, because each shard only holds part of the database; sharding is meant for scaling out and building a cluster. Tag-aware sharding could be set up on the database side of things, but then jobs and slaves would have to be named appropriately so that they were placed in the appropriate regional shard. However, that still doesn’t help when viewing the remote data (which is what you would be doing if all the data is in the same database). Replication is just for backup and failover purposes. Deadline doesn’t support reading from replica sets, and I don’t think we want it to, because you would never be guaranteed to see the most up-to-date information. It would be one thing if you were just viewing the data, but because you can interact with it, you don’t want to be working with data that is minutes behind.

Pretty much. We can deal with managing the repository setup (i.e. having to maintain and sync stub repository directories for remote farms), but having a seamless workflow for artists, managers, etc. is going to be very important.

Yeah, I definitely understand that there are limits to what you can squeeze into any one release. I’m constantly brainstorming things like this in the hopes of potentially continuing to improve the workflow over time, but I’m certainly not expecting instantaneous results. Squeezing the remote log serving via Pulse into 6.2 was a surprisingly quick turnaround, and one everyone here will appreciate. :slight_smile:

A command line option seems like it would make for a useful addition though. Having the ability to turn on the Pulse-served logs for one instance of the Monitor via a launch option would let us package up a “remote monitor” launcher script to do everything in one shot.

Yeah, I see the problem there… the remote Pulse is only aware of its local copy of the repository, but the client is pointed at a stub version that’s in a directory with a different name. I can’t think of a particularly elegant solution off the top of my head, so that’s probably the kind of problem to come back to if necessary.

As far as the load of serving remote logs goes, what’s a threshold for simultaneous users beyond which you would start to get concerned about Pulse’s performance? If it makes a difference, we will likely be disabling the options to run the house-cleaning operations in separate processes, due to some previous issues with hanging processes.

Great! That’s what we’ll do next then. It might make sense to have a flexible command line option that lets you set the values for any of the available Monitor options on startup.

Something that might work for now is to have your wrapper script update the “Deadline Monitor 6.ini” file, which should be in ~/.config/Thinkbox, before launching the Monitor. The ini file has a [General] section, and the property name is StreamLogsFromPulse. Set it to true to enable and false to disable.
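(Building on the -new/-repository launcher sketched earlier, roughly what that wrapper might look like. The ini path, [General] section, and StreamLogsFromPulse property come from the description above; the executable name and stub path are still placeholders.)

```python
# Sketch: flip StreamLogsFromPulse on in "Deadline Monitor 6.ini", then
# launch an additional Monitor pointed at the remote stub repository.
# Executable name and stub path are placeholders.
import configparser
import os
import subprocess

ini_path = os.path.expanduser("~/.config/Thinkbox/Deadline Monitor 6.ini")

config = configparser.ConfigParser()
config.optionxform = str                      # keep property names as-is
config.read(ini_path)
if not config.has_section("General"):
    config.add_section("General")
config.set("General", "StreamLogsFromPulse", "true")

with open(ini_path, "w") as f:
    config.write(f, space_around_delimiters=False)

subprocess.Popen(["deadlinemonitor", "-new",
                  "-repository", "/mnt/deadline/remote_stub"])
```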

Handling the 30+ users you mentioned shouldn’t be a problem. It’s using the same mechanism that Pulse used back when it acted as a proxy for the Monitor and Slaves, and it could handle hundreds of connections without a problem. There isn’t any computational load on Pulse, since it just serves up the log file that is requested. The logs are also served by a different thread than the one used for housecleaning, and they use a thread pool so that multiple logs can be served at the same time without worrying about creating more threads than Pulse can handle.
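(Purely to illustrate the thread-pool pattern described above, not Pulse’s actual implementation: a bounded pool serving log reads concurrently, with the directory and file names made up.)

```python
# Toy illustration of the pattern above: a fixed-size pool of workers reads
# requested log files, so many logs can be served at once without creating
# an unbounded number of threads, and housecleaning stays on its own thread.
import os
from concurrent.futures import ThreadPoolExecutor

LOG_ROOT = "/var/log/deadline"                  # placeholder log directory
log_pool = ThreadPoolExecutor(max_workers=8)    # cap on concurrent log reads

def read_log(name):
    path = os.path.join(LOG_ROOT, os.path.basename(name))
    with open(path, "rb") as f:
        return f.read()

def serve_log_async(name):
    """Queue a log read; callers get a Future to wait on."""
    return log_pool.submit(read_log, name)

# Many simultaneous requests share the same eight workers.
futures = [serve_log_async("job1234_task%02d.log" % i) for i in range(30)]
logs = [f.result() for f in futures]
```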

Cheers,
Ryan

That’s great to know, thanks.
