Wish: make the Deadline Monitor faster

Please make Deadline faster. I mean the Launcher app. It’s slow even on a 1 Gbps line at the studio, and over a VPN (8 Mbps) it’s terrible! Even the “remote mode” doesn’t help.

Generating a submission file takes about 30s, which, honestly, is ridiculous! If the file is about 1KB or, hell, even 100KB, why does it take 30s to upload to my server over an 8 Mbps line?! I’ll never understand this.

An ideal solution (imho) would be a web Launcher GUI connected to a database behind Apache or IIS.

Thank you!

We are exploring ideas to make Deadline perform better over a VPN connection (and in general). Deadline wasn’t originally designed with remote connectivity in mind, and Remote Mode is a solution to help alleviate the problem when simply monitoring a remote farm (it doesn’t help job submission). However, there’s a good chance changes may be coming in the next major Deadline release…

Job submission is slower than you’d expect because the job XML file and the task files for the job are created on the repository, not on the client. They are created in the repository’s temp folder, so that a “half submitted” job isn’t sitting in the jobs folder. Once the files have been created, the entire job folder is moved from the temp folder to the jobs folder. So there is more going on under the hood than just transferring a 100KB file over the connection. :wink:
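Conceptually, the flow looks something like the sketch below (this is just an illustration of the pattern, not Deadline’s actual code, and the paths and file names are made up). When the repository is mounted over a VPN, each of these file operations is a round trip over that link:

[code]import os, shutil

repo = r"\\server\DeadlineRepository"   # repository share; path is hypothetical
temp_dir = os.path.join(repo, "temp", "job_0123")
os.makedirs(temp_dir)

# the job XML plus one file per task are written into the repository's temp folder
with open(os.path.join(temp_dir, "job_0123.xml"), "w") as f:
    f.write("<Job>...</Job>")
for task in range(100):
    with open(os.path.join(temp_dir, "task_%04d" % task), "w") as f:
        f.write("...")

# only once everything is in place is the whole folder moved into the jobs folder
shutil.move(temp_dir, os.path.join(repo, "jobs", "job_0123"))[/code]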

Cheers,

  • Ryan

But all of that is happening on the server side, isn’t it? I have a dual-CPU server with about 48GB of RAM; that should be more than enough for generating a simple XML file. :confused:

The client (deadlinecommand) is creating and moving files on the server side (the repository), and all of this is done over the VPN connection. That’s the bottleneck. There is a workaround we implemented for internal purposes, and it’s called a drop job:

[code]C:\Users\Ryan>deadlinecommand help SubmitDropJob

SubmitDropJob                 Submits a drop job.
  [-compress]                 Optional. If specified, the job files will be
                              compressed before submitting the drop job
  [<Job Files>]               The job files[/code]

This submits a stub job to the repository in the jobDrop folder. When Pulse does its repository cleanup, it will submit any drop jobs so that they can be rendered. In this case, the task files are created by Pulse on the server side, which definitely helps submission performance. This isn’t built into any of our submitters though, so you would have to modify them to use SubmitDropJob.
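For example, based on the help output above (the file names here are hypothetical), a drop job submission from the command line would look something like this:

[code]deadlinecommand SubmitDropJob -compress job_info.job plugin_info.job[/code]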

Nice! Thanks for the tip, I’ll try that.

Also, could you, please, make the Monitor multithreaded? The constant unresponsiveness while it awaits data is getting on my nerves. :slight_smile:

If you also managed to make all the individual ListViews multithreaded, so that when one is waiting for data the others stay responsive, and have the waiting view show a little “progress” icon animation so that the user knows it’s waiting for something, that’d be gold. :wink:

A more responsive monitor is definitely something we have on our radar. Again, this is more of a symptom of using the monitor over the VPN, and the ideas we’ve been looking at will help address that directly.

Well, to be perfectly honest, I’ve noticed a few lags even when the Monitor is being run on a 1gbps LAN line… :neutral_face:

are you running pulse?
are you submitting jobs TO the repository?

cb

Yes

Yes

:slight_smile:

Aha.
How is your repository connected? e.g. single gigabit? a trunk? what is the drive? how big are the files you submit? how many people submit? what OS are you using?

I’ll give you a real-world scenario that i encountered:

Imagine you are submitting jobs TO the repository, thus the repo is acting as a file server.
Let’s assume that your max file [or whatever] is 2 gig…which is fairly commonplace for large film/video productions -
Now let’s assume that you have 50 slaves.
For argument’s sake, let’s say only one person is submitting; now the repository is copying 50 * 2 gigabytes to the slaves…
…and your repository is connected to your network over a single gigabit connection.

Gigabit maxes out at 125 MB/s; * 0.8 = 100 MB/s (my functional ceiling for ethernet, and even that’s not realistic).

as above, a single 2GB MAX file = 2048MB
with 50 nodes/slaves

= 100GB for bandwidth needs.

Thus:

100 * 1024 / 125 = 819.2 seconds = 13.6533 minutes
100 * 1024 / 100 = 1024 seconds = 17.0666 minutes

assuming you can sustain 80-100% of your theoretical bandwidth [very unlikely], you are hosing your single gig port on the deadline repository for roughly 15 minutes with one job submission, which means connectivity would tank for anything else. that’s in a perfect world; often performance is far worse!
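the same back-of-envelope arithmetic as a few lines of Python, just to sanity-check the numbers above (nothing Deadline-specific here):

[code]file_mb = 2048       # one 2 GB scene file, in MB
slaves = 50
link_mb_per_s = 100  # ~80% of gigabit's 125 MB/s theoretical maximum

total_mb = file_mb * slaves                  # 102400 MB, i.e. ~100 GB on the wire
print(total_mb / link_mb_per_s / 60.0)       # ~17 minutes of a saturated link[/code]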

NOW - perhaps your data files are only a few hundred megs, or even just a few megs. perhaps you only have 20 slaves - even still, the deadline repository machine is generally overlooked as a bottleneck to production, because people like to submit jobs to it. i understand why: it’s an easy way to get version control and is somewhat unique to the deadline workflow. the number of times asses have been saved by having a renderable asset on the repository is crazy!

anyway - this feature is commonly misused, abused or just plain misunderstood - and we have started to move people away from using it by default. For example, in deadline 5.1 you can point slaves to your existing file on the server, or you can submit to the repository, OR you can submit to a 3rd location where you can share the data across the network. see the attached images for a visual description:

Simple_infrastructure.jpg
Simple_infrastructure_2.jpg
Simple_infrastructure_3_submit to 3rdlocation.jpg

In closing: in terms of network performance, there are a lot of ways to improve the performance of deadline, of interactivity and so forth, but i want to be sure we hit the right nail on the head…so if you could respond with your datapoints [network infrastructure, data size, users, slaves etc] that would help. Essentially, if you let me know what your infrastructure and dataset is, i’ll be happy to talk to you about improvements that can be made if necessary!

I’m not going to assume we don’t have bugs, and i know we have things we want to improve - and your case might not even be applicable to our example if you are speaking about scene data files that are pointers or simple <1MB scripts [a la fusion or nuke]…in which case i’m really interested in hearing about your case!

But for the record, I’ve seen movies with 2-3 gigabyte data files being submitted by 50 artists and rendering on 500+ machines, and deadline never crashed, burned, failed or went kaput. i’ve seen plenty of server and infrastructure slowdowns, though! generating 50+ terabytes of data in a few days can bring a lot of things to their knees :open_mouth: but frankly, with all of the problems these crazy movies bring, and once you get past the voodoo, in every single case i’ve seen on these movies - Deadline wasn’t the problem. it was the tool that revealed problems!

cb

Wow! :open_mouth: thanks for the comprehensive explanation and examples, cb.

However, I might disappoint you a bit. I never, ever submit actual scene files with the jobs. Instead, I only submit (generate) the job files and point to a network/shared location on my storage.

And as for VPN, even submitting Nuke comps that are about 10KB is super slow.

My render infrastructure at the studio is very, very simple:

1 server acting as a DC, Repository and other roles
1 Dell MD1000 DAS with 15 × 500 GB drives. Slow drives, though, only 7200 rpm Western Digitals. Very reliable, though.
1 switch, also Dell
7 render nodes, all connected to the very same switch as the server
Then the switch is connected to a line that goes to the office with a few more PCs.
All is connected to a 1gbps line.

The VPN is tunneled over the internet through the router, which sends it to the switch the server sits on. The VPN is a WAN link of approx. 8 Mbps from my house, but the company end is on a 25 Mbps line. Unlimited, 25 Mbps up/down, 100% reliable (so far).

The server runs Windows Server 2008 Standard x64; it’s a dual Xeon with 48GB of RAM, etc. The DAS is directly connected to it via a SAS cable to a PCIe controller. I haven’t measured the throughput, though. The DAS is a hardware RAID 5 with 15 drives, as I’ve already mentioned. The render nodes and the PCs at the office are very low-end, standard Intel i7 CPUs with about 12GB of RAM each (some have 8GB; some are Sandy Bridges, too).

I have to say that I have never had a problem with scenes we submitted to the farm via Deadline. All went perfectly well, but the Monitor app is just way too slow for my tastes. Also, I am running Pulse all the time. The server’s average usage is about 5%, so… :slight_smile:

Hmm. this gives me some ideas.

So the repository is on a server with direct attached storage? So the storage is being SERVED via your server, which is also serving the repository? So even though you aren’t submitting to the repository, you are still passing data across that same gigabit bus?

Check out the map i made - rough, i know. tell me if this is correct:

Loocas.jpg

  1. Spinning metal isn’t the best for the repository. i’ve seen hundreds of computers served off of the Deadline Repository with 2 * SSD drives striped. if you had the time/energy/money, i would suggest dropping a Sandforce 2 SSD such as the OCZ Vertex 3 Pro or something similar into a machine, building a repository on that, and experimenting. random seek is the key, not transfer speed. a large spinning raid isn’t going to be awesome for this application…i’m assuming you put the repository on the DAS? or is it on a local drive on the Server?

  2. you say the server is doing other things? well, it’s also serving your data if my map is correct! you might get more bang for your buck if you drop the aforementioned SSD into a slave [for testing] and drop a repo on it. turn off the slave during testing of course, but see if/when your performance increases. it might be counterintuitive, but the machine you are spec’ing is overkill for the repository, and not in the right way. i would give up processors and ram for random-access drives. as noted, i’ve seen a repository built from scavenged parts running literally 300 slaves + 75 or so workstations - with 2 * SSDs, and a 7200 rpm drive for archival. i think maybe 4 gigs of ram…

  3. What are you running pulse on? i would recommend running it on something else, not the repository.

  4. Your server at 1 Gig is a bottleneck, even without deadline. i don’t know how big your renders are, or source files - or texture maps, but i have to believe you can saturate that with the drives you have - and i’ll bet there is still some contention. a quick fix is to add a second gig card on your server and trunk it with the other [if your switch allows that]; if it doesn’t, then you can do some work to physically split the network into vlans or some other techniques to load-balance. i can pass that info on to you if you want…however, you should check the network utilisation during production on that server card (see the little sketch after this list). tell me what it looks like…

  5. sanity check - when the monitor is slow [or always!] is the little gold heart icon at the bottom on? just checkin’ the obvious!
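if you want a quick way to sample that utilisation, here’s a rough Python sketch (it assumes the third-party psutil package is installed; the interface name is just a placeholder):

[code]import time
import psutil

NIC = "Local Area Connection"        # replace with the server's actual interface name
LINK_BYTES_PER_SEC = 125000000.0     # gigabit

before = psutil.net_io_counters(pernic=True)[NIC]
time.sleep(10)
after = psutil.net_io_counters(pernic=True)[NIC]

bytes_per_sec = (after.bytes_sent + after.bytes_recv
                 - before.bytes_sent - before.bytes_recv) / 10.0
print("utilisation: %.1f%%" % (100.0 * bytes_per_sec / LINK_BYTES_PER_SEC))[/code]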

chris bond

Hmm… seems I’ll have to do a lot of testing, then. :slight_smile:

Yes, the schema is pretty much exactly my setup.

Ad 1) The repository is on the DAS, along with the production data. The SSD trick might be pretty neat, actually. I won’t be able to afford replacing all 15 500GB drives with SSDs, that’d be insane, but I can buy two or so SSDs for the repository only; that shouldn’t be a problem. Or one rather large, rather fast SSD might be even better (the server is in a 1U case, so there aren’t many available HDD bays).

Ad 2) Yes, the server is the central hub for the studio. It does everything. But, again, its load is not that high. I’m planning on putting up a ton of virtualized OSes on it for specific purposes (database, asset management etc…).

Ad 3) Pulse is running on the same machine where the repository is. I currently have only one server which is always on, I don’t have another machine to run Pulse on.

Ad 4) I was thinking about that, but I’m afraid setting up dual networks in a Windows environment isn’t possible. I haven’t looked at VLANs, though… they might be a solution. The server has two NICs, naturally. I won’t be able to fit any extra NICs in there, though. Space constraints.

Ad 5) Of course. :slight_smile:

The biggest gripe I have with Deadline’s Monitor can be seen in this video:

http://www.duber.cz/camtasia/deadline_monitor_performance/deadline_monitor_performance.html

It’s a real-time real-use scenario.

If you can make Deadline faster in this, and only in this, I’ll be jumping for joy! :slight_smile:

Also, the multithreading thing: the “not responding” state is just annoying. Nothing more, nothing less, just an annoyance.

i would suggest not housing the repository on the server. it’s not about CPU usage, it’s about network interrupts and contention. in order of importance:

  1. definitely do not house the repository on the same drives as your production data
  2. increase your bandwidth [e.g. more bandwidth]
  3. get deadline off your production server entirely

a few questions:

  1. are you looking at the nic?
  2. are you looking at kernel time?
  3. do you know what your interrupt queue for data from the drives is?
  4. a bunch of VMs on one machine is not going to improve any of the above - in fact, with competing machines it may make it worse!

[quote="loocas"]
Ad 3) Pulse is running on the same machine where the repository is. I currently have only one server which is always on, I don’t have another machine to run Pulse on.
[/quote]

Turn it off, and try running it on a slave - even if the slave is still working as a slave. just try it - it’s free!

chris bond

  1. I just bought the Vertex 3 SSD to put the Repository on there, specifically. :slight_smile:
  2. Unfortunately, I’m afraid I can’t do that. :frowning:
  3. Again, this is something I can’t do at the moment.

And to your questions:

  1. Um, I’ll have to monitor that for a while.
  2. Nope. To be honest, where exactly do I find it? (Task Manager?) And what is it good for?
  3. Unfortunately, I don’t.
  4. Yeah, I know that, but I’m a very small company and I need to utilise as much HW as possible. That’s why I have one super-beefed server serving a ton of stuff at the company.

Yeah, I can try that without issues, but I seriously doubt I’ll see any difference. Besides, I mainly have problems with the Monitor’s responsiveness; other than that, Deadline runs perfectly fine.

I’m currently trying to optimize Deadline as much as I can…

and I simply assumed that I could modify the submission scripts this way:

ClientUtils.SubmitDropJob( <stringCollection> )

But, I get: SubmitButton ‘type’ object has no attribute ‘SubmitDropJob’

What is the correct syntax for using this?

Thanks a lot in advance!

Got it!

All I needed was a “-drop” flag. :slight_smile:

So:

[code]arguments.Add( "-drop" )            # submit as a drop job instead of a regular submission
arguments.Add( jobInfoFilename )
arguments.Add( pluginInfoFilename )

exitCode = ClientUtils.ExecuteCommand( arguments )[/code]

seems to work!

Well…

here’re my findings:

  1. I’ve moved the repository completely to its own dedicated SSD drive (Vertex 3 as suggested)
  2. I’ve disabled antivirus and cleaned up some DNS trouble I had on the server
  3. I modified all my submission scripts to include the “drop” flag

The visible speed-up via VPN? Barely any! :cry:

Again, the worst thing is that everything in Deadline happens on the client side and then gets moved over to the server. Why? Why can’t everything, instead, happen entirely on the server side? That’s where the party is, that’s where everything happens and that’s where it all actually matters!

If you modified the Monitor etc. so that everything were just a matter of a command being sent over the network and executed on the actual server, I think you’d gain A LOT by doing that.

The second thing is multithreading, as mentioned already. Just add a few threads to the Monitor, so we don’t get a huge white rectangle every time we’re waiting for data to be fetched from the repo.

I think that’s about it. These two rather simple, yet very substantial, features would make the Monitor the preferred way of submitting jobs on LANs as well as WANs!

A thought that just popped into my mind:

If you could make the client data sync over the network from the server to the client machine, so that all the submission and plugin scripts were up to date but stored locally, you’d also achieve another minor speed-up of the Monitor when submitting stuff or fetching Repository data etc…

The sync would happen on Monitor open (from server to client) and every time anything got updated on the client side (from client to server), i.e. repository options etc…
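Just to illustrate the kind of one-way sync I mean (a rough sketch, not something Deadline does today; the paths are made up):

[code]import os, shutil

REPO_SCRIPTS = r"\\server\DeadlineRepository\scripts"   # hypothetical repository scripts folder
LOCAL_CACHE  = r"C:\DeadlineCache\scripts"              # hypothetical local mirror

# copy any script that is missing locally or older than the repository copy
for root, dirs, files in os.walk(REPO_SCRIPTS):
    rel = os.path.relpath(root, REPO_SCRIPTS)
    dest_dir = os.path.join(LOCAL_CACHE, rel)
    if not os.path.isdir(dest_dir):
        os.makedirs(dest_dir)
    for name in files:
        src = os.path.join(root, name)
        dst = os.path.join(dest_dir, name)
        if not os.path.exists(dst) or os.path.getmtime(src) > os.path.getmtime(dst):
            shutil.copy2(src, dst)[/code]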