Deadline Latency

Hey,

I’m experiencing some latency issues on the majority of my Slaves as well as the Deadline Monitors. I’ll provide a brief breakdown of what our setup looks like and the specific error below:

• Mac OS Server (Running through a Mac Pro)
• 18 PC Render Nodes
• 20 Mac Pro Work Computers (become slaves when unused)
• Version of Deadline: 5.0
• Running Pulse
• Rendering to a NAS, then copying back to Server

We are running Pulse on one of the Slaves, and it’s helped somewhat, but for whatever reason, when I start up a slave, especially on one of the 20 work machines we use at night, it takes anywhere from 5 to 30 minutes for the slave to connect to the repository. After that, it takes at least another ten minutes for the slave to pick up and start rendering a job. These times vary from computer to computer: a few pick up almost instantly, while others will only connect after a few failed attempts and restarting the Slave. We also get Slaves that hang at 100% on a completed task and sit that way until we manually tell the slave to move to the next task or shut the slave down.

I’ve seen that a few other users have had latency issues, but I don’t think any of them were running as many machines as we are. Can you help?
Thanks!
~ Dave

Hi Dave,

I just have a few initial questions that should give us a better understanding of your situation and help us figure out what’s going on:

  1. Are the issues with the Slave only occurring on the Mac workstations, or are the PC nodes affected as well? This goes for the latency as well as the “stuck at 100%” problem. This should help us narrow down whether the problem is specific to the OS or more general.

  2. Do you only see this problem at night when the workstations join in, or is it a consistent problem throughout the day?

  3. When things start to slow down, what does your network traffic look like? We’ve seen cases where the type of rendering occurring on the farm would bring the network to a crawl (e.g., huge scene files with huge textures), and this of course affects Deadline’s overall performance.

  4. When a Slave gets stuck at 100%, is this with a specific application (e.g., Maya or Nuke), or do you see this across all job types? The next time it happens, go to the slave machine and, from the slave GUI, select Help -> Explore Log Folder. Find the slave log from the current session and post it. Make sure to do this before manually telling the slave to move on so we can see where it gets stuck. It could be that the application Deadline is rendering with has hung, and if it’s not printing out an error message, usually all Deadline can do is assume it’s still happily rendering.

  5. How many jobs do you have in the farm? Make sure to include the archived job count in this total (if you have any archived jobs).

This should give us a good starting point.

Thanks!

- Ryan

Hi RRussell,

Thanks for the quick response. Responses below …

  1. Yes, PC and Mac Nodes both take a long time to connect to the repository as well as scan the repository. This has only become a problem in the last two weeks, and does fluctuate slightly depending on how many machines are trying to connect at the same time.

  2. This is a consistent problem, although it becomes compounded when traffic gets heavier.

  3. I’m working on getting you a network activity reading. I’ll respond as soon as I get one.

  4. I’ll grab this the next time a slave gets stuck at 100% (although this problem is less frequent than the others)

  5. Jobs: 27 Queued, 1 Active, 23 Suspended, 210 Completed. We don’t have any archived jobs, although we have cleaned out another hundred jobs or so after they’ve sat labeled as ‘Completed’ for more than two weeks

Thanks again for the help. I’ve been somewhat thrown into Deadline, so sorry if my responses aren’t as eloquent as needed. Hopefully this is the kind of info you’re looking for. Go Jets, right?

Go Jets indeed :)

Something I’d check is the load on whatever server you’re hosting your repository on. If it’s the NAS, that’s going to be tricky.

The load is a measure of how many processes are running or waiting to run on the CPU at the same time. If the load is really high on that machine (which I’m guessing it is), it could be that the service/daemon serving the repository is waiting a long time to run. We have the same issue here with one of our VM servers from time to time.

To find the load, open a terminal or console and type ‘uptime’. That’ll say how long the system’s been up as well as its load averages. If the load is much higher than the core count of the machine (say 36 on a 12-core Mac Pro), then you have a load issue.

If it is a load issue, you’ll need to figure out what’s causing it (a slow hard drive, too many applications running) and diagnose from there. That’s getting a bit ahead of ourselves though, so for now, the output of uptime should be all we need.

Example:

root@vm-01:~# uptime
 11:41:49 up 50 days, 20:37, 3 users, load average: 0.52, 0.42, 0.44
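If you'd rather script the comparison than eyeball it, something like the following could work. This is only a rough sketch: it assumes a Unix-like machine (`os.getloadavg()` is not available on Windows), and the `load_ratio` and `is_overloaded` helpers, along with the 2.0 threshold, are made up for illustration, since "much higher than the core count" isn't a hard number.

```python
import os


def load_ratio(load_avg, cores):
    """Return the load average divided by the number of cores."""
    if cores <= 0:
        raise ValueError("cores must be a positive number")
    return load_avg / cores


def is_overloaded(load_avg, cores, threshold=2.0):
    """True when the load is well above the core count.

    The 2.0 threshold is an arbitrary assumption, not a Deadline setting.
    """
    return load_ratio(load_avg, cores) > threshold


if __name__ == "__main__":
    # os.getloadavg() returns the 1, 5 and 15 minute load averages (Unix only).
    one_min, five_min, fifteen_min = os.getloadavg()
    cores = os.cpu_count() or 1  # cpu_count() can return None
    print(f"load {fifteen_min:.2f} on {cores} cores, "
          f"overloaded: {is_overloaded(fifteen_min, cores)}")
```

With the example figures above, a load of 36 on 12 cores is a ratio of 3.0 and gets flagged, while the 0.44 from the sample output would not be.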

Hey,

Here’s the uptime output from our Server that everything is running through: 12:51 up 21 days, 2:41, 2 users, load averages: 2.53 2.69 2.45
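Note that the two uptime outputs in this thread are formatted slightly differently: the Linux one says "load average:" with commas between the values, the Mac one says "load averages:" with spaces. If you end up collecting these from several machines, a small parser can pull the numbers out of either form. This is a sketch only; `parse_uptime_load` is a hypothetical helper name, and it assumes the values use a dot as the decimal separator.

```python
import re

# Matches both "load average: 0.52, 0.42, 0.44" (Linux)
# and "load averages: 2.53 2.69 2.45" (OS X).
_LOAD_RE = re.compile(r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)")


def parse_uptime_load(line):
    """Return the (1, 5, 15) minute load averages from a line of uptime output."""
    match = _LOAD_RE.search(line)
    if match is None:
        raise ValueError("no load averages found in: " + line)
    return tuple(float(value) for value in match.groups())


if __name__ == "__main__":
    mac_line = "12:51 up 21 days, 2:41, 2 users, load averages: 2.53 2.69 2.45"
    print(parse_uptime_load(mac_line))
```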

We used Wireshark to generate the following screen grabs of our network activity. I’m not entirely certain how to determine Load sizes, but maybe there’s some info here that will jump out for you. Would you recommend a different app to get the info about our network that you’re specifically looking for?

loyalkaspar.com/lk/Deadline_Trou … 851%29.png
loyalkaspar.com/lk/Deadline_Trou … 656%29.png

Let me know what other information I can provide to help you troubleshoot.
Best,
~ Dave

Well, the load seems normal for a regular server. I’d keep monitoring the CPU load throughout the day, especially when nodes seem to be taking the longest. Come to think of it, also check the load on the machine running Pulse. If the renders are causing that machine to bog down, Pulse might not be able to service requests as well, though the machine would need a seriously high load for that.
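For watching the load over the day, a small sampling script saves you from running uptime by hand. The following is a minimal sketch, assuming a Unix-like machine; `sample_load` and `peak` are hypothetical helper names, and the sample count and interval are placeholders to adjust.

```python
import os
import time
from datetime import datetime


def sample_load(samples, interval_seconds, read_load=os.getloadavg, sleep=time.sleep):
    """Collect (timestamp, 1-minute load) pairs, pausing between samples."""
    readings = []
    for i in range(samples):
        one_min = read_load()[0]  # first value is the 1-minute average
        readings.append((datetime.now(), one_min))
        if i < samples - 1:
            sleep(interval_seconds)
    return readings


def peak(readings):
    """Return the reading with the highest load."""
    return max(readings, key=lambda reading: reading[1])


if __name__ == "__main__":
    # Three quick samples as a demo; for real monitoring, use a longer
    # interval (e.g. 60 seconds) and many more samples.
    for stamp, load in sample_load(samples=3, interval_seconds=1):
        print(f"{stamp:%H:%M:%S}  load {load:.2f}")
```

Running it on both the repository server and the Pulse machine, then comparing the peaks against the times when slaves were slowest to connect, should show whether load lines up with the delays.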

Dumb question, but it’s good for clarification: is the OS X server running OS X Server, or one of the desktop variants of OS X? You might be hitting the artificial connection limit if it’s not a server-class OS. Both Windows and OS X suffer from that since it’s marketed as a security feature.

As for the Wireshark images, are those counts per second? It would be good to know the rate at which data is coming in versus going out. That said, I don’t think that should cause the huge delays you’re seeing either.

This whole thing feels like that connection limit problem. Just a gut feeling.

I’d try moving Pulse to some machine you know to be mostly idle (maybe even the server, since the load isn’t that bad), and preferably to one with a server-based OS.

Hey,

We’re definitely running OS X Server, although it is Lion Server. I’m not sure if that’s why it’s super buggy … cursed Lion.

I also wonder if it might be our NAS. I know that if we have a lot of machines accessing it, that could slow it down considerably. When rendering to a separate repository, are there specific makes of NAS that might work better with Deadline and 20+ machines running their slaves?

As for the Wireshark grabs, those are totals over the course of about 45 minutes, not per second. If you’re looking for specific specs, I can run it again and try to isolate exactly what you’d be looking for.
Just let me know and I’ll do a little digging.

Thanks for all the help, guys. Pulse on the server is what I’m going to try next, and see if that helps.
~ Dave