Deadline Monitor 5 really really REALLY slow

Deadline 3 was fast, Deadline 4 became somewhat slower, and with Deadline 5 the Monitor response is so slow it makes me want to kick the server in the hope it might speed up.

When clicking a job in Deadline Monitor 5, any job with a medium to large number of tasks (between 200 and 5000) takes anywhere from 30 seconds to over 5 minutes to actually show its tasks.
The repository runs on a server over the LAN (2 × link-aggregated Gigabit Broadcom NetXtreme NICs). Pulse runs on that server as well (the Monitor and clients have found Pulse), and the repository can be accessed either through the network share name (UNC) or as a mounted drive (Z: in this case).
Firewall is disabled everywhere.

Seriously, this is killing our workflow, and I couldn’t really find any similar problems on the forum (most seem to be either firewall related or caused by a wrong Pulse IP/name), so, does anyone have any suggestions?

Thanks in advance,

Sven Neve

Hi Sven,

That’s insane! We just tested a 5000 task job here and it refreshed in 2 seconds. Let’s start by taking a look at the Monitor log. Click on a job that takes a long time to refresh, and then when it’s done, select Help -> Explore Log Folder and find the most recent monitor log. If you can post it here, we’ll take a look to see if there are any errors.

Also, when you upgraded from 4 to 5, did you install a fresh 5.0 repository and reinstall all the clients? I just want to make sure it’s not the case where a 4.0 monitor is trying to connect to a 5.0 repository.

Finally, does it make a difference if the Monitor is connected using the UNC path versus the mapped drive?

Cheers,

  • Ryan

Hi Ryan,

We installed a completely fresh repository and clients (uninstalled the clients and reinstalled with version 5).

As for connecting to the repository via mapped drive or UNC: hard to tell. Via the mapped drive some jobs seem to load faster, but others take ages (a 14-task job taking a minute? That can’t be right). Does Deadline Monitor do some sort of caching behind the scenes of jobs it has previously loaded?

I’ll also check the network for packet collisions via Wireshark again (hard to do right now as rendering is in progress, though Broadcom BASC and the D-Link router diagnostics report barely any packet collisions).

Most recent log (yes, it’s that small):

2011-07-08 10:06:23: BEGIN - HOUSE83\Sven
2011-07-08 10:06:23: Start-up
2011-07-08 10:06:23: Deadline Monitor 5.0 [v5.0.0.44528 R]
2011-07-08 10:06:23: 2011-07-08 10:06:23
2011-07-08 10:06:30: Attempting to contact Deadline Pulse (192.168.123.100)...
2011-07-08 10:06:30: Requesting jobs update from Deadline Pulse...
2011-07-08 10:06:31: Update received from Deadline Pulse.
2011-07-08 10:06:31: Received Update for 14 jobs.
2011-07-08 10:06:31: Attempting to contact Deadline Pulse (192.168.123.100)...
2011-07-08 10:06:31: Requesting slaves update from Deadline Pulse...
2011-07-08 10:06:31: Update received from Deadline Pulse.
2011-07-08 10:06:31: Received Update for 9 slaves.
2011-07-08 10:11:30: Enqueing: &Explore Log Folder
2011-07-08 10:11:30: Dequeued: &Explore Log Folder

I just noticed something else: when I click a job that takes a long, long time, I can see it building the task window’s scroll bars after a couple of seconds, so the task window already knows how many tasks it is going to show; however, the content itself takes a very long time to appear.

Hope this observation can take us in the right direction.

Sven

Hi Sven,

That’s definitely useful info. A couple of questions:

  1. Are your graphic drivers up to date?
  2. Are your Windows installs up to date? These updates should also include updated .NET installs. Maybe you have an older .NET install and that’s causing the issue.

Also, does this happen on ALL of your machines, including the slave machines?

Cheers,

  • Ryan

Drivers are reasonably up to date (not running the absolute latest versions, but that’s pretty much impossible given bugs and OpenGL weirdness; we run what works).
.NET 3.5 SP1 and updates, and .NET 4 and updates are installed.

I just purged all log reports on the 14-task job that takes ages to load, and now it’s instant (there was an insane number of reports on that one, over 10,000!). Come to think of it, for some reason everything is more responsive now. Do job reports have such a big impact on showing task lists?

Very odd. I’m going to keep an eye on this, as it seems to be a combination of multiple factors (server overload, logs/reports, network oddness, etc.).

Interesting. We’ll have to look into that and get back to you. My initial thought is that the log count shouldn’t affect load speed (and if it does, we should try to avoid that), but I’ll have to check the code to make sure. Just to confirm, it was log reports, right (i.e. not error reports)?
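As an aside, a toy sketch (not Deadline’s actual code, data shapes assumed for illustration) of why a per-job report pile could slow down building a task list: if each task row re-scanned the full report set, the work would grow as tasks × reports, whereas bucketing the reports once keeps it linear.

```python
# Illustrative only -- NOT Deadline's implementation. Shows how a naive
# per-task scan of 10,000 reports differs from indexing them up front.

def build_task_list_naive(tasks, reports):
    """O(tasks * reports): re-scan every report for each task row."""
    rows = []
    for task in tasks:
        count = sum(1 for r in reports if r["task"] == task)
        rows.append((task, count))
    return rows

def build_task_list_indexed(tasks, reports):
    """O(tasks + reports): bucket the reports by task id once."""
    counts = {}
    for r in reports:
        counts[r["task"]] = counts.get(r["task"], 0) + 1
    return [(task, counts.get(task, 0)) for task in tasks]

# 14 tasks, 10,000 reports -- roughly the numbers from the purged job.
tasks = list(range(14))
reports = [{"task": i % 14} for i in range(10_000)]
assert build_task_list_naive(tasks, reports) == build_task_list_indexed(tasks, reports)
```

The two functions return identical rows; only the amount of scanning differs, which is the kind of difference that only shows up once a job has accumulated thousands of reports.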

Cheers,

  • Ryan

Hi Ryan,

Sorry to get back to this so late (a non-software-related deadline and all). I checked with a job that has about 500 tasks, and it loaded really, really slowly. It turned out to have around 4000 error reports and 3000 requeue reports; after purging those, the job loads fast again. I'm not sure whether that’s partly because the Monitor had already cached this job’s task list somehow (it stays fast even after closing and reopening the Monitor).

So some of the slowness seems job-report related. However (oh dear, I hear you think), I have a job with 5800 tasks and NO reports that still takes too long. Not minutes, mind you, but long enough for Windows to put the Monitor into a “Not Responding” state (this time the window’s scroll bars don’t appear beforehand; they don’t show up until it actually shows the tasks).

So, basically, I still have no idea why it shows this mixed behaviour.

Sven

Yeah, 5800 tasks can take a little while to load, although it only takes about 2-3 seconds in our tests. Maybe the slowness is due to heavy network traffic?

There are a couple of things you can try to help your situation. The first is to enable Failure Detection under the Job properties in the Repository Options:
thinkboxsoftware.com/deadlin … ions/#Jobs

You could set the limit to 100 or 200 errors before the job is marked as Failed. Not only will this reduce the number of reports stored in your repository, it will also cut down on wasted CPU cycles, because slaves won’t spend so much time erroring out on the same job again and again.
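To make the policy concrete, here is a minimal sketch of failure detection as described above (a toy model, not Deadline’s implementation; field names are invented for illustration): once a job accumulates its error limit, it flips to Failed and slaves stop picking up its tasks, so the error reports stop piling up.

```python
# Toy model of the failure-detection policy -- NOT Deadline's real code.
# A job dict stands in for a real job record; names are hypothetical.

def record_error(job, error_limit=100):
    """Count an error against the job; mark it Failed at the limit."""
    job["errors"] = job.get("errors", 0) + 1
    if job["errors"] >= error_limit and job["status"] == "Active":
        job["status"] = "Failed"
    return job

job = {"name": "example_job", "status": "Active", "errors": 0}
for _ in range(100):
    record_error(job, error_limit=100)
# After 100 errors the job is Failed, capping the report count at the limit
# instead of letting it grow past 4000 like the job in this thread.
```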

For the 5800-task job, what type of job is it (Max, Nuke, etc.), and how fast do the frames render? If the frames only take a few seconds to render, you could bump up the number of frames per task. Not only will this cut down on the task count, it will also likely result in less overhead when rendering the job (this depends on the type of job, though).
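The arithmetic behind that suggestion (illustrative; the chunk sizes here are just examples, not recommendations for this specific job):

```python
# How frames-per-task affects the task count for a 5800-frame job.
import math

def task_count(frames, frames_per_task):
    """Number of tasks needed to cover `frames` at the given chunk size."""
    return math.ceil(frames / frames_per_task)

print(task_count(5800, 1))   # 5800 tasks -- one frame each
print(task_count(5800, 10))  # 580 tasks
print(task_count(5800, 25))  # 232 tasks
```

At 10 or 25 frames per task the Monitor has an order of magnitude fewer rows to build, and each slave pays its per-task startup cost far less often.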

Cheers,

  • Ryan