deadline db machine network usage

LaszloSebo · November 27, 2013, 1:13am

It’s kinda scary the amount of traffic going through the deadline database machine. Not sure if it’s pulse or mongo generating the traffic, but it is hovering around 400Mb-1Gb.
Right now, it’s at:

rates: 1.37Gb 655Mb 349Mb

Is that due to the constant updating?

LaszloSebo · November 27, 2013, 1:39am

We have individual machines peaking at pretty large usage:

deadline.scanlinevfxla.com => lapro2059.scanlinevfxla.com 108Mb 78.4Mb 26.9Mb

rrussell · November 27, 2013, 2:45pm

That’s probably from when the slave was doing some housecleaning. Was this before or after upgrading to beta 12? If it was from before, has it improved since upgrading?

LaszloSebo · November 27, 2013, 5:22pm

We are about 60% rolled over to beta12, it has not improved.

Most of the traffic seems to be from workstations, probably monitor updates? Individual machines can go up to as high as 200-300Mb/s.

I also see surging traffic from rendernodes that are in the process of dequeuing something, but the majority of the traffic is from workstations.

rrussell · November 27, 2013, 6:57pm

That’s good to know that it’s workstations, and it’s more than likely the monitor updates (the launcher wouldn’t be pulling that much data over). Currently, the Monitor maintains 2 open connections to mongo, so I wonder if it would make sense to reduce that to 1. It won’t change the amount of data being pulled over by much, but it would throttle it. We’ll run some tests here to see what impact that might have.

Just to confirm, is the 200-300Mb/s peak consistent, or could it be that it peaks when the Monitor is first started, but then goes down once the Monitor has all of the jobs loaded?

LaszloSebo · November 27, 2013, 7:06pm

Hard to tell, I’m just looking at an iftop cross section. But the same machine numbers keep popping up all the time.

Although, I think it’s consistent (not constant, but periodic), because one artist’s machine that I recognized was doing 200+Mb/s while she was not yet in the office. So the Monitor must have been left open on her machine.

rrussell · November 27, 2013, 7:15pm

If you open the Task Manager on the workstation, what do you see for network utilization? With our simulator updating 1000 jobs and 1000 slaves every interval, we’re only seeing an average of 1-2% of a 100 Mbps connection, with peaks of 6-10%. The initial startup was 50-60%, but that is to be expected when pulling down the initial data.

LaszloSebo · November 27, 2013, 8:23pm

On a monitor already running, i see the cpu usage hover around 100% on a single core, dropping to 0 for a couple of seconds every now and then. Network usage is low, around 0-1% of a 1G connection.

Fresh start of the monitor, for about 30 seconds there is hardly any activity, then network usage goes up to between 5-20% (so between 50-200Mb). The machine instantly pops up on iftop on the mongo machine.
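
For anyone following the numbers, the percentage-to-throughput conversion used above is just link speed times utilization. A minimal sketch (the helper name `utilization_to_mbps` is made up for illustration):

```python
def utilization_to_mbps(percent, link_mbps):
    """Convert a link utilization percentage into megabits per second."""
    return link_mbps * percent / 100.0


GIGABIT = 1000        # 1G workstation link, in Mb/s
FAST_ETHERNET = 100   # the 100 Mbps link used with the simulator, in Mb/s

# Fresh Monitor start on a 1G link: 5-20% utilization
startup_low = utilization_to_mbps(5, GIGABIT)
startup_high = utilization_to_mbps(20, GIGABIT)

# Simulator steady state on a 100 Mbps link: 1-2% utilization
steady_low = utilization_to_mbps(1, FAST_ETHERNET)
steady_high = utilization_to_mbps(2, FAST_ETHERNET)

print(f"startup: {startup_low:.0f}-{startup_high:.0f} Mb/s")
print(f"steady:  {steady_low:.0f}-{steady_high:.0f} Mb/s")
```

So the 5-20% seen on a 1G link is far more absolute traffic than the simulator's 1-2% of a 100 Mbps link.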

rrussell · November 27, 2013, 9:40pm

On the machine that is using 100% of one core, do you have dynamic sorting enabled on the job list and/or the slave list? We’ve just discovered that if dynamic sorting is on, it can impact the Monitor’s performance and cause brief hangs. With it off, we tested it with our simulator running, and it didn’t hang up once. With the dynamic sorting off, I found the cpu usage dropped as well.

So we’re actually going to start disabling dynamic sorting on the job and slave lists by default.

LaszloSebo · November 27, 2013, 10:09pm

No, dynamic sorting is off. The cpu usage goes up every couple of seconds, for a couple of seconds. With the new versions the gui does not hang as it used to, it’s just very slow and laggy. Not too convenient to work with.

rrussell · November 28, 2013, 1:33pm

Can you save your Monitor layout and send it to me? Maybe there are things other than dynamic sorting that can have an impact on performance. We could drop your layout here and see if we notice any lagging as well.

Also, what are your current numbers for:

Total jobs in queue.
Total completed jobs.
Total queued jobs.
Total rendering jobs.
Total slaves.
Total rendering slaves.

We can adjust our simulator as necessary to make sure we’re putting the same load on the Monitor during testing.

Thanks!

Ryan

LaszloSebo · November 28, 2013, 8:15pm

Emailed you the information + layout!

rrussell · November 28, 2013, 8:36pm

Thanks Laszlo! When I applied the layout on our end, I noticed that every list had Dynamic Sorting enabled, and things were stuttering a bit for me until I turned off Dynamic Sorting on all lists. Then things ran without any stuttering at all. I’m working from home and testing things remotely today, and with our simulator running we have ~21,000 jobs (~1000 running, ~19,000 idle, ~1000 complete). We also have ~1000 slaves rendering.

Can you double check on your end if Dynamic Sorting is enabled?

Thanks!

Ryan

LaszloSebo · November 28, 2013, 10:13pm

Oh, darn. Yeah dynamic sorting is on, my brain short circuited, and I thought you were asking if the job candidate filter is on.
With dynamic sorting off on all panels (even ones not visible on other tabs), things are much better.

I turned dynamic sorting off now for some of the people who have given me the most feedback about speed issues, and will see what they think.

Is that something that you guys can improve? Or is it just what it is… PyQt and all?

rrussell · November 29, 2013, 2:13pm

All the sorting is currently handled by Qt under the hood, so I’m not sure if there is much we can do about it. I was thinking if possible, maybe we could prevent sorting from happening if data in the column being sorted on hasn’t changed, but there is still the case when new items are added to the list where a sort would have to be performed.
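
The idea of skipping the sort when the sorted column hasn't changed could look roughly like the sketch below. This is plain Python rather than Qt, and the `SortedRows` class and its fields are hypothetical, only meant to show the decision logic: re-sort when a row is new or its sort-column value changed, otherwise leave the order alone.

```python
class SortedRows:
    """Keeps row ids ordered by one column, re-sorting only when needed."""

    def __init__(self, sort_column):
        self.sort_column = sort_column
        self.rows = {}        # row id -> dict of column values
        self.order = []       # row ids in sorted order
        self.sort_count = 0   # number of real sorts performed

    def _resort(self):
        self.order = sorted(
            self.rows, key=lambda row_id: self.rows[row_id][self.sort_column]
        )
        self.sort_count += 1

    def update(self, row_id, values):
        """Merge new column values into a row; new rows must include the sort column."""
        old = self.rows.get(row_id)
        merged = dict(old or {})
        merged.update(values)
        self.rows[row_id] = merged

        is_new = old is None
        key_changed = (not is_new) and old[self.sort_column] != merged[self.sort_column]
        if is_new or key_changed:
            self._resort()


rows = SortedRows("priority")
rows.update("job_a", {"priority": 50, "progress": 0})   # new row: sort
rows.update("job_b", {"priority": 10, "progress": 0})   # new row: sort
rows.update("job_a", {"progress": 40})                  # sort column untouched: no sort
rows.update("job_b", {"priority": 90})                  # sort column changed: sort
print(rows.order, rows.sort_count)
```

A real implementation would probably insert the changed row at its new position instead of doing a full sort, but the skip condition would be the same.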

LaszloSebo · November 29, 2013, 5:52pm

Might be interesting to do some timestamping to see where the majority of the time is spent. If it’s the comparison functions, maybe it could be improved by providing some sorting key that’s a simple number value, as opposed to string operations / date conversions etc.
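
A rough way to try this kind of timestamping outside the Monitor is sketched below. It is a standalone Python experiment, not the Monitor's actual code path: it sorts the same made-up submit dates once with a key that parses the date string on every sort, and once with a float computed ahead of time. The data, the `timed` helper, and the date format are all assumptions for illustration.

```python
import time
from datetime import datetime, timedelta

FORMAT = "%Y-%m-%d %H:%M:%S"
base = datetime(2013, 11, 27, 1, 13)

# Made-up submit times, deliberately out of order
stamps = [
    (base + timedelta(minutes=(7 * i) % 10007)).strftime(FORMAT)
    for i in range(5000)
]


def timed(label, func):
    """Run func once, print how long it took, and return (result, seconds)."""
    start = time.perf_counter()
    result = func()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed * 1000:.1f} ms")
    return result, elapsed


def parse(text):
    return datetime.strptime(text, FORMAT)


# Variant 1: date conversion happens inside the sort key, on every sort
by_parsing, t_parse = timed("parse on every sort", lambda: sorted(stamps, key=parse))

# Variant 2: convert once up front, then sort on plain floats
numeric_keys = [(parse(s) - base).total_seconds() for s in stamps]
pairs = list(zip(numeric_keys, stamps))
by_number, t_number = timed(
    "precomputed float key", lambda: [s for _, s in sorted(pairs)]
)
```

Both variants produce the same order; only the time spent differs, and the gap grows with how often the list is re-sorted.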

rrussell · November 29, 2013, 6:15pm

That’s typically what we do when sorting isn’t actually based on strings. For example, job states are represented by integers when sorting, and submit date/time is represented by floats (milliseconds). You can basically assign display data and sort data to each cell when building up a row, and that’s what we do.
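
To make the display data / sort data split concrete, here is a minimal plain-Python sketch. The `Cell` class, the `STATE_ORDER` mapping, and the helper names are hypothetical. In Qt this kind of split is commonly done with separate item data roles and a sort role on the model, but the thread doesn't say exactly how the Monitor wires it up.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Cell:
    display: str      # what the user sees in the list
    sort_key: float   # what the sort actually compares


# Hypothetical integer ordering for job states
STATE_ORDER = {"Rendering": 0, "Queued": 1, "Completed": 2}
EPOCH = datetime(1970, 1, 1)


def state_cell(state):
    return Cell(display=state, sort_key=STATE_ORDER[state])


def submit_cell(when):
    # Sort on milliseconds since the epoch, display a formatted string
    millis = (when - EPOCH).total_seconds() * 1000.0
    return Cell(display=when.strftime("%b %d, %Y %H:%M"), sort_key=millis)


rows = [
    {"name": "job_a", "state": state_cell("Completed"),
     "submitted": submit_cell(datetime(2013, 11, 28, 20, 15))},
    {"name": "job_b", "state": state_cell("Rendering"),
     "submitted": submit_cell(datetime(2013, 11, 27, 1, 13))},
    {"name": "job_c", "state": state_cell("Queued"),
     "submitted": submit_cell(datetime(2013, 11, 29, 14, 13))},
]

by_state = [r["name"] for r in sorted(rows, key=lambda r: r["state"].sort_key)]
by_submit = [r["name"] for r in sorted(rows, key=lambda r: r["submitted"].sort_key)]
print(by_state)
print(by_submit)
```

The comparisons only ever touch numbers, so no string parsing or date conversion happens during the sort itself.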