monitor start time

LaszloSebo · November 25, 2013, 10:44pm

Hi there,

Ryan you asked us to test the monitor startup time a while back. Now on beta11, we get:

~ 90 seconds till the GUI appears
~ +110 seconds till all the jobs are refreshed

So about 3 and a half minutes total to get an up to date monitor up

cheers,
l

LaszloSebo · November 25, 2013, 11:19pm

Also tried from my machine which is remote to the db (vancouver box, LA repo):

~ 6min 8seconds till the GUI appears
~ +2mins 32 seconds till all the jobs are refreshed (job counter stops growing)

8mins 40 seconds total for a fresh start to show the current state.

Both tests were done with the status quickfilter turned off, so that the jobs panel list is empty.

rrussell · November 26, 2013, 2:34pm

Hmm, it shouldn’t be taking that long for the UI to be displayed. Does the splash screen show up right away? If so, is there a particular startup message it gets stuck in for a while? Regardless, we’ll add some start up stats to the monitor log for beta 12 to try and pinpoint where the slowdown is for you guys.

Cheers,

Ryan

LaszloSebo · November 26, 2013, 4:55pm

The splash screen does show, but its a bit random. If i start deadline and move to another computer, the splash displays after about 40 seconds or so (guesstimate). But if i keep working on the same box, i usually dont see the splash image, or it hides behind another window and becomes constant white.

rrussell · November 26, 2013, 7:18pm

Weird… anyways, we’ll get those startup times in there to help determine the problem. Turns out this won’t make it into beta 12, but it should be done for beta 13.

rrussell · November 28, 2013, 3:24pm

For the Monitor startup speed, it’s looking like the following things could cause the slowness:

plugin icon synchronization
monitor script scanning

If you were to remove the plugin folders from the repository that you don’t use (\your\repository\plugins), as well as the Monitor script files that you don’t use (\your\repository\scripts), that will probably help startup speeds.

Cheers,

Ryan

LaszloSebo · November 28, 2013, 7:53pm

I did a quick session with process monitor , and have found the following.

The first couple of seconds are spent initializing registry entries, loading language locales, etc:

11/28/2013 11:20:57.6598746 AM <-- from
11/28/2013 11:20:58.9490299 AM <-- to

The it initializes the base python environment
11/28/2013 11:20:58.9490539 AM <-- from
11/28/2013 11:20:59.2099956 AM <-- to

Then it starts doing a weird thing where every dll it tries to load goes through all our site folders:

\S2\exchange\software\managed\pythonScripts\sitecustomize\System.Collections.Generic.Queue.dll
\S2\exchange\software\managed\pythonScripts\sitecustomize\System.Collections.Generic.Queue.exe
\S2\exchange\software\managed\pythonScripts\site-packages\System.Collections.Generic.Queue.dll
\S2\exchange\software\managed\pythonScripts\site-packages\System.Collections.Generic.Queue.exe
\S2\exchange\software\managed\pythonScripts\site-packages_2.6_64\System.Collections.Generic.Queue.dll
\S2\exchange\software\managed\pythonScripts\site-packages_2.6_64\System.Collections.Generic.Queue.exe
\S2\exchange\software\managed\pythonScripts\site-packages_configured\sip\windows_amd64_msc8_python26__4_8\System.Collections.Generic.Queue.dll
\S2\exchange\software\managed\pythonScripts\site-packages_configured\sip\windows_amd64_msc8_python26__4_8\System.Collections.Generic.Queue.exe
\S2\exchange\software\managed\pythonScripts\site-packages_configured\PyQt4\windows_amd64_msc8_python26__pyqt4_5_qt4_5_0\System.Collections.Generic.Queue.dll
\S2\exchange\software\managed\pythonScripts\site-packages_configured\PyQt4\windows_amd64_msc8_python26__pyqt4_5_qt4_5_0\System.Collections.Generic.Queue.exe
\S2\exchange\software\managed\pythonScripts\site-packages_configured\scl\windows_amd64_msc9_python26__qt4_7_1_internal\System.Collections.Generic.Queue.dll
\S2\exchange\software\managed\pythonScripts\site-packages_configured\scl\windows_amd64_msc9_python26__qt4_7_1_internal\System.Collections.Generic.Queue.exe

It does this for a lot of files:

System.Collections.Generic.SortedList.dll
System.Collections.Generic.SortedListKeyEnumerator.dll
System.Collections.Generic.SortedListValueEnumerator.dll
System.Collections.Generic.Stack.dll
etc

11/28/2013 11:20:59.1609886 AM <-- from
11/28/2013 11:23:22.7609421 AM <-- to

Then does more misc stuff, loading pyqt plugins, read write tests, loads a LOT of registry entries:

11/28/2013 11:23:22.7701618 AM <-- from
11/28/2013 11:23:24.8766566 AM <-- to

Then it queries all the repo folders (note that this is from a remote machine, so timestamps for this might be slower then usual)
11/28/2013 11:23:31.6968365 AM \inferno2.scanlinevfxla.com\deadline\repository6\custom\scripts\General
…
11/28/2013 11:23:54.4287151 AM \inferno2.scanlinevfxla.com\deadline\repository6\scripts\Limits

There are constant queries to the deadline server on port 27017 as well, some taking up to a second.

A big part of the tests it does is looking for plugins in the custom folder, it takes about 0.8 seconds each. For example for a single file:
Checking for \inferno2.scanlinevfxla.com\deadline\repository6\custom\plugins\MetaFuze
11/28/2013 11:24:17.0220626 AM <-- from
11/28/2013 11:24:17.7477641 AM <-- to
I guess this could be optimized by just caching what folders exist in that folder once, instead of checking for each plugin.

Then when the splash screen is stuck saying “Initializing main window”, all other traffic stops and i just see LOADS of small requests going to mongo, most packets are 1460 byte large. This takes a loooong time. It seems like its either loading all the slaves data, or all the jobs, not sure. The weird thing is, that i didn’t think the monitor loaded any jobs before its gui is displayed, since when the gui pops up, we see the number increment from 0-> to our actual job count slowly, by increments of about 200 or so.

This mass load took about 6 minutes:
11/28/2013 11:37:01.3586355 AM <-- from
11/28/2013 11:43:06.9335125 AM <-- to

Sidenote, the icon load time was (it loaded from the local C: drive, and i dont see any synchronization from the repository folders … weird):
11/28/2013 11:37:00.5720014 AM <-- from
11/28/2013 11:37:00.6811412 AM <-- to

After the monitor comes up, it still takes about a minute for all the jobs to appear, with constant traffic to mongo.

I have attached the full log as a csv file.
Logfile_monitor_launch.zip (1.99 MB)

rrussell · November 28, 2013, 8:40pm

Everything but the “initializing main window” part seems normal. I think Deadline is scanning your site folder for dlls because that folder is in your PATH environment variables, and that’s what windows applications fall back to if they can’t find certain dlls.

We’ll have to do some profiling on our end, because I can’t see what the Monitor could be doing that “initializing main window” phase that would cause this behavior. It’s definitely not loading all jobs and/or slaves (at least, it shouldn’t be). On my machine, that part is taking less than one second (based on the logging we added in preparation for beta 13).

rrussell · November 28, 2013, 10:09pm

After some digging, it looks like the initialization code would scan over the monitor scripts a second time, and then after the UI is initially displayed, it would scan over them again. This will be fixed in beta 13, so it will be interesting to see if this makes a difference during startup.

Cheers,

Ryan

LaszloSebo · January 14, 2014, 12:27am

Here is a recent startup log:

2014-01-13 16:19:30: BEGIN - VCPRO1014\ScanlineVfx_user
2014-01-13 16:19:30: Deadline Monitor 6.1 [v6.1.0.54178 R]
2014-01-13 16:19:31: Time to initialize: 577.000 ms
2014-01-13 16:19:50: Time to connect to Repository: 5.273 s
2014-01-13 16:19:50: Time to check user account: 63.000 ms
2014-01-13 16:19:50: Time to purge old logs: 15.000 ms
2014-01-13 16:19:50: Time to synchronize plugin icons: 297.000 ms
2014-01-13 16:24:15: Time to initialize main window: 4.405 m
2014-01-13 16:24:15: Main Window shown
2014-01-13 16:24:15: Time to show main window: 187.000 ms

What happens in the “initialize main window” phase? The ram usage also went up to 2 gigs while it was hanging there

rrussell · January 14, 2014, 2:32pm

This is only on your machine, right? I didn’t see this behavior when remoting into a couple of your other machines over the past few weeks. If I recall, your machine connects remotely to the farm, but I’ve done remote tests to our cloud-based repository here and I don’t see this behavior either.

I wonder if it’s something specific about your machine. Can you try uninstalling the Deadline client on your machine and then reinstalling it using the Client installer? Do you have access to another machine you can use to connect remotely to the farm and see if the Monitor startup is as bad?

Cheers,
Ryan

LaszloSebo · January 14, 2014, 7:59pm

We had around 23k jobs in the queue at the time. I cleared out about half of that, and its a bit faster now.

Ill try on other machines as well, there isnt much special about mine though.

rrussell · January 17, 2014, 2:33pm

Hey Laszlo,

Just wanted to follow up here. Specifically:

Have you tried reinstalling the client on your machine to see if it helps?
Have you seen this slow startup problem on other machines besides your own? If you are seeing it on other machines, is it possible for us to get remote access to one?

Thanks!
Ryan

LaszloSebo · January 24, 2014, 5:50pm

Different machine altogether:

2014-01-24 09:44:14: BEGIN - LAPRO3067\laszlo.sebo
2014-01-24 09:44:14: Deadline Monitor 6.1 [v6.1.0.54178 R]
2014-01-24 09:44:19: Time to initialize: 5.694 s
2014-01-24 09:48:49: Time to connect to Repository: 1.872 s
2014-01-24 09:48:49: Time to check user account: 16.000 ms
2014-01-24 09:48:49: Time to purge old logs: 0.000 s
2014-01-24 09:48:49: Time to synchronize plugin icons: 140.000 ms
2014-01-24 09:48:50: Time to initialize main window: 1.232 s
2014-01-24 09:48:50: Main Window shown
2014-01-24 09:48:50: Time to show main window: 187.000 ms

Now it might seem like looking at the stats that this monitor opened quick. But there is a 4-5minute mystery delay between “Time to initialize” and “Time to connect to Repository”. Not sure whats happening there

rrussell · January 24, 2014, 8:02pm

Hey Laszlo,

That’s really odd. Here’s what’s being done between:

We’re running the main monitor script.
We’re setting some initial Qt settings (QApplication stuff like style, application name, palette, etc).

That’s it. I can’t see how anything in there could take 5 minutes to run.

One thing we discovered that might be a factor is that there is a DeletedDocuments collection we’re currently not purging. The Monitor uses this collection when auto-updating to determine if it should remove stuff from its cache. As this collection grows, it can slow down the speed that the Monitor loads the jobs, slaves, etc. We’ll be addressing this for the 6.1 release so that any stubs older than 3 days will be purged, but maybe you could test on your end if purging that collection will help. Here are the steps to do it:

Use ‘mongo’ to connect to your mongo database.
Make the deadline database the current database by running this command (note that ‘deadline6db’ here is the name of the database, so if your database is named differently, use the appropriate name):

use deadline6db

Run this command:

db.DeletedDocuments.remove()

Then restart the Monitor to see if that helps.

Note that the only affect this will have on Monitors currently running is that they will miss any jobs that were deleted since their last auto-update. Restarting the Monitor will take care of that. This will NOT affect slaves, pulse, or anything else.

Cheers,
Ryan

LaszloSebo · January 24, 2014, 9:19pm

Based on the process monitor output, its cycling loading .net dlls in various locations.

Ill try the purge, but it doesnt seem like its connecting to the mongo db