Pulse leaking memory

Hey Nathan,

We had Pulse stability issues previously, but since the “Run X Job in a Separate Process” and “Write X Output to a Separate File” settings were added for the House Cleaning operations, we haven’t had any problems. There might very well be memory leaks in the house cleaning, pending job scan, and repository repair processes, but because we run them in separate processes and send their logs to separate files to avoid clashing, they don’t affect us. Do you have those settings turned on?

We are running Pulse on CentOS 6.5. Our current session has been up for ~6 weeks, and the restart before that was triggered by an update (not a crash).

We have thousands of jobs and slaves (job counts range between 5k and 10k right now, though around March/April it was 10-18k), with lots of scripted dependencies, so Pulse is under a fairly heavy load.

In terms of the web service, we have found that if thousands of queries hit it regularly, it can easily get overloaded (and thus affect Pulse performance), so we try to minimize regular queries from the farm as much as possible, or use the much less convenient command line toolset for them. Hopefully we will at some point be able to set up a cluster of web service servers to grow the API capacity without affecting core Deadline functionality (this might already be possible; we haven’t revisited this requirement in a couple of months).
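For reference, the kind of regular query we try to keep off the web service looks roughly like the sketch below; the host, port, and endpoint are assumptions, so adjust for your setup:

# Hedged sketch: a polling query against the Pulse web service.
# Host/port and the /api/jobs endpoint are assumptions for illustration;
# check your web service settings for the actual values.
import json
import urllib2

WEBSERVICE = "http://deadline03:8082"  # hypothetical host/port

def get_job(job_id):
    # One HTTP round trip per call. Thousands of these per minute
    # from the farm is what overloads the service.
    url = "%s/api/jobs?JobID=%s" % (WEBSERVICE, job_id)
    return json.load(urllib2.urlopen(url, timeout=10))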

For remote submissions we use the command line toolset, which distributes the load.
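A minimal sketch of what that looks like, assuming the usual two-file submission form of deadlinecommand (the plugin and keys shown are just illustrative):

# Hedged sketch of a remote submission through the command line toolset.
# deadlinecommand accepts a job info file and a plugin info file; the
# exact keys depend on the plugin, so treat these as illustrative.
import subprocess
import tempfile

def submit(name, frames, scene):
    job_info = tempfile.NamedTemporaryFile(suffix=".job", delete=False)
    job_info.write("Plugin=MayaBatch\nName=%s\nFrames=%s\n" % (name, frames))
    job_info.close()

    plugin_info = tempfile.NamedTemporaryFile(suffix=".job", delete=False)
    plugin_info.write("SceneFile=%s\n" % scene)
    plugin_info.close()

    # Each submitting machine does its own repository work here rather
    # than funnelling through Pulse, which is what spreads the load.
    subprocess.check_call(
        ["/opt/Thinkbox/Deadline7/bin/deadlinecommand",
         job_info.name, plugin_info.name])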

cheers
laszlo

7.1 has the separate web service. The load balancing between servers will need to be implemented on the client side.
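Something simple on the client side should do it; here is a rough sketch, assuming a list of identical web service instances (hosts and ports made up):

# Minimal client-side balancing sketch: round-robin with failover across
# several identical web service instances. Hosts/ports are assumptions.
import itertools
import urllib2

HOSTS = itertools.cycle([
    "http://ws01:8082",
    "http://ws02:8082",
    "http://ws03:8082",
])

def fetch(path, retries=3):
    last_error = None
    for _ in range(retries):
        base = next(HOSTS)
        try:
            return urllib2.urlopen(base + path, timeout=10).read()
        except urllib2.URLError as e:
            last_error = e  # try the next host
    raise last_error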

Oh that’s great! We will look into using that in the very near future.

With my PE reader, I’m seeing this:

... COFF/File header Machine: 0x14c IMAGE_FILE_MACHINE_I386 ...

According to this MSDN page, that flag means the image requires emulation of that architecture (x86). Mono x86_64 would be able to emulate x86 just fine (in the same way that x86_64 operating systems can run x86 programs), but that doesn’t mean it will force everything it runs into x86_64 addressing mode. It’s also not entirely clear whether that flag just means that the machine has to support x86, or whether the image will always run in x86 mode.
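For anyone who wants to check a binary themselves, the Machine field is easy to read by hand. Here is a minimal sketch based on the documented PE layout (the path at the bottom is just an example):

# Minimal PE/COFF Machine-field reader. The e_lfanew field at offset 0x3C
# of the DOS header points at the "PE\0\0" signature; the 2-byte Machine
# value follows immediately after it.
import struct

MACHINE_NAMES = {
    0x014c: "IMAGE_FILE_MACHINE_I386",
    0x8664: "IMAGE_FILE_MACHINE_AMD64",
}

def pe_machine(path):
    with open(path, "rb") as f:
        f.seek(0x3c)
        (pe_offset,) = struct.unpack("<I", f.read(4))
        f.seek(pe_offset)
        if f.read(4) != "PE\0\0":
            raise ValueError("not a PE image")
        (machine,) = struct.unpack("<H", f.read(2))
    return machine, MACHINE_NAMES.get(machine, "unknown")

m, name = pe_machine("/opt/Thinkbox/Deadline7/bin/deadlinecommand.exe")
print "Machine: 0x%x %s" % (m, name)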

No, we don’t. I turned those off back when we were using 6.0 because the processes Pulse was spawning would frequently fail to exit. Do you see that happening on your end at all? If not, maybe I’ll try turning them back on.

Hm… I wonder if their failure to exit is actually indicative of the leaks/problems you are seeing.
We don’t have that problem, but we also have the logs redirected. I think we had some issues with them while they were still writing to the same log files, but it’s been so long I can’t recall the details.

OK, after enabling those options to test things, Pulse seems to be more stable, but is still spiking to really high levels of memory usage. Also, the Pending Job Scan process eats a ridiculous amount of memory. Right now, the queue has just over 2900 jobs in it, and a number of them have dependency scripts that use the .NET Mongo driver to query another database, and I’m seeing that process spike over 2.4 GB when it runs.
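For context, those dependency scripts are roughly this shape. The sketch below uses pymongo as a stand-in for the .NET driver we actually use, and the entry-point signature, host, and collection names are all illustrative:

# Rough shape of our dependency scripts (illustrative only; pymongo is
# standing in for the .NET Mongo driver, and the __main__ signature is
# assumed from Deadline's scripted-dependency convention).
import pymongo

def __main__(jobID):
    client = pymongo.MongoClient("mongodb://tracker-db:27017")  # hypothetical host
    try:
        # Release the job once the upstream asset is marked published.
        doc = client.pipeline.assets.find_one({"deadline_job": jobID})
        return bool(doc and doc.get("published"))
    finally:
        client.close()  # avoid leaking a connection on every scan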

Not sure if this helps, but it might give you some sort of baseline comparison: our RAM usage for Pulse itself is around 8G virtual / 3-4G resident. The individual processes (housecleaning, pending scan, etc.) can also eat up to 1+G each. Our dependency scripts are fairly “low key” in terms of the libraries they pull in; they mostly do file globbing, but they affect about 30-40% of our jobs. Roughly like the sketch below.
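(Same caveats as above: the entry-point signature is assumed, and the path convention and frame count are made up.)

# Sketch of our "low key" file-globbing dependencies: release the job
# once the expected frames exist on disk. Paths/signature are illustrative.
import glob

def __main__(jobID):
    # Hypothetical path convention; in reality this comes from job metadata.
    frames = glob.glob("/mnt/projects/shots/%s/renders/*.exr" % jobID)
    return len(frames) >= 100  # expected frame count, made up here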

Btw, while monitoring the processes, I noticed some crash handlers, and looking in the system log I see deadlinecommand crashes:

Jun 25 14:20:38 deadline03 kernel: deadlinecommand[4672]: segfault at 8 ip 00007f345e4f52c9 sp 00007fffa9e21660 error 4 in libQtCore.so.4[7f345e470000+2d0000]
Jun 25 14:21:15 deadline03 abrt[4821]: Saved core dump of pid 4672 (/opt/Thinkbox/Deadline7/mono/bin/mono-sgen) to /var/spool/abrt/ccpp-2015-06-25-14:20:38-4672 (199159808 bytes)
Jun 25 14:21:15 deadline03 abrtd: Directory 'ccpp-2015-06-25-14:20:38-4672' creation detected
Jun 25 14:21:15 deadline03 abrtd: Executable '/opt/Thinkbox/Deadline7/mono/bin/mono-sgen' doesn't belong to any package and ProcessUnpackaged is set to 'no'
Jun 25 14:21:15 deadline03 abrtd: 'post-create' on '/var/spool/abrt/ccpp-2015-06-25-14:20:38-4672' exited with 1
Jun 25 14:21:15 deadline03 abrtd: Deleting problem directory '/var/spool/abrt/ccpp-2015-06-25-14:20:38-4672'
Jun 25 14:22:30 deadline03 kernel: deadlinecommand[5679]: segfault at 8 ip 00007f5621cf52c9 sp 00007fffaf955210 error 4 in libQtCore.so.4[7f5621c70000+2d0000]
Jun 25 14:23:13 deadline03 abrt[5861]: Saved core dump of pid 5679 (/opt/Thinkbox/Deadline7/mono/bin/mono-sgen) to /var/spool/abrt/ccpp-2015-06-25-14:22:30-5679 (201330688 bytes)
Jun 25 14:23:13 deadline03 abrtd: Directory 'ccpp-2015-06-25-14:22:30-5679' creation detected
Jun 25 14:23:13 deadline03 abrtd: Executable '/opt/Thinkbox/Deadline7/mono/bin/mono-sgen' doesn't belong to any package and ProcessUnpackaged is set to 'no'
Jun 25 14:23:13 deadline03 abrtd: 'post-create' on '/var/spool/abrt/ccpp-2015-06-25-14:22:30-5679' exited with 1
Jun 25 14:23:13 deadline03 abrtd: Deleting problem directory '/var/spool/abrt/ccpp-2015-06-25-14:22:30-5679'
Jun 25 14:24:28 deadline03 kernel: deadlinecommand[6842]: segfault at 8 ip 00007fa6150f52c9 sp 00007fff6d177ff0 error 4 in libQtCore.so.4[7fa615070000+2d0000]
Jun 25 14:24:53 deadline03 abrt[6981]: Saved core dump of pid 6842 (/opt/Thinkbox/Deadline7/mono/bin/mono-sgen) to /var/spool/abrt/ccpp-2015-06-25-14:24:28-6842 (200192000 bytes)
Jun 25 14:24:53 deadline03 abrtd: Directory 'ccpp-2015-06-25-14:24:28-6842' creation detected
Jun 25 14:24:53 deadline03 abrtd: Executable '/opt/Thinkbox/Deadline7/mono/bin/mono-sgen' doesn't belong to any package and ProcessUnpackaged is set to 'no'
Jun 25 14:24:53 deadline03 abrtd: 'post-create' on '/var/spool/abrt/ccpp-2015-06-25-14:24:28-6842' exited with 1
Jun 25 14:24:53 deadline03 abrtd: Deleting problem directory '/var/spool/abrt/ccpp-2015-06-25-14:24:28-6842'
Jun 25 14:26:08 deadline03 kernel: deadlinecommand[7757]: segfault at 8 ip 00007ffdf50f52c9 sp 00007fffaa9cbe40 error 4 in libQtCore.so.4[7ffdf5070000+2d0000]

So I think insulating these commands as separate processes helps a lot, because these crashes can go on without affecting Pulse itself.
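Btw, a quick tally along these lines is how I keep an eye on the frequency (a rough sketch; the stock CentOS 6 syslog path is assumed):

# Count deadlinecommand segfaults per hour from the kernel log lines,
# e.g. "Jun 25 14:20:38 deadline03 kernel: deadlinecommand[4672]: segfault ..."
from collections import defaultdict

counts = defaultdict(int)
with open("/var/log/messages") as log:
    for line in log:
        if "deadlinecommand" in line and "segfault" in line:
            month, day, clock = line.split()[:3]
            counts["%s %s %s:00" % (month, day, clock.split(":")[0])] += 1

for hour in sorted(counts):
    print hour, counts[hour]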

You’d think so… but these processes are all using ridiculous amounts of RAM, so it’s kind of a toss-up which one will crash when they over-allocate the system.

We haven’t had Pulse crashes in several months; I can’t actually recall any since we separated the cleanup processes. So the problem is most likely in one of those.

Our Pulse machine also hosts other services, so it has always been generously provisioned (currently at 32G RAM).
While my inner engineer says “yeah, it should probably run more memory-efficiently”, when it comes to managed languages I’m not surprised by anything anymore (my Java-based chat application is using 1.8 GB right now).

I still feel any crashes are unacceptable. When I have some time I’ll brush up on my Mono debugging.

Yep, agreed.

As you can see, something is crashing basically every time it runs (for us, anyway… it could be caused by one of our custom scripts…):

Jul  2 10:19:08 deadline03 kernel: deadlinecommand[26338]: segfault at 8 ip 00007f36aa8712c9 sp 00007fff0b4a8230 error 4 in libQtCore.so.4[7f36aa7ec000+2d0000]
Jul  2 10:20:56 deadline03 kernel: deadlinecommand[27278]: segfault at 8 ip 00007fd9ffeed2c9 sp 00007fffc4003290 error 4 in libQtCore.so.4[7fd9ffe68000+2d0000]
Jul  2 10:23:01 deadline03 kernel: deadlinecommand[28701]: segfault at 8 ip 00007f850f6f52c9 sp 00007ffff8211790 error 4 in libQtCore.so.4[7f850f670000+2d0000]
Jul  2 10:24:56 deadline03 kernel: deadlinecommand[29958]: segfault at 8 ip 00007fcaa38fe2c9 sp 00007fff6dee64f0 error 4 in libQtCore.so.4[7fcaa3879000+2d0000]
Jul  2 10:26:55 deadline03 kernel: deadlinecommand[31337]: segfault at 8 ip 00007fbbb92f52c9 sp 00007fffa3b46940 error 4 in libQtCore.so.4[7fbbb9270000+2d0000]
Jul  2 10:28:51 deadline03 kernel: deadlinecommand[330]: segfault at 8 ip 00007f09265f52c9 sp 00007fffe7133270 error 4 in libQtCore.so.4[7f0926570000+2d0000]
Jul  2 10:30:59 deadline03 kernel: deadlinecommand[1757]: segfault at 8 ip 00007fa862bf52c9 sp 00007fffd5a27b90 error 4 in libQtCore.so.4[7fa862b70000+2d0000]
Jul  2 10:33:03 deadline03 kernel: deadlinecommand[3030]: segfault at 8 ip 00007f59c13712c9 sp 00007fffb5033870 error 4 in libQtCore.so.4[7f59c12ec000+2d0000]
Jul  2 10:34:46 deadline03 kernel: deadlinecommand[4119]: segfault at 8 ip 00007f9f9bced2c9 sp 00007fffc38a5670 error 4 in libQtCore.so.4[7f9f9bc68000+2d0000]
Jul  2 10:36:41 deadline03 kernel: deadlinecommand[5360]: segfault at 8 ip 00007f7d2ef712c9 sp 00007ffff90197d0 error 4 in libQtCore.so.4[7f7d2eeec000+2d0000]
Jul  2 10:38:48 deadline03 kernel: deadlinecommand[6724]: segfault at 8 ip 00007f34c8f712c9 sp 00007fff6b68a9a0 error 4 in libQtCore.so.4[7f34c8eec000+2d0000]
Jul  2 10:40:45 deadline03 kernel: deadlinecommand[8123]: segfault at 8 ip 00007f02e3af52c9 sp 00007fff847cba00 error 4 in libQtCore.so.4[7f02e3a70000+2d0000]
Jul  2 10:42:42 deadline03 kernel: deadlinecommand[9419]: segfault at 8 ip 00007fe24a4f52c9 sp 00007fff7dfbbd90 error 4 in libQtCore.so.4[7fe24a470000+2d0000]
Jul  2 10:44:37 deadline03 kernel: deadlinecommand[10635]: segfault at 8 ip 00007f545eced2c9 sp 00007fff952744f0 error 4 in libQtCore.so.4[7f545ec68000+2d0000]
Jul  2 10:46:42 deadline03 kernel: deadlinecommand[11821]: segfault at 8 ip 00007f9c21eed2c9 sp 00007fff1ba188f0 error 4 in libQtCore.so.4[7f9c21e68000+2d0000]
Jul  2 10:48:29 deadline03 kernel: deadlinecommand[12916]: segfault at 8 ip 00007fed3a7712c9 sp 00007fff05c8f9d0 error 4 in libQtCore.so.4[7fed3a6ec000+2d0000]
Jul  2 10:50:26 deadline03 kernel: deadlinecommand[14027]: segfault at 8 ip 00007fd00a8f52c9 sp 00007fffd1d4dde0 error 4 in libQtCore.so.4[7fd00a870000+2d0000]
Jul  2 10:52:23 deadline03 kernel: deadlinecommand[15133]: segfault at 8 ip 00007f591b5ed2c9 sp 00007fffbf8bc220 error 4 in libQtCore.so.4[7f591b568000+2d0000]
Jul  2 10:54:08 deadline03 kernel: deadlinecommand[16230]: segfault at 8 ip 00007fe603ba52c9 sp 00007fff37e89de0 error 4 in libQtCore.so.4[7fe603b20000+2d0000]
Jul  2 10:56:01 deadline03 kernel: deadlinecommand[17574]: segfault at 8 ip 00007f7ebeaed2c9 sp 00007fff6af996b0 error 4 in libQtCore.so.4[7f7ebea68000+2d0000]
Jul  2 10:58:06 deadline03 kernel: deadlinecommand[18749]: segfault at 8 ip 00007f87a3ba52c9 sp 00007fff3f5943f0 error 4 in libQtCore.so.4[7f87a3b20000+2d0000]
Jul  2 11:00:06 deadline03 kernel: deadlinecommand[20032]: segfault at 8 ip 00007f715d3f52c9 sp 00007fff6c525490 error 4 in libQtCore.so.4[7f715d370000+2d0000]
Jul  2 11:02:13 deadline03 kernel: deadlinecommand[21212]: segfault at 8 ip 00007fa32e7f52c9 sp 00007fffb4808e60 error 4 in libQtCore.so.4[7fa32e770000+2d0000]
Jul  2 11:03:58 deadline03 kernel: deadlinecommand[22435]: segfault at 8 ip 00007fba056712c9 sp 00007fff0ce51010 error 4 in libQtCore.so.4[7fba055ec000+2d0000]
Jul  2 11:06:17 deadline03 kernel: deadlinecommand[23949]: segfault at 8 ip 00007f2a53af52c9 sp 00007fff3e9768a0 error 4 in libQtCore.so.4[7f2a53a70000+2d0000]
Jul  2 11:08:26 deadline03 kernel: deadlinecommand[25103]: segfault at 8 ip 00007f0e021712c9 sp 00007fff2ef64600 error 4 in libQtCore.so.4[7f0e020ec000+2d0000]
Jul  2 11:10:37 deadline03 kernel: deadlinecommand[26376]: segfault at 8 ip 00007f20138fe2c9 sp 00007fffe6ad7ea0 error 4 in libQtCore.so.4[7f2013879000+2d0000]
Jul  2 11:12:40 deadline03 kernel: deadlinecommand[27598]: segfault at 8 ip 00007ff2330f52c9 sp 00007fff7b1939a0 error 4 in libQtCore.so.4[7ff233070000+2d0000]
Jul  2 11:14:52 deadline03 kernel: deadlinecommand[28680]: segfault at 8 ip 00007f92ed8f52c9 sp 00007fffd8bc35a0 error 4 in libQtCore.so.4[7f92ed870000+2d0000]
Jul  2 11:17:03 deadline03 kernel: deadlinecommand[29937]: segfault at 8 ip 00007f7cada692c9 sp 00007fff61f3ed90 error 4 in libQtCore.so.4[7f7cad9e4000+2d0000]
Jul  2 11:19:25 deadline03 kernel: deadlinecommand[31120]: segfault at 8 ip 00007fad3b1712c9 sp 00007fff8c52c070 error 4 in libQtCore.so.4[7fad3b0ec000+2d0000]
Jul  2 11:21:23 deadline03 kernel: deadlinecommand[32364]: segfault at 8 ip 00007ff5425f52c9 sp 00007fff251c4510 error 4 in libQtCore.so.4[7ff542570000+2d0000]
Jul  2 11:23:13 deadline03 kernel: deadlinecommand[1143]: segfault at 8 ip 00007fb058ced2c9 sp 00007fffe35d5f80 error 4 in libQtCore.so.4[7fb058c68000+2d0000]
Jul  2 11:25:25 deadline03 kernel: deadlinecommand[2755]: segfault at 8 ip 00007f06de1712c9 sp 00007fff3c3488b0 error 4 in libQtCore.so.4[7f06de0ec000+2d0000]
Jul  2 11:27:28 deadline03 kernel: deadlinecommand[3990]: segfault at 8 ip 00007fe198d712c9 sp 00007fff0f399820 error 4 in libQtCore.so.4[7fe198cec000+2d0000]
Jul  2 11:29:44 deadline03 kernel: deadlinecommand[5383]: segfault at 8 ip 00007f823eded2c9 sp 00007fffc8c92030 error 4 in libQtCore.so.4[7f823ed68000+2d0000]
Jul  2 11:31:38 deadline03 kernel: deadlinecommand[6468]: segfault at 8 ip 00007f3d4a9f52c9 sp 00007ffff4d7eac0 error 4 in libQtCore.so.4[7f3d4a970000+2d0000]
Jul  2 11:33:33 deadline03 kernel: deadlinecommand[7680]: segfault at 8 ip 00007fb129ced2c9 sp 00007fffa49b6970 error 4 in libQtCore.so.4[7fb129c68000+2d0000]
Jul  2 11:35:29 deadline03 kernel: deadlinecommand[8716]: segfault at 8 ip 00007f0903d612c9 sp 00007fff1c01bb30 error 4 in libQtCore.so.4[7f0903cdc000+2d0000]
Jul  2 11:37:35 deadline03 kernel: deadlinecommand[10239]: segfault at 8 ip 00007fddefd612c9 sp 00007fffde68c710 error 4 in libQtCore.so.4[7fddefcdc000+2d0000]
Jul  2 11:39:31 deadline03 kernel: deadlinecommand[11364]: segfault at 8 ip 00007f3390a712c9 sp 00007fffda3c4c60 error 4 in libQtCore.so.4[7f33909ec000+2d0000]

Seems like it’s thrown by the pending job scan, before it even finishes running:

[root@deadline03 ~]# tail /var/log/messages | grep deadlinecommand
Jul 2 11:43:49 deadline03 kernel: deadlinecommand[14157]: segfault at 8 ip 00007fc1be2e52c9 sp 00007fffb8dab6b0 error 4 in libQtCore.so.4[7fc1be260000+2d0000]

[root@deadline03 ~]# ps aux | grep deadlinecommand
root 13138 56.0 1.8 1162632 621304 ? Rl 11:41 1:17 /opt/Thinkbox/Deadline7/mono/bin/mono-sgen /opt/Thinkbox/Deadline7/bin/deadlinecommand.exe -DoHouseCleaning False True False
root 14157 38.9 0.5 837544 166940 ? S 11:43 0:10 /opt/Thinkbox/Deadline7/mono/bin/mono-sgen /opt/Thinkbox/Deadline7/bin/deadlinecommand.exe -DoPendingJobScan False True False none
root 14421 0.0 0.0 103252 832 pts/2 S+ 11:44 0:00 grep deadlinecommand

Yeah, the pending job scan is definitely one of the biggest memory hogs in the Pulse arsenal.

So I’ve updated one of our locations to 7.1.2.1, and I’m still seeing Pulse crashes, even while continuing to use separate processes for the pending job scan, housecleaning, etc.

Just a thought from the guys here: could you try running the web service and having machines connect to that? At the very least it might show whether the crash is in the web service or in the rest of Pulse.

Just tried that. Pulse continues to die, even though the web service is running as a separate process.

Can anyone at Thinkbox confirm the bitness of the actual .NET assemblies?

Hey Nathan, from the Deadline room:

“The .NET assemblies are compiled to target ‘AnyCPU’, which means they’ll only run as 32 bit if Mono is 32 bit. However, the version of Mono we ship with Deadline is 64 bit, so the Deadline applications will run as 64 bit applications. We even have a SystemUtils.Is64Bit() API function, and I’ve checked it in a script to confirm that Deadline is 64 bit.”
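If you want to double-check on your side, the check they mention boils down to a few lines. This is a sketch; the way to run it (e.g. via deadlinecommand’s ExecuteScript option) may vary by version, so check deadlinecommand -Help for the exact spelling:

# Bitness check using the API function quoted above. The module path is
# assumed from Deadline's Python scripting API.
from Deadline.Scripting import SystemUtils

def __main__(*args):
    # Prints True when the embedded Mono (and thus Deadline) is 64 bit.
    print "64 bit:", SystemUtils.Is64Bit()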

Now, we also need to make sure Pulse isn’t still shouldering a lot of this. Do you have some time tomorrow or Friday for a remote session? I have a neat thing I want to try as well; it has something to do with Mono threading. Basically, there’s an environment variable that might help with some really low-level stuff that Mono itself dies on. These are the options:
github.com/TechEmpower/Framewor … issues/823
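Concretely, it would look something like the sketch below. MONO_THREADS_PER_CPU is a real Mono environment variable, but whether it’s the exact knob from that thread, the value shown, and the Pulse binary name are all assumptions:

# Launch Pulse with a Mono threading knob overridden. The value and the
# binary name are illustrative; the install prefix matches the logs above.
import os
import subprocess

env = dict(os.environ, MONO_THREADS_PER_CPU="2048")
subprocess.call(["/opt/Thinkbox/Deadline7/bin/deadlinepulse"], env=env)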

I’ve also read that Linux Kernel 4.0 might help this out too, but I don’t have time to test that at the moment.

Unfortunately, remote sessions are almost certainly out of the question due to security restrictions. I’m happy to run more tests on my end and get you any resulting data, although I usually have to plan them carefully, since I’m operating on the live production instance.