We had Pulse stability issues previously, but after the “Run X Job in a Separate Process” and “Write X Output to a Separate File” settings were put in for House Cleaning operations, we haven’t had any issues. There might very well be memory leaks in the house cleaning, pending job scan, and repository repair processes, but since we run them in separate processes and send their logs to separate files to avoid clashing, those leaks haven’t affected us. Do you have those settings turned on?
We are running Pulse on CentOS 6.5. Our current session has been going for ~6 weeks, and the restart at that time was triggered by an update (not a crash).
We have thousands of jobs (ranging between 5k-10k right now, but around March/April it was 10k-18k) and thousands of slaves, with lots of scripted dependencies, so Pulse is under a fairly heavy load.
In terms of the webservice, we have found that if we have thousands of queries hitting it regularly, it can easily get overloaded (and thus affect Pulse performance). So we try to minimize the regular queries from the farm as much as possible, or use the much less convenient command line toolset for them. Hopefully we will at some point be able to set up a cluster of webservice servers so the API handling can grow without affecting core Deadline functionality (this might already be possible; we have not revisited this requirement in a couple of months).
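One cheap way to cut down on regular queries like this is a client-side TTL cache, so that identical requests made within a short window reuse the last answer instead of hitting the webservice again. A minimal sketch (the `get_job_states` function and its return value are hypothetical stand-ins for whatever query your tools actually make):

```python
import time
from functools import wraps

def ttl_cache(seconds):
    """Cache a query result for `seconds` so repeated identical
    calls don't hammer a shared service."""
    def decorator(fn):
        cache = {}
        @wraps(fn)
        def wrapper(*args):
            now = time.time()
            hit = cache.get(args)
            if hit is not None and now - hit[0] < seconds:
                return hit[1]
            result = fn(*args)
            cache[args] = (now, result)
            return result
        return wrapper
    return decorator

# Hypothetical example: wrap whatever function performs the actual
# webservice (or deadlinecommand) query, so calls within 30 seconds
# reuse the cached answer.
@ttl_cache(30)
def get_job_states():
    # ...perform the real query here...
    return {"some-job-id": "Rendering"}
```

The trade-off is staleness: pick a TTL short enough that your farm tooling still reacts in time, but long enough to collapse bursts of duplicate queries.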
For remote submissions we use the command line toolset, which distributes the load.
According to this MSDN page, that flag means the image requires emulation of that architecture (x86). Mono x86_64 would be able to emulate x86 just fine (in the same way that x86_64 operating systems can run x86 programs), but that doesn’t mean it will force everything it runs into x86_64 addressing mode. It’s also not entirely clear whether that flag just means that the machine has to support x86, or whether the image will always run in x86 mode.
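For what it’s worth, you can inspect that flag directly rather than guessing. Assuming the flag being discussed is `IMAGE_FILE_32BIT_MACHINE` (0x0100) in the COFF file header, here is a minimal sketch that reads the Characteristics field from a PE image’s raw bytes (it assumes a well-formed image and does no further validation):

```python
import struct

IMAGE_FILE_32BIT_MACHINE = 0x0100  # flag described on the MSDN page

def pe_characteristics(data: bytes) -> int:
    """Return the COFF Characteristics field of a PE image."""
    if data[:2] != b"MZ":
        raise ValueError("not an MZ/PE image")
    # e_lfanew at offset 0x3C points at the "PE\0\0" signature.
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("missing PE signature")
    # The 20-byte COFF header follows the 4-byte signature;
    # Characteristics is its last 2 bytes (offset 18 within it).
    (characteristics,) = struct.unpack_from("<H", data, e_lfanew + 4 + 18)
    return characteristics

def requires_32bit(data: bytes) -> bool:
    return bool(pe_characteristics(data) & IMAGE_FILE_32BIT_MACHINE)
```

Note this only tells you what the header claims; whether the loader (or Mono) actually forces x86 addressing based on it is the separate question being discussed here.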
No, we don’t. I turned those off back when we were using 6.0 because the processes Pulse was spawning would frequently fail to exit. Do you see that happening on your end at all? If not, maybe I’ll try turning them back on.
Hm… I wonder if their failure to exit is actually indicative of the leaks / problems you are seeing.
We don’t have that problem, but we also have the logs redirected. I think we had some issues with them while they were still writing to the same log files, but it’s been so long I can’t recall the details.
OK, after enabling those options to test things, Pulse seems to be more stable, but is still spiking to really high levels of memory usage. Also, the Pending Job Scan process eats a ridiculous amount of memory. Right now, the queue has just over 2900 jobs in it, and a number of them have dependency scripts that use the .NET Mongo driver to query another database, and I’m seeing that process spike over 2.4 GB when it runs.
Not sure if this helps, but it might give you some sort of baseline comparison: our RAM usage for Pulse itself is around ~8G virtual / 3-4G resident. The individual processes (housecleaning, pending scan, etc.) can also each eat up to 1+G. Our dependency scripts are fairly “low key” in terms of dependencies (libraries they pull in); they mostly do file globbing, but affect about 30-40% of our jobs.
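To give a sense of what “low key” means here, a file-globbing dependency check is essentially just this kind of sketch (the function name and the frames-on-disk convention are hypothetical, not our actual script):

```python
import glob
import os

def frames_ready(pattern, expected_count):
    """Hypothetical dependency check: release the job once all
    expected output files exist on disk and are non-empty."""
    matches = glob.glob(pattern)
    return len(matches) >= expected_count and all(
        os.path.getsize(m) > 0 for m in matches
    )
```

A check like this pulls in nothing beyond the standard library, which is presumably why it stays cheap compared to scripts that load a full database driver per scan.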
Btw, while monitoring the processes, I noticed some crash handlers, and looking in the system log I see deadlinecommand crashes:
Jun 25 14:20:38 deadline03 kernel: deadlinecommand[4672]: segfault at 8 ip 00007f345e4f52c9 sp 00007fffa9e21660 error 4 in libQtCore.so.4[7f345e470000+2d0000]
Jun 25 14:21:15 deadline03 abrt[4821]: Saved core dump of pid 4672 (/opt/Thinkbox/Deadline7/mono/bin/mono-sgen) to /var/spool/abrt/ccpp-2015-06-25-14:20:38-4672 (199159808 bytes)
Jun 25 14:21:15 deadline03 abrtd: Directory 'ccpp-2015-06-25-14:20:38-4672' creation detected
Jun 25 14:21:15 deadline03 abrtd: Executable '/opt/Thinkbox/Deadline7/mono/bin/mono-sgen' doesn't belong to any package and ProcessUnpackaged is set to 'no'
Jun 25 14:21:15 deadline03 abrtd: 'post-create' on '/var/spool/abrt/ccpp-2015-06-25-14:20:38-4672' exited with 1
Jun 25 14:21:15 deadline03 abrtd: Deleting problem directory '/var/spool/abrt/ccpp-2015-06-25-14:20:38-4672'
Jun 25 14:22:30 deadline03 kernel: deadlinecommand[5679]: segfault at 8 ip 00007f5621cf52c9 sp 00007fffaf955210 error 4 in libQtCore.so.4[7f5621c70000+2d0000]
Jun 25 14:23:13 deadline03 abrt[5861]: Saved core dump of pid 5679 (/opt/Thinkbox/Deadline7/mono/bin/mono-sgen) to /var/spool/abrt/ccpp-2015-06-25-14:22:30-5679 (201330688 bytes)
Jun 25 14:23:13 deadline03 abrtd: Directory 'ccpp-2015-06-25-14:22:30-5679' creation detected
Jun 25 14:23:13 deadline03 abrtd: Executable '/opt/Thinkbox/Deadline7/mono/bin/mono-sgen' doesn't belong to any package and ProcessUnpackaged is set to 'no'
Jun 25 14:23:13 deadline03 abrtd: 'post-create' on '/var/spool/abrt/ccpp-2015-06-25-14:22:30-5679' exited with 1
Jun 25 14:23:13 deadline03 abrtd: Deleting problem directory '/var/spool/abrt/ccpp-2015-06-25-14:22:30-5679'
Jun 25 14:24:28 deadline03 kernel: deadlinecommand[6842]: segfault at 8 ip 00007fa6150f52c9 sp 00007fff6d177ff0 error 4 in libQtCore.so.4[7fa615070000+2d0000]
Jun 25 14:24:53 deadline03 abrt[6981]: Saved core dump of pid 6842 (/opt/Thinkbox/Deadline7/mono/bin/mono-sgen) to /var/spool/abrt/ccpp-2015-06-25-14:24:28-6842 (200192000 bytes)
Jun 25 14:24:53 deadline03 abrtd: Directory 'ccpp-2015-06-25-14:24:28-6842' creation detected
Jun 25 14:24:53 deadline03 abrtd: Executable '/opt/Thinkbox/Deadline7/mono/bin/mono-sgen' doesn't belong to any package and ProcessUnpackaged is set to 'no'
Jun 25 14:24:53 deadline03 abrtd: 'post-create' on '/var/spool/abrt/ccpp-2015-06-25-14:24:28-6842' exited with 1
Jun 25 14:24:53 deadline03 abrtd: Deleting problem directory '/var/spool/abrt/ccpp-2015-06-25-14:24:28-6842'
Jun 25 14:26:08 deadline03 kernel: deadlinecommand[7757]: segfault at 8 ip 00007ffdf50f52c9 sp 00007fffaa9cbe40 error 4 in libQtCore.so.4[7ffdf5070000+2d0000]
So I think isolating these commands in separate processes helps a lot, because these crashes can go on without affecting Pulse itself.
You’d think so… but these processes are all using ridiculous amounts of RAM, so it’s kind of a toss-up which one will crash when they over-allocate the system.
We haven’t had Pulse crashes in several months; I can’t actually recall any since we separated the cleanup processes. So the problem is most likely in one of those.
Our Pulse machine also serves other services, so it has always run on a generously allocated machine (currently 32G of RAM).
While my inner engineer says “yeah, it should probably run more memory efficiently,” when it comes to managed languages I’m not surprised by anything anymore (my Java-based chat application is using 1.8G right now).
So I’ve updated one of our locations to 7.1.2.1, and I’m still seeing Pulse crashes, even while continuing to use separate processes for the pending job scan, housecleaning, etc.
Just a thought from the guys here: could you try running the web service and have machines connect to that? At the very least it might show whether the crash is in the web service or in the rest of Pulse.
“The .NET assemblies are compiled to target ‘AnyCPU’, which means they’ll only run as 32 bit if Mono is 32 bit. However, the version of Mono we ship with Deadline is 64 bit, so the Deadline applications will run as 64 bit applications. We even have a SystemUtils.Is64Bit() API function, and I’ve checked it in a script to confirm that Deadline is 64 bit.”
Now, we also need to make sure Pulse isn’t still shouldering a lot of this. Do you have some time tomorrow or Friday for a remote session? I have a neat thing I want to try as well. It has something to do with Mono threading: basically, there’s an environment variable that might help with some really low-level stuff that Mono itself dies on. These are the options: github.com/TechEmpower/Framewor … issues/823
I’ve also read that Linux Kernel 4.0 might help this out too, but I don’t have time to test that at the moment.
Unfortunately, remote sessions are almost definitely out of the question due to security restrictions. I’m happy to try running more tests on my end and getting you any resulting data, although I usually have to plan them well, since I’m operating on the live production instance.