khugepaged

LaszloSebo · November 27, 2013, 1:37am

Hi there,

We started using the sgen garbage collector today, and later in the day started noticing a problem where the Deadline db machine would essentially hang for 10-30 seconds (sometimes longer) every 15 mins or so. The timing is very random, and it took about 4 hours for me to notice the hanging (so I'm not sure how long it took to start happening).

Looking at ‘top’ when the hang was happening, we noticed the khugepaged process completely taking over one CPU, while everything else was essentially hanging.

Here is a related Linux bug report:

bugs.centos.org/view.php?id=5716

I have now disabled that service, but I'm not sure if that's the proper thing to do.
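
For anyone who wants to confirm what state transparent hugepages are in on their own machine, here is a minimal Python sketch. It assumes the setting is exposed through the usual sysfs file, which reads something like `always madvise [never]` with the active mode in brackets. The two paths are the mainline location and the RHEL/CentOS 6 variant; the function names are just made up for this example.

```python
import re
from pathlib import Path

# Assumed sysfs locations: mainline kernels first, then the RHEL/CentOS 6 variant.
THP_PATHS = (
    "/sys/kernel/mm/transparent_hugepage/enabled",
    "/sys/kernel/mm/redhat_transparent_hugepage/enabled",
)


def parse_thp_mode(text):
    """Return the bracketed (active) mode from a line like 'always madvise [never]'."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def current_thp_mode(paths=THP_PATHS):
    """Return (path, mode) for the first readable file, or None if none can be read."""
    for p in paths:
        try:
            return p, parse_thp_mode(Path(p).read_text())
        except OSError:
            continue  # file missing or unreadable, try the next candidate
    return None


if __name__ == "__main__":
    print(current_thp_mode())
```

This only reads the current setting; it does not change anything, and it says nothing about whether disabling is the right fix.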

rrussell · November 27, 2013, 2:44pm

If you’re running the beta 12 version of Pulse, you should try using the default boehm garbage collector again instead of the sgen one. The memory usage of Pulse should be much lower, so you shouldn’t run into GC issues any more. Also, by using the original GC again, you can determine if the issues you are seeing here are indeed related to using the sgen GC.

rrussell · November 27, 2013, 2:58pm

You’re running Pulse on the same machine as the Mongo database, right? Have you guys tried running it on a separate machine to see if that helps things at all? When Pulse is doing heavier operations like housecleaning, that could potentially impact Mongo’s performance during that time.

LaszloSebo · November 27, 2013, 5:18pm

The average RAM usage of beta12 Pulse seems to be lower than beta11, but not by a lot (we are not running the webservice). It's currently around 2.3G.

What I did notice, however, is that it frees RAM much more often. It can go up to 2.5-3G, then drop back to 1.9G.

Will try boehm again (although, with khugepaged off, it's running pretty smoothly, no crash overnight).