We occasionally have problems with clients being unable to contact our Pulse server, and a separate problem where, once every week or two, we find that the deadlinepulse process has gone away on the server. Looking at the log doesn’t show any errors or unusual reports (it would help if the entries in the logfile were timestamped, by the way).
Neither is a crisis, but both recur often enough that we’d like to see whether there are any configuration tweaks we can make.
Specifically, whenever we look at the Pulse server, the deadlinepulse process is using 100% of the CPU. Memory usage really isn’t that high (and it seems to be quite stable). Here’s the line from top:
top - 12:15:01 up 33 days, 22:03, 2 users, load average: 1.58, 1.43, 1.37
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
18339 root 15 0 714m 199m 6072 S 98.7 38.9 896:47.31 deadlinepulse
Note that even though the CPU is pegged, the load average isn’t especially high, so it’s getting around to running all the threads reasonably quickly.
If we look at the threads within the deadlinepulse process, one thread (presumably the master thread) is responsible for almost all the CPU usage. I’m guessing the other threads are the repository read threads, which probably get relaunched every repository read interval.
[leo.hourvitz@tarvos]~% ps uH p 18339
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 18339 0.0 38.9 732052 204116 ? Sl Apr06 0:35 mono /usr/local
root 18339 0.0 38.9 732052 204116 ? Sl Apr06 0:00 mono /usr/local
root 18339 0.0 38.9 732052 204116 ? Sl Apr06 0:00 mono /usr/local
root 18339 0.0 38.9 732052 204116 ? Sl Apr06 0:00 mono /usr/local
root 18339 0.0 38.9 732052 204116 ? Sl Apr06 0:00 mono /usr/local
root 18339 0.1 38.9 732052 204116 ? Sl Apr06 1:56 mono /usr/local
root 18339 2.4 38.9 732052 204116 ? Sl Apr06 23:29 mono /usr/local
root 18339 87.9 38.9 732052 204116 ? Rl Apr06 856:35 mono /usr/local
root 18339 0.0 38.9 732052 204116 ? Sl Apr06 0:00 mono /usr/local
root 18339 0.0 38.9 732052 204116 ? Sl 11:57 0:00 mono /usr/local
root 18339 0.0 38.9 732052 204116 ? Sl 12:01 0:00 mono /usr/local
root 18339 0.0 38.9 732052 204116 ? Sl 12:01 0:00 mono /usr/local
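In case it helps, the uH output above repeats the PID on every line rather than showing thread IDs, so to pin down exactly which thread is the busy one I was planning to run something like the following (assuming standard procps tools; 18339 is just the PID from the snapshot above):

# list the threads of the pulse process with their thread IDs (LWP) and per-thread CPU
ps -L -o lwp,pcpu,stat,time,comm -p 18339

# or watch per-thread CPU usage live
top -H -p 18339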
We have about 300 slaves, but this situation persists even when most of them are idle and/or offline. For instance, right now there are 85 active slaves and 93 idle slaves, yet as described above the Pulse server is slammed. In fact, every time we’ve ever looked, the Pulse server has been at 100% CPU. Now, our Pulse server isn’t a fast machine or anything, but it is dedicated to Pulse (the repository server is separate), so we’re not concerned with reducing the CPU load for its own sake; we’re perfectly happy to have this machine run flat out servicing Pulse requests. I’m only mentioning the CPU load because it may be related to the minor reliability issues.
All of our Pulse-related repository settings are the default values. To alleviate the problem where clients occasionally can’t contact Pulse, would it be a good idea to raise the “Message Timeout in Milliseconds”, perhaps to 1000 or 2000?
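Before touching that, I figure a quick check from one of the affected clients might tell us whether the failures are slow responses or outright connection problems. Something along these lines is what I have in mind (PULSE_HOST and PULSE_PORT are placeholders; I’d need to look up the port Pulse is actually configured to listen on, and this assumes nc is installed on the client):

# time a bare TCP connect from a client to the Pulse machine
time nc -z -w 5 PULSE_HOST PULSE_PORT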
The other thing I’m wondering is whether we should raise the “Repository Scan Interval” from its current value of 5 seconds. The top and ps output suggest that our Pulse server is constantly scanning the repository, which I doubt was the intention! Our repository currently has about 1200 non-archived jobs in it (plus thousands of archived jobs).
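To confirm that the busy thread really is spending its time scanning the repository (rather than just spinning), I was thinking of grabbing a short syscall summary with strace, assuming it’s available on the box; if -f doesn’t attach to the already-running threads on our (older) strace, -p can be pointed at the busy thread ID from the ps -L output instead:

# attach to the pulse process for ~10 seconds, then Ctrl-C to get per-syscall counts
strace -f -c -p 18339
# lots of stat/open/read calls against repository paths would confirm constant scanning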
However, it’s not clear that either of those would explain why we see the Pulse process go away once every week or two. Any advice is welcome, particularly from anyone who has experience tweaking those repository parameters.
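In the meantime, to at least get a timestamp on when the process disappears (since the Pulse log itself isn’t timestamped), I’m considering a trivial cron watchdog along these lines; the log filename is just something I made up:

# run every minute; note the time whenever deadlinepulse is no longer running
* * * * * pgrep -f deadlinepulse > /dev/null || echo "$(date) deadlinepulse not running" >> /var/log/pulse-watch.log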
Leo