Pulse CPU Usage

We occasionally have problems with clients being unable to contact our Pulse server, and a separate problem where, once every week or two, we find that the deadlinepulse process has gone away on the server. Looking at the log doesn’t seem to show any errors or unusual reports (it would help if the entries in the log file were timestamped, by the way).

This is not a crisis problem, but it recurs now and then so we’d like to see if there are any configuration tweaks we can do.

Specifically, when we look on the Pulse server, the deadlinepulse process is always using 100% of the CPU. The amount of memory really isn’t that high (and the memory usage seems to be quite stable). Here’s the line from top:

top - 12:15:01 up 33 days, 22:03,  2 users,  load average: 1.58, 1.43, 1.37

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+   COMMAND
18339 root      15   0  714m 199m 6072 S 98.7 38.9  896:47.31 deadlinepulse

Note that even though the CPU is pegged, the load average isn’t especially high, so it’s getting around to running all the threads reasonably quickly.

If we look at the threads within the deadlinepulse process, one thread (presumably it’s the master thread) is responsible for almost all the CPU usage. I’m guessing the other threads are the repository read threads, which probably get relaunched every repository read interval.

[leo.hourvitz@tarvos]~% ps uH p 18339
USER       PID %CPU %MEM    VSZ    RSS TTY  STAT START   TIME COMMAND
root     18339  0.0 38.9 732052 204116 ?    Sl   Apr06   0:35 mono /usr/local
root     18339  0.0 38.9 732052 204116 ?    Sl   Apr06   0:00 mono /usr/local
root     18339  0.0 38.9 732052 204116 ?    Sl   Apr06   0:00 mono /usr/local
root     18339  0.0 38.9 732052 204116 ?    Sl   Apr06   0:00 mono /usr/local
root     18339  0.0 38.9 732052 204116 ?    Sl   Apr06   0:00 mono /usr/local
root     18339  0.1 38.9 732052 204116 ?    Sl   Apr06   1:56 mono /usr/local
root     18339  2.4 38.9 732052 204116 ?    Sl   Apr06  23:29 mono /usr/local
root     18339 87.9 38.9 732052 204116 ?    Rl   Apr06 856:35 mono /usr/local
root     18339  0.0 38.9 732052 204116 ?    Sl   Apr06   0:00 mono /usr/local
root     18339  0.0 38.9 732052 204116 ?    Sl   11:57   0:00 mono /usr/local
root     18339  0.0 38.9 732052 204116 ?    Sl   12:01   0:00 mono /usr/local
root     18339  0.0 38.9 732052 204116 ?    Sl   12:01   0:00 mono /usr/local
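(For reference, the same per-thread view can be watched live with the standard procps tools rather than a one-off ps snapshot; nothing Deadline-specific here, and 18339 is just the PID from the top output above.)

  top -H -p 18339                                  # per-thread CPU, updating live
  ps -p 18339 -L -o pid,tid,pcpu,stat,time,comm    # one-shot per-thread listing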

We have about 300 slaves but this situation persists even when most of them are idle and/or offline. For instance, right now, there are 85 active slaves and 93 idle slaves but as described above the Pulse server is slammed. In fact, every time we’ve ever looked the Pulse server is at 100% CPU. Now, our Pulse server isn’t a fast machine or anything, but it is dedicated to Pulse (the repository server is separate), so we’re not concerned with reducing the CPU load for its own sake; we’re perfectly happy to have this machine run flat out servicing Pulse requests. I’m only mentioning that because it may be related to the minor reliability issues.

All of our Pulse-related repository settings are the default values. To alleviate the problem where occasionally clients can’t contact Pulse, it seems like it might be a good idea to raise the “Message Timeout in Milliseconds”, perhaps to 1000 or 2000?

The other thing that I’m thinking when looking at this is that perhaps we should raise the “Repository Scan Interval” from the current value of 5 seconds. The top and ps output suggests that our Pulse server is constantly scanning the repository, which I doubt was the intention! Our repository currently has about 1200 non-archived jobs in it (plus thousands of archived jobs).

However, it’s not clear that either of those would explain why we see the Pulse process go away once every week or two. Any advice is welcome, particularly from anyone who has experience tweaking those repository parameters.

Leo

Hi Leo,

A few initial questions:

  1. Which version of Deadline are you using?
  2. Which version of Mono do you have installed on the Pulse machine?
  3. What are the specs of the Pulse machine (CPUs, RAM, OS, etc)?

The Deadline logs have had timestamps for a while (I can’t remember which version they were added in, though). Are you looking at the console, or the actual Pulse log file? There should be a logs folder in the Deadline Client install folder that contains these logs. If you find a log from a crashed session, feel free to post it and we’ll take a look.

This information should help us proceed in figuring out this problem.

Thanks!

  • Ryan

Hi Russell,

We’re on Deadline 4.1 (specifically 4.1.0.43205).

Here’s the mono --version output:
leo.hourvitz tarvos.ppi.co.jp /home/leo.hourvitz % mono --version
Mono JIT compiler version 2.6.4 (tarball 2010年 7月 6日 火曜日 19:44:15 JST)
Copyright © 2002-2010 Novell, Inc and Contributors. www.mono-project.com
TLS: __thread
GC: Included Boehm (with typed GC and Parallel Mark)
SIGSEGV: altstack
Notifications: epoll
Architecture: amd64
Disabled: none

Our Pulse host is a virtual machine, so the hardware specs aren’t directly applicable… It has a 500MB memory allocation, as implied by the top output (200MB at ~40% => ~500MB). We realize this isn’t a heavyweight machine; we’re just puzzled as to why Pulse saturates even this lightweight machine no matter what the situation is…

The file I was looking at that did not have timestamps was /var/log/deadlinepulse.log. Thanks to your message, I found /usr/local/Prime_Focus/Deadline4.1SP1/logs as well, and the logs there do have timestamps, as you said!

One thing you could try is upgrading to Mono 2.6.7. This is the latest stable release, and we definitely recommend it for Deadline 4.x:
go-mono.com/mono-downloads/download.html

If that doesn’t help, it would be great if we could figure out where the bulk of the CPU cycles are being used. One way to do this would be to run Pulse without having any slaves connecting to it. You could probably do this by cloning your VM. Since the new VM should have a different hostname/IP, no slaves will actually connect to it, and Pulse will just sit there updating its data. If the CPU is still pegged, then we know the issue is with Pulse updating its data from the repository.

A couple more questions:

  1. How many CPUs does this VM have?
  2. What protocol are you using to mount the repository share? (A quick way to check on Linux is shown below.)
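For the second question, if you’re not sure, the mount type and options show up in /proc/mounts on Linux. This is generic Linux rather than anything Deadline-specific; adjust the grep pattern to match wherever the repository share is actually mounted:

  grep -i repository /proc/mounts
  # or equivalently:
  mount | grep -i repository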

Cheers,

  • Ryan

Hi Russell and all,

We’ve tried to address this situation by moving to a faster server, but that’s actually caused the overall problem to become much worse and it has now reached crisis proportions.

As a reminder, we’re on Deadline 4.1. We migrated the Pulse server to another machine (Windows XP64, 8GB RAM, dual core). On this system, CPU usage hovers just above 50% most of the time; we see occasional spikes to 100% but it’s not pegged there for long periods of time.

However, every night when our render farm gets going, every machine (both Slaves and Monitor) decides that Pulse is down, and starts trying to read the repository files directly. This is of course an unmitigated disaster; 400+ machines reading thousands of small XML files from a Samba share causes overall Deadline performance to fall to a fraction of what we need. The load average on our Samba server goes up to 250+ because of the thousands of tiny read requests coming in.

At the moment our repository contains 1,072 active/queued/failed/completed jobs, and about 31,000 archived jobs. Is the quantity of archived jobs a problem? The total storage used is about 1.3TB, but the jobs/ directory only accounts for 14GB of that. Is the jobs directory the most active part of the repository? That directory could possibly be migrated to an SSD, although we don’t have ~2TB of SSD storage sitting around anywhere.
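(In case the breakdown is useful: per-directory sizes like these are straightforward to check with du; the path below is just a placeholder for wherever the repository share is mounted, and jobs/ and jobsArchived/ are the standard directory names at the top of the repository.)

  cd /path/to/Deadline4.xRepository    # placeholder path
  du -sh . jobs jobsArchived           # total, live jobs, archived jobs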

The root of the problem of course isn’t Samba – this is why you provide Pulse! The problem is that for some reason all of our Deadline applications decide Pulse is down, even though as far as we can see Pulse never crashes or completely goes offline. We desperately need to fix this; our throughput is 50% or less of what it was two weeks ago when Pulse was on the slower machine. We have already tried doubling PulseMaxConnectionAttempts and PulseMessageTimeoutMilliseconds. Perhaps we should consider increasing them by another 10x? Should we dramatically increase the Repository Scan Interval to try and reduce the chance that Pulse is unresponsive because it’s doing repository scans? Also, we don’t currently use power management, so we’d be happy to turn that feature off for the moment, but the documentation for the Power Management Check Interval doesn’t say whether setting it to 0 would be OK or not. We have about 12 hours to adjust things before tonight’s rendering starts.

Any suggestions you have for settings we could change to reduce to near zero the probability that all of our machines will decide to bypass Pulse would be greatly appreciated in the short term. Frankly at this point we don’t care if it reduces the overall reliability of the farm; we simply need to get back to where Deadline functions most of the time before worrying about long-term reliability.

Of course, we’re also trying to figure out why we’re seeing this when you have other customers (presumably with larger Windows-based render farms) who don’t have the problem. One thing about us that’s probably different from many places is that we have a system whereby a background process wakes up and runs the Deadline Slave on the machine automatically at logout. Thus, every evening between about 8pm and 10pm, a couple hundred Slaves add themselves to the farm in a fairly short period of time. Is there anything about that which might make Pulse temporarily unresponsive?

We have no idea whether it’s constant or not, but we see these messages in many of the Slaves’ logs during the time when things have gone horribly wrong (our new Pulse server described above is called “keikou”):

Scheduler - Contacting Deadline Pulse running on "keikou.ppi.co.jp", port 5046.
Scheduler - Requesting work from Deadline Pulse...
Scheduler - Pulse returned an invalid message.

There’s no way to run multiple redundant Pulse servers on one repository, huh?

Believe me we are hoping to hear from you soon!

Leo
networkSettings.xml (6.57 KB)

Two more possibly relevant pieces of information:

  1. When we look around on the repository server, we see hundreds of files named “connectReadWriteTest.XXXXX”.
    These seem to come and go, but they’re present a lot of the time.

% ls Deadline4.xRepository
...
connectReadWriteTest.T66ge*  connectReadWriteTest.zv0JP*
connectReadWriteTest.TBMFi*  connectReadWriteTest.zzCrl*
connectReadWriteTest.TE7Li*  corrupted/
connectReadWriteTest.TgUJP*  defaults/
connectReadWriteTest.Tq27I*  flexlmTools/
connectReadWriteTest.TwFks*  jobs.backup/
connectReadWriteTest.UC2JX*  jobs/
connectReadWriteTest.UDwcK*  jobsArchived/
connectReadWriteTest.UQ2KG*  jobsDropped/
connectReadWriteTest.UR8v4*  limitGroups/
connectReadWriteTest.UgpGH*  plugins/
connectReadWriteTest.Uq9E1*  pulse/
connectReadWriteTest.UuXPE*  reports/
connectReadWriteTest.VAYXn*  scripts/
connectReadWriteTest.VDdwO*  settings/
connectReadWriteTest.VHbRT*  slaves/
connectReadWriteTest.VHrLj*  submission/
connectReadWriteTest.VVr5p*  temp/
connectReadWriteTest.Vir6t*  trash/
connectReadWriteTest.Vy30n*  uninstall*
connectReadWriteTest.W7NvS*  users/
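    (For anyone who wants an exact count, a shell glob does it; the path here is just the repository directory from the listing above:)

    ls Deadline4.xRepository/connectReadWriteTest.* | wc -l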

  2. As we mentioned, we upgraded to a faster Pulse server, and things got worse even though on the new server the Pulse CPU load was lower. However, we didn’t mention another possibly salient detail: the new server is Windows, whereas the old server was Linux. We’re wondering whether Windows somehow has some networking hangup when used as a Pulse server?

Our repository server is Linux, but it shares the repository via CIFS even when the Pulse server is also Linux.

Finally, is there any way for us to talk to someone on the phone? We’re pretty much at the we-can’t-go-home-until-it’s-solved stage.

We think we now understand why the new, faster server made things worse. It was running Windows… not Windows Server… and it looks like there’s a pretty low limit on how many concurrent TCP/IP connections a Windows client can support.
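(If anyone else wants to sanity-check this on their own Pulse box, counting the sockets on the Pulse port gives a rough idea of how many connections it is holding. 5046 is the port from the Slave log above, and the count includes sockets in TIME_WAIT and the like, so it overstates truly active connections somewhat; these are just the stock netstat tools on each platform.)

  On the Windows Pulse machine:  netstat -an | find /c ":5046"
  On a Linux Pulse machine:      netstat -an | grep -c ":5046"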

We’ve temporarily reverted to the old Pulse server, which means we are back in the 100% CPU usage on Pulse situation again. We still would really like to know how to configure things such that Deadline clients are extremely unlikely to ever “go around” Pulse and read the repository directly.

In the meantime, we’re preparing a higher-performance Linux machine to migrate Pulse to as well.

Hi Leo,

I didn’t realize that non-server Windows had a TCP/IP connection limitation as well, but that would definitely explain why the problem escalates once those couple hundred slaves add themselves to the farm. Hopefully you’ll get better results with the new Pulse machine once you get it online.

There are a couple of things you can do in the meantime to offset this problem, though. The first is to consider clearing out some or all of your archived jobs. Archived jobs aren’t scanned when Pulse is looking for a job to render, so archiving jobs helps performance in that sense, but Pulse still loads them in when scanning the repository (so it can provide the Monitor with that information). Maybe we shouldn’t be doing that, but that’s something we’ll have to look at later.

Another thing you can do is increase the Delay Between Repository Polls setting under the Slave Settings in the Repository Options. Currently, it’s set to 70 seconds, but maybe consider increasing it to 120 or 180 seconds (2-3 minutes). That way, if the connection to Pulse gets lost, the repository won’t get hit as often. Bumping up the Delay Between Pulse Polls to 20 might help alleviate Pulse a bit as well.

The connectReadWriteTest files are just the Slaves ensuring they have read/write permissions when they start up. They get cleaned up over time and should be nothing to worry about.

Hope this helps!

  • Ryan