Hi.
We’re still working extensively here with Deadline 2.0, but our single repository server is having some difficulty at times. We frequently have dozens or hundreds of active jobs and more than 300 slaves processing them. Even with the solid state disk we’re employing for our repository file system, the Server 2003 system running on a single Pentium D processor is often CPU bound to the point where user interactivity is poor and the entire queuing system bogs down.
We’ve also recently made some network and studio infrastructure changes, and alongside numerous improvements we’re now confronted with an unfortunate increase in network latency between the artists and the repository server. This is resulting in a perceptible additional decrease in Monitor responsiveness under heavy load.
We finally have an opportunity to focus on addressing this issue and are considering the various options for improving the Deadline repository’s performance. The most obvious is to move the repository to a 4 or 8 core Woodcrest or Clovertown Xeon server, but there is the possibility that the performance bottleneck might simply be moved from the CPUs back to disk I/O. In the event that this turns out to be the case, I’m also curious about the potential performance benefits of migrating to Deadline 2.5 or 2.6.
From what I understand the behavior of Pulse is greatly improved in the newer Deadline versions. Could you take a moment and describe in some detail exactly what Pulse is currently doing and how using Pulse should impact repository server CPU and disk load? If there’s a white paper or other documentation on Pulse that I’ve missed that would also be very helpful. Otherwise are there any other changes to the Deadline slave, monitor and repository protocols that should reduce the disk I/O or CPU load on the repository server?
Lastly, we are also considering potentially moving the repository to a Red Hat/Samba server. I know that Deadline should be compatible with this OS/software platform, but are there any currently known issues with serving the repository file system with Samba? And are there versions of the Pulse daemon(s) for Linux?
Thanks,
Sean
Hi Sean, we have a similar setup here: hundreds of jobs in the queue and ~130 slaves running.
With version 2.5 the load on the server went from 100% to around 10%.
We haven’t tested the new Pulse yet, but the speed improvement of 2.5 was bliss!
Sylvain Berger | Technical Director | Alpha Vision
Hi Sean, we have a setup
similar here, hundreds of jobs
in queue and we have ~130
slaves running.
drool… drool…
How many 3D artists do you have there?
I’m wondering what the standard ratio of artists to render nodes is. I want to talk my boss into buying more slaves.
We have around 30 3D artists, so the ratio of render nodes to users is decent.
Sylvain Berger | Technical Director | Alpha Vision
Hey Sean,
Pulse will be re-included in the Deadline 2.6 release, which should be
this week. Pulse was actually excluded from the 2.5 release because it
basically needed to be rebuilt from the ground up.
The old Pulse used a folder watcher to keep track of changes in the
Deadline repository and maintain an updated cache in memory. We found
that the folder watcher class couldn’t handle larger farms, which
completely defeated the purpose of running Pulse in the first place. All
too often, changes would go unnoticed, and Pulse would often crash due
to memory issues.
The new Pulse runs on the repository machine, and acts as a proxy
between the slaves and the repository. Pulse, by default, will check for
jobs every 10 seconds. At each interval, Pulse will take the slaves that
have connected to it in the past 10 seconds and search for jobs for them
(all communication is done via TCP). The repository is only read in once
per interval, which combined with the fact that the file IO involved is
local, greatly reduces the load on the network. We’ve been running Pulse
here for the past few weeks, and CPU usage sits around 5-10% until Pulse
reads in the repository, at which point it temporarily spikes to 70-90%.
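
To make the interval idea concrete, here is a rough sketch in Python. It is only an illustration of the batching described above, not Deadline's actual code: the class name, the "pool" notion, the job list format and the one-job-per-slave matching are all assumptions, and the TCP layer is replaced by a plain method call.

```python
from collections import defaultdict


class IntervalProxy:
    """Toy stand-in for a Pulse-like proxy; names and logic are hypothetical."""

    def __init__(self, read_repository):
        # read_repository: callable returning a list of (job_id, pool) tuples.
        self.read_repository = read_repository
        self.pending = {}          # slave name -> pool it renders for
        self.repository_reads = 0  # how often the repository was scanned

    def request_job(self, slave, pool):
        """Record a slave that connected (over TCP in the real system)."""
        self.pending[slave] = pool

    def run_interval(self):
        """Serve every slave that connected since the previous interval."""
        jobs = self.read_repository()  # the only repository read per interval
        self.repository_reads += 1
        by_pool = defaultdict(list)
        for job_id, pool in jobs:
            by_pool[pool].append(job_id)
        assignments = {}
        for slave, pool in self.pending.items():
            if by_pool[pool]:
                assignments[slave] = by_pool[pool].pop(0)
        self.pending.clear()
        return assignments


def fake_repository():
    # Hypothetical queue contents standing in for the repository on disk.
    return [("job_a", "max"), ("job_b", "max"), ("job_c", "fusion")]


proxy = IntervalProxy(fake_repository)
for n in range(300):
    proxy.request_job(f"slave{n:03d}", "max" if n % 2 == 0 else "fusion")
assignments = proxy.run_interval()
print(proxy.repository_reads, assignments)
```

The point of the sketch is the counter: 300 slaves asking for work cost one repository scan per interval instead of 300 separate scans over the network.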
In addition, we’ve found that running Pulse has greatly improved the
consistency with which the Monitor refreshes. The Monitor doesn’t yet
interact with Pulse, but the lightened network load certainly helps.
As far as I know, there are no known issues with running Deadline on a
Samba server. There aren’t any Linux versions of any Deadline
applications yet, but after 2.6 is released, we will focus a lot of
energy on multiplatform support.
Cheers,
- Ryan
–
Ryan Russell
Frantic Films Software
http://software.franticfilms.com/
204-949-0070
Hey.
Very cool. I’ve been a proponent of using socket-based communication whenever possible and plausible for monitoring and management systems since I first switched from files to sockets for a monitoring tool on an old Irix farm I was looking after. The difference in the monitoring system load was about 100 to 1. It sounds like the new Pulse daemon should at least cut down drastically on repository file reads.
Reducing file writes might be trickier. With the design objective that the repository must at all times contain a valid image of current render states, the amount of file writing is difficult to reduce. Perhaps non-critical status information such as task percent completion could be maintained in the Pulse daemon instead of on disk?
Out of curiosity, what was changed in the Deadline protocols for Deadline 2.5 that resulted in the ~50% decrease in file I/O? Was it simply an increase in polling and updating intervals, or was render status data concentrated into fewer files?
-Sean
Hey Sean,
Yah, we haven’t really cut down on the file writing yet, simply for the
reason you’ve mentioned. I’m sure going forward though we can find ways
to do this.
When we started to get complaints regarding the amount of file IO in
Deadline 2.0, we decided to run Filemon on the repository folder with
one slave running. We were pretty much shocked at how much the
repository was getting hit, even for simple operations like
right-clicking on a job in the Monitor. We painstakingly went through
all aspects of Deadline and found ways to reduce the repository hits by
just being smart about when we needed to load data directly and when we
could read it from a cache. The power management code in 2.0 was
extremely bad for this, as was writing to the slave info file (we
discovered we were doing 2x as much file IO as necessary here).
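
For anyone curious what "read it from a cache" can look like in practice, below is a minimal read-through cache sketch in Python. It is not how Deadline does it; the class, the `max_age` parameter, the `force` flag and the loader function are all made up for illustration.

```python
import time


class CachedReader:
    """Read-through cache: hit the repository only when an entry is stale."""

    def __init__(self, loader, max_age=5.0, clock=time.monotonic):
        self.loader = loader    # callable that really reads the repository
        self.max_age = max_age  # seconds a cached value stays fresh
        self.clock = clock      # injectable so the example can fake time
        self._cache = {}        # key -> (time loaded, value)
        self.loads = 0          # number of real repository reads

    def get(self, key, force=False):
        now = self.clock()
        hit = self._cache.get(key)
        if not force and hit is not None and now - hit[0] < self.max_age:
            return hit[1]       # fresh enough: no repository access
        value = self.loader(key)
        self.loads += 1
        self._cache[key] = (now, value)
        return value


fake_now = [0.0]
reader = CachedReader(
    lambda key: f"contents of {key}",  # hypothetical stand-in for a file read
    max_age=5.0,
    clock=lambda: fake_now[0],
)
for _ in range(10):
    reader.get("job_001")              # one real load, nine cache hits
fake_now[0] = 6.0
reader.get("job_001")                  # entry is stale, so it reloads
reader.get("job_001", force=True)      # caller insists on fresh data
print(reader.loads)
```

The `force` flag stands in for the cases where data really has to be loaded directly; everything else is served from memory until it ages out.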
We also converted a few xml files (like tasks and job overrides) to
empty files that store all information in their filename. Now instead of
serializing every task file for a job, we simply do a directory read to
get the files and parse their filenames. This helped a lot too.
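
A small Python sketch of the filename trick, for illustration only. The naming scheme (`task_<id>_<status>_<slave>`) is invented here and is not Deadline's real file layout.

```python
import os
import tempfile


def write_task(job_dir, task_id, status, slave=""):
    """Create an empty marker file whose name carries the task state."""
    name = f"task_{task_id:05d}_{status}_{slave or 'none'}"
    open(os.path.join(job_dir, name), "w").close()


def read_tasks(job_dir):
    """Recover every task's state from a single directory listing."""
    tasks = {}
    with os.scandir(job_dir) as entries:
        for entry in entries:
            # maxsplit=3 keeps underscores inside the slave name intact.
            parts = entry.name.split("_", 3)
            if len(parts) != 4 or parts[0] != "task":
                continue  # not one of our marker files
            _, task_id, status, slave = parts
            tasks[int(task_id)] = (status, None if slave == "none" else slave)
    return tasks


with tempfile.TemporaryDirectory() as job_dir:
    write_task(job_dir, 0, "completed", "slave017")
    write_task(job_dir, 1, "rendering", "slave_042")
    write_task(job_dir, 2, "queued")
    tasks = read_tasks(job_dir)
print(tasks)
```

No file is ever opened for reading here: one directory listing replaces one XML parse per task. Under this hypothetical scheme a status change would presumably be a rename rather than a rewrite.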
Cheers,
- Ryan
Hey.
Very interesting.
We’re beginning work on migrating to 2.5 now. Is there any word on the status of 2.6? Hopefully 2.5 will give us the performance improvement we need, and we’re not expecting to go to Max 9 in the immediate future, but while we have the time to look at Deadline it could be useful to kick 2.6’s tires as well. I’m also curious about the behavior of the latest incarnation of Pulse.
-Sean
Hey Sean,
Barring any major setbacks, we plan to do the release of Deadline 2.6
tomorrow. When migrating from 2.0, make sure to read the Upgrading
section of the manual:
http://software.franticfilms.com/index.aspx?page=deadline/upgrade
There were some changes that make Deadline 2.0 incompatible with 2.5 and
later. If you are already running 2.5 when you upgrade to 2.6, the
process will be smooth like it has been in the past. If you run into any
problems during your upgrade from 2.0, let us know!
Cheers,
- Ryan
Hi.
Then is 2.6 essentially a patch to 2.5, or does it require a new license?
-Sean
Hey Sean,
Deadline 2.6 will require a new license, and you should be receiving
yours via email soon.
Cheers,
- Ryan