AWS Thinkbox Discussion Forums

Yet more repository questions

Hi.



We’re still working extensively here with Deadline 2.0, but our single repository server is having some difficulty at times. We frequently have dozens or hundreds of active jobs and more than 300 slaves processing them. Even with the solid state disk we’re employing for our repository file system, the Server 2003 system running on a single Pentium D processor is often CPU bound to the point where user interactivity is poor and the entire queuing system bogs down.



We’ve also recently made some network and studio infrastructure changes, and aside from numerous improvements we’re now confronted with an unfortunate increase in network latency between the artists and the repository server. This is resulting in a perceptible additional decrease in Monitor responsiveness under heavy load.



We finally have an opportunity to focus on addressing this issue and are considering the various options for improving the Deadline repository’s performance. The most obvious is to move the repository to a 4 or 8 core Woodcrest or Clovertown Xeon server, but there is the possibility that the performance bottleneck might simply be moved from the CPUs back to disk I/O. In the event that this turns out to be the case, I’m also curious about the potential performance benefits of migrating to Deadline 2.5 or 2.6.



From what I understand the behavior of Pulse is greatly improved in the newer Deadline versions. Could you take a moment and describe in some detail exactly what Pulse is currently doing and how using Pulse should impact repository server CPU and disk load? If there’s a white paper or other documentation on Pulse that I’ve missed that would also be very helpful. Otherwise are there any other changes to the Deadline slave, monitor and repository protocols that should reduce the disk I/O or CPU load on the repository server?



Lastly, we are also considering potentially moving the repository to a Red Hat/Samba server. I know that Deadline should be compatible with this OS/software platform, but are there any currently known issues with serving the repository file system with Samba? And are there versions of the Pulse daemon(s) for Linux?





Thanks,



Sean

Hi Sean, we have a similar setup here: hundreds of jobs in the queue and ~130 slaves running.



With version 2.5 the load on the server went from 100% to around 10%.



We haven’t tested the new Pulse yet, but the speed improvement of 2.5 was bliss!



Sylvain Berger | Technical Director | Alpha Vision


Hi Sean, we have a setup similar here, hundreds of jobs in queue and we have ~130 slaves running.



drool… drool…



How many 3D artists do you have there?



I’m wondering what the standard artist-to-farm ratio is. I want to talk my boss into buying more slaves.

We have around 30 3D artists, so the ratio of render nodes per user is decent.



Sylvain Berger | Technical Director | Alpha Vision


Hey Sean,



Pulse will be re-included in the Deadline 2.6 release, which should be this week. Pulse was actually excluded from the 2.5 release because it basically needed to be rebuilt from the ground up.



The old Pulse used a folder watcher to keep track of changes in the Deadline repository and maintain an updated cache in memory. We found that the folder watcher class couldn’t handle larger farms, which completely defeated the purpose of running Pulse in the first place. All too often, changes would go unnoticed, and Pulse would often crash due to memory issues.



The new Pulse runs on the repository machine and acts as a proxy between the slaves and the repository. By default, Pulse checks for jobs every 10 seconds. At each interval, Pulse takes the slaves that have connected to it in the past 10 seconds and searches for jobs for them (all communication is done via TCP). The repository is only read once per interval, which, combined with the fact that the file I/O involved is local, greatly reduces the load on the network. We’ve been running Pulse here for the past few weeks, and CPU usage sits around 5-10% until Pulse reads in the repository, when it temporarily spikes to 70-90%.
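
As a rough illustration of the batching idea described above, here is a minimal Python sketch. It is not Pulse's actual code: the `BatchingProxy` class, its method names, and the first-come job assignment are all hypothetical, and the TCP layer is left out entirely. The only point it shows is that one repository read per interval can serve every slave that connected during that interval.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class BatchingProxy:
    """Hypothetical stand-in for a Pulse-like proxy (not Deadline's real API)."""

    # Assumed callback that reads the repository once and returns queued job names.
    load_jobs: Callable[[], List[str]]
    # Seconds between ticks; 10 matches the default mentioned in the post.
    interval: float = 10.0
    pending: List[str] = field(default_factory=list)
    repository_reads: int = 0

    def connect(self, slave_name: str) -> None:
        """Record a slave that asked for work during the current interval."""
        self.pending.append(slave_name)

    def tick(self) -> Dict[str, Optional[str]]:
        """Run once per interval: a single repository read serves all waiting slaves."""
        if not self.pending:
            return {}  # nobody asked for work, so skip the read entirely
        jobs = list(self.load_jobs())
        self.repository_reads += 1
        assignments: Dict[str, Optional[str]] = {}
        for slave in self.pending:
            # Simplistic first-come assignment; real scheduling would use pools, priority, etc.
            assignments[slave] = jobs.pop(0) if jobs else None
        self.pending.clear()
        return assignments
```

In this sketch, three slaves connecting within the same interval cost one repository read instead of three; a caller would invoke `tick()` from a timer every `interval` seconds.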



In addition, we’ve found that running Pulse has greatly improved the consistency with which the Monitor refreshes. The Monitor doesn’t yet interact with Pulse, but the lightened network load certainly helps.



As far as I know, there are no known issues with running the Deadline repository on a Samba server. There aren’t Linux versions of any Deadline applications yet, but after 2.6 is released, we will focus a lot of energy on multiplatform support.



Cheers,

Hey.



Very cool. I’ve been a proponent of using socket-based communication whenever possible and plausible for monitoring and management systems since I first switched from files to sockets for a monitoring tool on an old Irix farm I was looking after. The difference in the monitoring system load was about 100 to 1. It sounds like the new Pulse daemon should at least cut down drastically on repository file reads.



Reducing file writes might be a bit trickier. With the design objective that the repository must contain at all times a valid image of current render states, the amount of file writing is difficult to reduce. Perhaps non-critical status information such as task percent completion could be maintained in the Pulse daemon instead of on disk?



Out of curiosity, what was changed in the Deadline protocols for Deadline 2.5 that resulted in the ~50% decrease in file I/O? Was it simply an increase in polling and updating intervals, or was render status data concentrated into fewer files?



-Sean

Hey Sean,



Yah, we haven’t really cut down on the file writing yet, simply for the reason you’ve mentioned. I’m sure going forward though we can find ways to do this.



When we started to get complaints regarding the amount of file I/O in Deadline 2.0, we decided to run Filemon on the repository folder with one slave running. We were pretty much shocked at how much the repository was getting hit, even for simple operations like right-clicking on a job in the Monitor. We painstakingly went through all aspects of Deadline and found ways to reduce the repository hits by just being smart about when we needed to load data directly and when we could read it from a cache. The power management code in 2.0 was extremely bad for this, as was writing to the slave info file (we discovered we were doing 2x as much file I/O as necessary here).
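
To make the "load directly versus read from a cache" choice concrete, here is a small Python sketch of a time-limited read cache. It is only an illustration under assumptions, not how Deadline implements it: `CachedReader`, `max_age`, and the `force` flag are hypothetical names, and the loader is whatever function actually reads a repository file.

```python
import time


class CachedReader:
    """Hypothetical read-through cache; names and policy are illustrative only."""

    def __init__(self, loader, max_age=5.0, clock=time.monotonic):
        self._loader = loader      # function that really reads from the repository
        self._max_age = max_age    # seconds a cached copy is considered fresh
        self._clock = clock        # injectable so the cache can be tested without sleeping
        self._entries = {}         # path -> (time loaded, data)
        self.loads = 0             # counts real repository hits

    def get(self, path, force=False):
        """Return cached data while it is fresh; hit the repository otherwise."""
        now = self._clock()
        entry = self._entries.get(path)
        if not force and entry is not None and now - entry[0] < self._max_age:
            return entry[1]
        data = self._loader(path)
        self.loads += 1
        self._entries[path] = (now, data)
        return data
```

An operation like opening a context menu would call `get(path)` and usually be served from memory, while an operation that must see the latest state would pass `force=True`.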



We also converted a few XML files (like tasks and job overrides) to empty files that store all information in their filename. Now instead of deserializing every task file for a job, we simply do a directory read to get the files and parse their filenames. This helped a lot too.
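
The filename trick can be sketched in a few lines of Python. The naming scheme below (`<id>_<state>_<first>-<last>.task`) is invented for the example and is not Deadline's actual format; it also assumes state names contain no underscore and frame numbers are non-negative. What it demonstrates is that a single directory listing yields every task's state without opening a file.

```python
import os
import tempfile


def encode_task_name(task_id, state, frames):
    """Build an empty-file name carrying the task data (hypothetical scheme)."""
    return f"{task_id:05d}_{state}_{frames[0]}-{frames[1]}.task"


def parse_task_name(filename):
    """Recover the task data from the filename alone."""
    stem, _ext = os.path.splitext(filename)
    task_id, state, frame_range = stem.split("_")
    first, last = frame_range.split("-")
    return {"id": int(task_id), "state": state, "frames": (int(first), int(last))}


def read_tasks(job_dir):
    """One directory read replaces opening and parsing every task file."""
    tasks = [
        parse_task_name(entry.name)
        for entry in os.scandir(job_dir)
        if entry.name.endswith(".task")
    ]
    return sorted(tasks, key=lambda task: task["id"])


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as job_dir:
        for name in (
            encode_task_name(1, "Completed", (1, 10)),
            encode_task_name(2, "Queued", (11, 20)),
        ):
            # The files stay empty; all information lives in the name.
            open(os.path.join(job_dir, name), "w").close()
        for task in read_tasks(job_dir):
            print(task)
```

In a scheme like this, changing a task's state becomes a rename rather than a rewrite of file contents, though whether Deadline does it that way is not stated in the thread.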



Cheers,

Hey.



Very interesting.



We’re beginning work on migrating to 2.5 now. Is there any word on the status of 2.6? Hopefully 2.5 will give us the performance improvement we need, and we’re not expecting to go to Max 9 in the immediate future, but while we have the time to look at Deadline it could be useful to kick 2.6’s tires as well. I’m also curious about the behavior of the latest incarnation of Pulse.



-Sean

Hey Sean,



Barring any major setbacks, we plan to release Deadline 2.6 tomorrow. When migrating from 2.0, make sure to read the Upgrading section of the manual:

http://software.franticfilms.com/index.aspx?page=deadline/upgrade



There were some changes that make Deadline 2.0 incompatible with 2.5 and later. If you are already running 2.5 when you upgrade to 2.6, the process will be smooth like it has been in the past. If you run into any problems during your upgrade from 2.0, let us know!



Cheers,

Hi.



Then is 2.6 essentially a patch to 2.5, or does it require a new license?



-Sean

Hey Sean,



Deadline 2.6 will require a new license, and you should be receiving yours via email soon.



Cheers,
