Pulse, Performance, and Repository Migrations

anon5390329 · June 20, 2009, 10:25am

Hi all,

Sorry that this is a long message rather than a specific bug report, but we’d love to get any insights relevant to our situation.

Unfortunately, we got started on Deadline without a dedicated Pulse machine because our systems department didn’t set one up in time. The result is that our in-use repository is on a generic file server where we cannot run Pulse.

We now have about 75 Slaves, and a slightly larger number of machines that submit jobs. The number of Slaves will increase next week.

For reasons that aren’t worth explaining, we have accumulated about 1700 pending jobs in the repository. They’re real jobs, so deleting them isn’t an option. Deadline Monitor now takes about 20 minutes to launch, or to update the jobs panel.

We now have a machine for running Pulse. I set up Pulse, and set the job poll interval to 1800 seconds (30 min). However, most of the running Monitor applications report Pulse as being stalled most of the time. When you log into the Pulse machine itself, Pulse appears to be running along perfectly happily. Running Pulse hasn’t had any noticeable performance impact, positive or negative.

As recommended, we would love to migrate the repository to be local to the Pulse machine. However, that involves changing the repository path, and we are in the middle of production (despite the backlog, hundreds of jobs are successfully running every day). People are continually submitting jobs, and Slaves are continually taking jobs.

Some questions for anyone who might have ideas…

A. Why do the Monitor applications think Pulse is stalled most of the time? I’m sure Pulse is taking the same 20+ minutes to do a repository scan, but the poll interval is set longer than that. I tried raising a couple of the other Pulse-related timeouts in the Repository Settings, but the Monitors still usually report Pulse as stalled. Directly on the Pulse machine, there are no error messages or log messages indicating anything unusual.

B. Why isn’t Pulse making any difference to Monitor responsiveness? We had hoped (fervently) that we would see dramatically faster response times, even though Pulse would be slow to see new jobs. Pulse may or may not be making a difference to Slave job select times – it’s hard to know – but it definitely is not helping at Slave startup time; when Slaves start up they visibly take 15+ minutes to select their first job.

C. Has anybody ever been through a hot migration of a repository? Some of our tasks can take up to 12 hours depending on how the shot people submit them, and it’s simply not an option to shut down rendering for 12 hours in order to migrate the repository right now. It may be that we’re stuck with running Pulse out of the current repository until we’re through this production crunch, but if there’s a clever way to migrate (even if it’s hacky) we’d do some extra work to get responsiveness back.

Again, sorry for such a long post.

Leo

rrussell · June 21, 2009, 1:37am

Hi Leo,

Have you had a chance to check out our Network Performance Guide?
franticfilms.com/software/su … rmance.php

That might help things out. Also, if you still need to run Pulse, try setting the job poll interval to something like 20 or 30 seconds (we use 20 seconds here, and we have about 200 slaves). With it currently being set to 30 minutes, I imagine that’s why you’re not seeing any improvement whatsoever. Give this a try and let us know if you get better results.
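As a rough back-of-the-envelope illustration of why the poll interval matters, here is a minimal sketch of a generic polling cache. It is not how Pulse is actually implemented: the function name, the assumption that cached data reflects the repository as of the start of a scan, and the assumption that scans start one interval apart (or back to back when a scan takes longer than the interval) are all hypothetical.

```python
def worst_case_staleness(poll_interval_s: float, scan_duration_s: float) -> float:
    """Worst-case age (in seconds) of cached job data in a simple polling model.

    Assumptions (hypothetical, not taken from Deadline documentation):
      * each scan captures the repository state as of the moment it starts;
      * the cache is replaced only when a scan finishes;
      * scans start every poll_interval_s, or back to back if a scan
        takes longer than the interval.
    """
    # Time between the starts of two consecutive scans.
    cycle_s = max(poll_interval_s, scan_duration_s)
    # Just before the next scan finishes, the cache still holds data
    # captured at the start of the previous scan.
    return cycle_s + scan_duration_s


if __name__ == "__main__":
    # 30-minute interval with a 20-minute scan (the figures from this thread).
    print(worst_case_staleness(1800, 1200) / 60, "minutes")
    # 20-second interval with the same 20-minute scan.
    print(worst_case_staleness(20, 1200) / 60, "minutes")
    # 20-second interval with a hypothetical 5-second scan.
    print(worst_case_staleness(20, 5), "seconds")
```

In this model a shorter interval only pays off fully once the scan itself is fast, which is one reason the repository and network tuning is worth doing alongside the interval change.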

Finally, if you need to move the repository, here’s how to do it:
franticfilms.com/software/su … moving.php

Cheers,

Ryan

anon5390329 · June 22, 2009, 12:09pm

Hi Russell,

Thanks for the tips. I’ve adjusted the Slave settings as suggested and I’m trying more reasonable values for the Pulse timeouts as well. I’ll let you know as things develop.

Leo