A problem that’s been dogging us for the past little while has been that a lot of the render farm is sitting idle while jobs are queued in Deadline. The jobs have valid pool assignments with respect to the idle machines, but sometimes it takes an hour for the tasks of these jobs to start rendering on any machine in the farm. Archiving of completed jobs is happening hourly, and no job goes without archiving for longer than 48 hours. We will be migrating the repository to SSD drives in a week or so, but is there anything else we can do to make the farm more efficient in picking up available tasks?
How many archived jobs do you have on the farm? Archived jobs impact Pulse’s performance when it searches for jobs for slaves, so having too many of them can cause Pulse to take too long to respond. When that happens, the slaves fall back to searching for jobs themselves, which slows things down overall.
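If you want a quick count, a minimal sketch is below. It assumes each job lives in its own subfolder under the repository, and the folder names (`jobs`, `archivedJobs`) and the repository path are assumptions — check them against your actual repository layout before relying on the numbers.

```python
import os

def count_job_folders(path):
    """Count the immediate subdirectories of a repository folder.

    Each subdirectory is assumed to hold one job; the folder names
    used below ("jobs", "archivedJobs") are assumptions -- verify
    them against your own repository layout.
    """
    if not os.path.isdir(path):
        return 0
    return sum(
        1 for name in os.listdir(path)
        if os.path.isdir(os.path.join(path, name))
    )

if __name__ == "__main__":
    repo = r"\\server\DeadlineRepository"  # hypothetical repository path
    print("active jobs:  ", count_job_folders(os.path.join(repo, "jobs")))
    print("archived jobs:", count_job_folders(os.path.join(repo, "archivedJobs")))
```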
These issues have been resolved in Deadline 6, but there are a few things you can do to deal with this for now.
- In your Repository Options, disable the ability for slaves to look for jobs themselves if they have issues connecting to Pulse. You can find this under Slave Settings in the repo options. With this disabled, if Pulse falls behind, the slaves won’t compound the problem.
- If you don’t need to archive jobs, set them to auto-delete after 48 hours instead of auto-archiving. If you do need to archive them, consider configuring your Repository Options to delete archived jobs after a week or so to keep things clean.
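If you end up scripting that archived-job cleanup yourself, the age check can be as simple as the sketch below. The idea of deleting by file modification time, and the assumption that each archived job sits directly under one archive folder, are mine — verify both against your repository before pointing this at real data.

```python
import os
import shutil
import time

MAX_AGE_DAYS = 7  # keep archived jobs for about a week

def is_expired(mtime, now, max_age_days=MAX_AGE_DAYS):
    """True if a timestamp is older than the retention window."""
    return (now - mtime) > max_age_days * 86400

def prune_archived(archive_dir, now=None):
    """Delete archived-job entries older than MAX_AGE_DAYS.

    Assumes each archived job is a file or folder directly under
    archive_dir -- confirm that before running this for real.
    """
    now = time.time() if now is None else now
    removed = []
    for name in os.listdir(archive_dir):
        path = os.path.join(archive_dir, name)
        if is_expired(os.path.getmtime(path), now):
            if os.path.isdir(path):
                shutil.rmtree(path)
            else:
                os.remove(path)
            removed.append(name)
    return removed
```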
That should help.
Cheers,
- Ryan
Last week a script was deployed that automatically cleans up completed/suspended/failed jobs on a more aggressive schedule, such that the total number of non-archived jobs in the repository is around 1000, and never higher than 2000. Slaves have been disabled from searching for jobs since before that script was deployed.
During this period, we’re still seeing underutilization of the farm, as well as long delays (30 minutes or more) before some jobs get picked up for rendering, even though machines with valid pool assignments for them are sitting idle. Is there anything else we can do? We’re using Deadline 5.
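For reference, the core of our cleanup script boils down to a selection step like the one sketched below: given (job_id, finished_time, status) tuples, pick the oldest completed/suspended/failed jobs to remove until at most `cap` remain. The function and status names here are illustrative, not Deadline API calls.

```python
# Statuses we consider safe to clean up; active/queued jobs are never touched.
DONE_STATUSES = {"Completed", "Suspended", "Failed"}

def jobs_to_remove(jobs, cap=1000):
    """Return the ids of the oldest finished jobs that must be removed
    so that at most `cap` jobs remain in the repository.

    `jobs` is a list of (job_id, finished_time, status) tuples.
    """
    if len(jobs) <= cap:
        return []
    removable = sorted(
        (j for j in jobs if j[2] in DONE_STATUSES),
        key=lambda j: j[1],  # oldest finished jobs first
    )
    excess = len(jobs) - cap
    return [j[0] for j in removable[:excess]]
```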
Strange…
If you don’t already have it enabled, can you enable Slave Verbose Logging in the Application Logging section of the Repository Options? Note that after enabling it, it can take the slaves up to 10 minutes to recognize the change; you can restart the slaves to have them pick it up immediately.
Then the next time a slave is sitting idle when it should be rendering, send us the slave’s log and we’ll have a look.
Also, it wouldn’t hurt to enable Pulse Verbose Logging too, and send us the Pulse log. We can check whether there is any indication that it is having trouble servicing the slaves.
Finally, which version of 5 are you running? Version 5.2.49424, which was released in December, fixed a key bug in Pulse that occurred when the same slave tried to make a second connection to it. Before this release, Pulse would throw an error and then fail to respond to the new connection properly. If you are running an earlier 5.x release, that could explain why it takes the slaves so long to eventually get a job.
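If it helps, a quick way to check whether your version string predates the fix is a simple tuple comparison like this (5.2.49424 is the release mentioned above; the helper names are just a sketch):

```python
FIXED_VERSION = (5, 2, 49424)  # release containing the Pulse reconnect fix

def parse_version(version):
    """Turn a dotted version string like '5.2.49424' into an int tuple."""
    return tuple(int(part) for part in version.split("."))

def has_pulse_fix(version):
    """True if this Deadline version includes the Pulse reconnect fix."""
    return parse_version(version) >= FIXED_VERSION
```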
Cheers,
- Ryan
What is your setting for the Repository Scan Interval in the Pulse Settings?