Your favourite Deadline improvement scripts - how to improve reliability and stability

hey guys,

I am trying to make our Deadline workflow more reliable and less failure-prone.

We have already fixed some small issues and built a number of workaround features:

  • Sometimes a task of a job freezes indefinitely at 99% job progression.
    -> We fixed this by integrating a script into the house cleaning mechanism that checks job progression and task rendering time, and auto-requeues tasks which take too long.
  • Sometimes nodes start producing errors and RAM runs full.
    -> So we integrated a house cleaning script that restarts our nodes after they have rendered a certain number of tasks.
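For anyone curious what the second workaround boils down to, here is a minimal sketch of the "restart after N tasks" decision logic. It's deliberately decoupled from Deadline: a real house cleaning script would pull the per-node task counts from the repository, and the threshold and function names here are made up for illustration.

```python
# Hypothetical sketch of the "restart nodes after N rendered tasks" check.
# In a real Deadline house cleaning script the counts would come from the
# repository; here they are passed in so the logic stands on its own.

TASKS_BEFORE_RESTART = 200  # assumed threshold, tune for your farm

def nodes_due_for_restart(tasks_rendered_by_node, threshold=TASKS_BEFORE_RESTART):
    """Return the names of nodes whose rendered-task count hit the threshold."""
    return sorted(
        node for node, count in tasks_rendered_by_node.items()
        if count >= threshold
    )
```

The actual restart (e.g. marking the node for a reboot once it goes idle) would then be triggered for each returned node.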

I wrote this post to ask you guys:

What is a skript you integrated into deadline which feels like a necessary workflow and reliability and stability improvement to you?

I am looking for some nice advices and tips to share and discuss. Also I am open to share approaches how we solved our problems.


To deal with your first issue we do have automatic timeouts that can be applied to jobs.

They’ll be generated dynamically based on the completion time of the other tasks in the job. You can read more about them in the docs here.
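For readers who want the gist without the docs: a dynamic timeout of this kind can be derived from the completion times of the tasks that have already finished in the same job. The sketch below is my own illustration of that idea, not Deadline's actual algorithm; the multiplier and floor are invented knobs.

```python
from statistics import median

def dynamic_timeout(completed_task_minutes, multiplier=3.0, floor_minutes=10.0):
    """Derive a per-task timeout from the completion times of already
    finished tasks in the job: a multiple of the median, with a lower bound."""
    if not completed_task_minutes:
        return None  # nothing has finished yet, so there is no basis for a timeout
    return max(floor_minutes, multiplier * median(completed_task_minutes))

def tasks_to_requeue(running_task_minutes, completed_task_minutes):
    """Return the ids of running tasks that exceeded the dynamic timeout."""
    timeout = dynamic_timeout(completed_task_minutes)
    if timeout is None:
        return []
    return [task_id for task_id, minutes in running_task_minutes.items()
            if minutes > timeout]
```

With a job whose finished tasks took around 10 minutes each, a task stuck at 99% for 40 minutes would be flagged for requeue.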

As for the second issue, assuming those errors are unavoidable, that’s a nice solution!

Great topic! Looking forward to seeing some tips.

Two fairly big dev additions on our side drive a lot of how we work with Deadline.

  1. A read-only search API that sits on top of the Deadline database.
    It's written as a Python Flask app.
    The core things we get from it are:
  • the ability to search by any of the job or plugin properties, so we can look up a job by name or Nuke comp version and kill it if there is a newer version that an artist needs;
  • getting all the dependencies of a job ID, or checking whether the current job is a dependency of something else. This is useful when auto-removing jobs: you can find the one you want and remove all the dependencies in its tree as well;
  • some complex lookups too, like all the jobs by user, plugin and project;
  • some of the lookups also help drive our statistics collection and error aggregation.
  2. Super-fine-detailed error collection and aggregation,
    which I have been trying to turn into a public project for a while now.
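The dependency-tree part of the search API is the bit people usually ask about, so here's a toy version of the walk. It assumes a plain in-memory mapping instead of the Mongo-backed lookup, and the names are mine, not from our actual codebase.

```python
def dependency_tree(job_id, dependents):
    """Collect job_id plus every job that transitively depends on it.

    `dependents` maps a job id to the ids of jobs that list it as a
    dependency, i.e. the reverse of the dependency field on each job.
    Useful for "remove this job and everything hanging off it".
    """
    seen = set()
    stack = [job_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited, avoids loops in a malformed graph
        seen.add(current)
        stack.extend(dependents.get(current, []))
    return seen
```

In the real API this runs server-side and the result feeds the bulk-remove endpoint.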

So using the OnTaskFailEvent, we collect the basic info:
slavename, plugin, errorMessage, task_id, job_id, job_name

Every single job, when submitted, goes through a Python class that adds context info to the ExtraInfo fields, so we collect 2 extra pieces of information from the job:

In the event code, we then add an extra field called an error_tag.
If the error message contains ".py", we scan it for all the known Python error types, and the matching type becomes the error tag.
Simple things, like: is it a licensing error, is it permissions, is Maya missing, is a texture missing? These are examples of what the error_tag can be. It also gives us something else to aggregate fails on.
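A stripped-down version of that tagging step might look like the following. The keyword patterns and tag names here are illustrative placeholders, not our production list.

```python
import re

# Ordered keyword rules for the coarse categories; first match wins.
# These patterns are examples only, not the real production rules.
TAG_RULES = [
    ("license", re.compile(r"license|licence|rlm|flexlm", re.I)),
    ("permissions", re.compile(r"permission denied|access is denied", re.I)),
    ("maya_missing", re.compile(r"maya(\.exe)? not found|no such file.*maya", re.I)),
    ("missing_texture", re.compile(r"texture.*(not found|missing)|missing texture", re.I)),
]

# Matches Python exception class names like ImportError, RuntimeError, ...
PY_ERROR = re.compile(r"\b([A-Z]\w*(?:Error|Exception))\b")

def error_tag(message):
    """Classify a raw error message into a coarse tag.

    If the traceback mentions a .py file, prefer the Python exception
    type found in it; otherwise fall back to the keyword rules."""
    if ".py" in message:
        m = PY_ERROR.search(message)
        if m:
            return m.group(1)
    for tag, pattern in TAG_RULES:
        if pattern.search(message):
            return tag
    return "unknown"
```

Anything that ends up as "unknown" is a candidate for a new rule, which is how the rule list grows over time.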

Before we add the entry to the database, we check the tag to see if it matches something we should auto-requeue, or send as an email to the IT department to fix.

Otherwise it gets added to the database.
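The routing step is just a lookup on the tag. A minimal sketch, with hypothetical tag sets (which tags are retry-worthy is entirely farm-specific):

```python
# Hypothetical tag -> action dispatch, run just before the entry is stored.
AUTO_REQUEUE_TAGS = {"license", "missing_texture"}  # often transient, worth a retry
NOTIFY_IT_TAGS = {"permissions", "maya_missing"}    # needs a human to fix

def action_for(tag):
    """Decide what to do with a failed task based on its error tag."""
    if tag in AUTO_REQUEUE_TAGS:
        return "requeue"
    if tag in NOTIFY_IT_TAGS:
        return "email_it"
    return "store"  # default: just record it in the database
```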

Another Flask app displays all the information on some dynamically generated pages, but we also have a bunch of prebuilt menus.

The end result of this story, without boring you all: it allows you to find a Redshift texture problem and the 200 fails caused by it.
You can fix the texture, then in the web GUI either requeue the selected tasks or the selected jobs.

On average we have 45-50K jobs on the farm, so this stuff is pretty powerful.

On the statistics side, we collect per minute: running jobs per episode, per project, and per plugin; the current burn rate; queued jobs per project, per episode, and per plugin; and fails per project, per plugin, and per episode.
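The per-dimension fail counting is nothing fancy; conceptually it's a handful of counters over the fail records. A sketch, assuming each record is a dict with hypothetical project/plugin/episode keys:

```python
from collections import Counter

def aggregate_fails(fail_records):
    """Count fails per project, per plugin, and per episode.

    `fail_records` is a list of dicts with (hypothetical)
    'project', 'plugin' and 'episode' keys."""
    per_project, per_plugin, per_episode = Counter(), Counter(), Counter()
    for rec in fail_records:
        per_project[rec["project"]] += 1
        per_plugin[rec["plugin"]] += 1
        per_episode[rec["episode"]] += 1
    return {"project": per_project, "plugin": per_plugin, "episode": per_episode}
```

A scheduled job runs something like this every minute and writes the snapshot to the stats collection.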

At some point I want to migrate all the statistics data to InfluxDB/Grafana; right now it's all MongoDB/Google Charts, and some of the graphs, when you display a month of data, end up as a heavy HTML file.

I'm happy to answer questions on this stuff.