Two fairly big dev additions on our side that drive a lot of how we work with Deadline.
- A read-only search API that sits on top of the Deadline database.
It's written as a Python Flask app.
The core things we get from it are:
- Ability to search by any of the job or plugin properties, so we can look up a job by name or Nuke comp version and kill it if there is a newer version that an artist needs.
- Get all the dependencies of a job ID, or check whether the current job is a dependency of something else. This is useful when auto-removing jobs: you can find the one you want and remove all the dependencies in its tree as well.
- It allows some complex lookups too, like all the jobs by user, plugin, and project.
- Some of the lookups help drive our statistics collection and error aggregation.
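To make the first two points concrete, here is a minimal sketch of the kind of logic such a read-only search API might contain. The field names, the whitelist, and the flat job-to-dependencies map are all assumptions for illustration, not the actual Deadline schema or our code:

```python
# Sketch of the search API's core logic. Field names ("name", "plugin",
# etc.) and the dependency-map shape are illustrative assumptions.

def build_job_filter(params):
    """Translate query-string style params into a MongoDB filter dict."""
    allowed = {"name", "plugin", "user", "project"}  # whitelist keeps it read-only and safe
    filt = {}
    for key, value in params.items():
        if key in allowed:
            # case-insensitive substring match, e.g. name=comp_v012
            filt[key] = {"$regex": value, "$options": "i"}
    return filt

def collect_dependency_tree(job_id, deps_by_job):
    """Walk a job_id -> [dependency job_ids] map and return every job
    in the tree, so the whole chain can be removed together."""
    seen, stack = set(), [job_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(deps_by_job.get(current, []))
    return seen
```

In a Flask route, `build_job_filter(request.args)` would be handed straight to the jobs collection's `find()`; the whitelist is what keeps arbitrary query params from touching anything they shouldn't.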
- Error collection and super fine-grained error aggregation,
which I have been trying to turn into a public project for a while now.
So using the OnTaskFailEvent, we collect the basic info:
slavename, plugin, errorMessage, task_id, job_id, job_name
Every single job, when submitted, goes through a Python class that adds context info to the extra info fields,
so we collect two extra pieces of context from the job.
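The two extra fields aren't named here; purely for illustration, assume they are project and episode (the dimensions the statistics further down aggregate on). A submission wrapper along these hypothetical lines could stamp them into the job's extra info key/value entries:

```python
# Hypothetical submission wrapper. The "project" and "episode" field
# names are assumptions for illustration, not the actual studio keys.

class JobSubmitter:
    def __init__(self, project, episode):
        self.project = project
        self.episode = episode

    def build_job_info(self, base_info):
        """Return a copy of the Deadline job info dict with the
        context stamped into extra info key/value fields."""
        info = dict(base_info)
        info["ExtraInfoKeyValue0"] = f"project={self.project}"
        info["ExtraInfoKeyValue1"] = f"episode={self.episode}"
        return info
```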
In the event code, we then add an extra field, called an error_tag.
So if the error_message contains “.py”, we scan it for all the known Python error types, and the match becomes the error_tag.
Simple things, like: is it a license error, is it permissions, is Maya missing, is a texture missing; these are examples of what the error_tag captures. It also gives us something else to aggregate failures on.
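A sketch of how that tagging could look; the patterns, tag names, and rule ordering here are invented examples, not the actual rule set:

```python
import re

# Ordered (pattern, tag) rules; first match wins. All patterns and
# tag names below are illustrative examples, not the real rules.
TAG_RULES = [
    (r"license (?:error|not available)", "license"),
    (r"permission denied",               "permissions"),
    (r"maya(?:\.exe)? not found",        "maya_missing"),
    (r"could not (?:find|load) texture", "missing_texture"),
]

PYTHON_ERROR_RE = re.compile(r"\b([A-Za-z]+Error)\b")

def tag_error(error_message):
    """Return an error_tag for a failure message, or 'untagged'."""
    # If a Python file is involved, prefer the raised exception type
    # (ImportError, IOError, ...) as the tag.
    if ".py" in error_message:
        match = PYTHON_ERROR_RE.search(error_message)
        if match:
            return match.group(1)
    for pattern, tag in TAG_RULES:
        if re.search(pattern, error_message, re.IGNORECASE):
            return tag
    return "untagged"
```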
Before we add the entry to the database, we check the tag to see if it matches something we should auto-requeue, or something we should email to the IT department to fix.
Otherwise it gets added to the database.
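That pre-insert check could be a small dispatch like the following; the tag sets and the handler hooks are assumptions for illustration, and this sketch stores every entry either way so the aggregation still sees the handled failures:

```python
# Tags that trigger an action before the record is stored; the sets
# and the injected handler callables are illustrative assumptions.
AUTO_REQUEUE_TAGS = {"license", "network_glitch"}
EMAIL_IT_TAGS = {"permissions", "disk_full"}

def dispatch_error(entry, requeue_task, email_it, store):
    """Route a collected error entry based on its tag, then store it."""
    tag = entry.get("error_tag", "untagged")
    if tag in AUTO_REQUEUE_TAGS:
        requeue_task(entry["job_id"], entry["task_id"])
    elif tag in EMAIL_IT_TAGS:
        email_it(entry)
    store(entry)  # every failure still lands in the database
```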
Another Flask app displays all the information in dynamically generated pages, plus a bunch of prebuilt menus.
The end of this story, without boring you all: it lets you find a Redshift texture problem and the 200 failures caused by it.
You can fix the texture, then in the web GUI set either the selected tasks or the selected jobs to requeue.
On average we have 45-50K jobs on the farm, so this stuff is pretty powerful.
On the statistics side, we collect, per minute: running jobs per episode, per project, and per plugin; current burn rate; queued jobs per project, per episode, and per plugin; and failures per project, per plugin, and per episode.
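A per-minute snapshot along these lines could produce the documents that go into the stats collection; the flat job-dict shape and key names are assumed for illustration:

```python
from collections import Counter
from datetime import datetime, timezone

def snapshot_stats(jobs):
    """Aggregate one per-minute stats document from a list of job
    dicts, each assumed to carry status/project/episode/plugin keys."""
    doc = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "running_by_project": Counter(),
        "running_by_episode": Counter(),
        "running_by_plugin": Counter(),
        "queued_by_project": Counter(),
        "failed_by_project": Counter(),
    }
    for job in jobs:
        status = job["status"]
        if status == "running":
            doc["running_by_project"][job["project"]] += 1
            doc["running_by_episode"][job["episode"]] += 1
            doc["running_by_plugin"][job["plugin"]] += 1
        elif status == "queued":
            doc["queued_by_project"][job["project"]] += 1
        elif status == "failed":
            doc["failed_by_project"][job["project"]] += 1
    return doc
```

One such document per minute is cheap to insert into MongoDB, and each extra breakdown is just another Counter in the dict.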
At some point I want to migrate all the statistics data to InfluxDB/Grafana; right now it's all MongoDB/Google Charts, and some of the graphs become a heavy HTML file when you display something like a month of data.
I'm happy to answer questions on this stuff.