Python : Delete Errors + reset to zero

justin · September 24, 2020, 1:46pm

Hi all.

I am writing a utility script to speed up some of the tasks I perform regularly. At the moment I need to blacklist any slaves that have failed a task, delete all error reports, and resume the failed job.

I am aware that Deadline already has a system that does roughly the same thing, but this has been a great opportunity to learn more about writing scripts in Deadline.

I’ve got the bones worked out. At the moment the stumbling block is that I cannot reset the error counter to zero: I am able to delete all error reports on the jobs themselves, but the error counter doesn’t change. Any help would be greatly appreciated!

Here’s what I have so far:

import os
import sys

from Deadline.Scripting import *
from Deadline.Jobs import *

from DeadlineUI.Controls.Scripting.DeadlineScriptDialog import DeadlineScriptDialog


def __main__(*args):
    selectedJobs = MonitorUtils.GetSelectedJobs()

    if len(selectedJobs) > 0:
        # collect bad slaves on all jobs + blacklist them on all selected jobs.
        slaves_to_blacklist = []
        for job in selectedJobs:
            for task in RepositoryUtils.GetJobTasks(job, True):
                # If the task has failed, add its slave to the bad list (once).
                if task.TaskStatus == 'Failed' and task.TaskSlaveName not in slaves_to_blacklist:
                    slaves_to_blacklist.append(task.TaskSlaveName)

        # Iterate over all the jobs and set the slaves to blacklist.
        for job in selectedJobs:
            if job.JobStatus == 'Failed':
                RepositoryUtils.AddSlavesToMachineLimitList(job.ID, slaves_to_blacklist)
                RepositoryUtils.DeleteAllJobReports(job.ID)
                RepositoryUtils.ResumeFailedJob(job)

Justin_B · October 2, 2020, 12:44pm

That looks like it should be working!

I haven’t tried this, but if you add RepositoryUtils.SaveJob(job) after you run DeleteAllJobReports(job.ID), does that make a difference?

justin · October 8, 2020, 7:48am

Hey Justin, thanks for the response!

Sadly, it doesn’t look like that worked. We are on an older Deadline version, if that helps at all:

Deadline Client Version: 10.0.24.4
Repository Version: 10.0.24.4

Justin_B · October 8, 2020, 5:38pm

Not sure - I’m getting the same behavior with your script on 10.1.9.

I can’t figure out a way around this, even after looking at how the Monitor does it. I suppose you could instead re-queue the job after adding those workers to the deny list.
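Here is a rough, untested sketch of that re-queue idea. It assumes RepositoryUtils.RequeueJob(job) is available in your Deadline version, and it reuses the calls from your script as-is (GetJobTasks, AddSlavesToMachineLimitList, job.ID). The helper names failed_worker_names and denylist_and_requeue, and the repo parameter, are made up for illustration; repo is meant to be RepositoryUtils, passed in so the logic can be tried outside the Monitor.

```python
def failed_worker_names(tasks):
    """Return the unique worker names of failed tasks, in first-seen order."""
    names = []
    for task in tasks:
        name = task.TaskSlaveName
        # Skip empty names and workers that are already collected.
        if task.TaskStatus == "Failed" and name and name not in names:
            names.append(name)
    return names


def denylist_and_requeue(jobs, repo):
    """Deny-list failing workers on every failed job, then re-queue it.

    `repo` is assumed to be Deadline's RepositoryUtils (hypothetical
    parameter; RequeueJob is assumed to exist in your version).
    Returns the IDs of the jobs that were re-queued.
    """
    # Gather failing workers across all selected jobs.
    workers = []
    for job in jobs:
        for name in failed_worker_names(repo.GetJobTasks(job, True)):
            if name not in workers:
                workers.append(name)

    requeued = []
    for job in jobs:
        if job.JobStatus == "Failed":
            if workers:
                repo.AddSlavesToMachineLimitList(job.ID, workers)
            repo.RequeueJob(job)
            requeued.append(job.ID)
    return requeued
```

Inside a Monitor script you would call it as denylist_and_requeue(MonitorUtils.GetSelectedJobs(), RepositoryUtils). Bear in mind that re-queuing may put already completed tasks back in the queue as well, so it is heavier than resuming only the failed ones.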

Additionally (and you may already be doing this), you could automatically add workers to the deny list after they’ve generated some number of errors. Set that in the Monitor under Tools -> Configure Repository Options -> Job Settings -> Failure Detection.

That would at least stop problem machines from generating a huge number of errors on your jobs.