Manually adding an Error to a Task via script - I need you fam :)

Hey guys,

It's Mr. Hazelnut,

Since Deadline offers a feature to set tasks to a failed status after a specific number of errors has occurred, is it also possible to add an error to a task manually via script?

like :

if activeRenderingTask.time > 2 hours
task Error ++
add Task to RequeueList
Requeue all Tasks in RequeueList

and in the end, the built-in Deadline feature takes care of setting the task to failed after X errors.

The only alternative I see is a brute-force approach: store a counter per task ID as a value, increment it on each error, compare it against a limit, and set taskStatus = failed once the limit is reached. But that all happens in the background, and there is no real way of keeping an eye on it via the log or any UI.
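For what it's worth, a minimal sketch of that brute-force counter is below. It only uses plain Python; the file path, limit and the record_task_error helper are made up for illustration and are not part of any Deadline API, and actually failing the task would still have to be wired up separately:

import json
import os

COUNTER_FILE = "/path/to/task_error_counts.json"  # hypothetical shared location
ERROR_LIMIT = 3  # hypothetical limit, analogous to the built-in error limit

def record_task_error(job_id, task_id):
    # Load the existing counters (one JSON object keyed by "<job>:<task>").
    counts = {}
    if os.path.exists(COUNTER_FILE):
        with open(COUNTER_FILE) as f:
            counts = json.load(f)

    key = "%s:%s" % (job_id, task_id)
    counts[key] = counts.get(key, 0) + 1

    with open(COUNTER_FILE, "w") as f:
        json.dump(counts, f)

    # The caller decides what to do when the limit is hit
    # (e.g. fail or requeue the task).
    return counts[key] >= ERROR_LIMIT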

Thanks a lot in advance

yours,
Mr. Hazelnut

From what I can find in the scripting docs, there’s nothing about increasing the error count. I suppose you could throw your own error - but I’ll need to do some testing to see exactly how it would work.

Now, if your example is exactly what you're trying to do, there is the job property TaskTimeoutSeconds, which is "the number of seconds a Worker has to render a task for this job before an error is reported, and the task is requeued. Specify 0 for no limit." (That's lifted from the docs here, though I did quote the name as it appears in the JobInfo file.)
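For your 2-hour example that would just be one extra line in the job's JobInfo file at submission time (7200 seconds = 2 hours):

TaskTimeoutSeconds=7200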

If TaskTimeoutSeconds doesn't do what you need, let me know and I'll try to find some time in my calendar for experimenting.

Hi @Heinrich_Haselnuss,

As @Justin_B has suggested, TaskTimeoutSeconds is the easiest way to get the custom timeout.

In order to capture the jobs to requeue later, you could create an OnJobTaskFail event.

This is a snippet taken from our OnJobTaskFail event; it should give you enough of a direction to go in:

from System import *

from Deadline.Events import *
from Deadline.Scripting import *

def GetDeadlineEventListener():
    return error_db_collect_failed()


def CleanupDeadlineEventListener(eventListener):
    eventListener.Cleanup()


class error_db_collect_failed(DeadlineEventListener):
    '''
        When a job is marked as completed or is deleted, remove all of its error entries in the db.
        This part is responsible for cleaning up the db.
    '''

    def __init__(self):
        self.OnJobErrorCallback += self.OnJobTaskFail

    def Cleanup(self):
        del self.OnJobErrorCallback

    # wrappers
    def OnJobTaskFail(self, job, task, report):
        # self.LogInfo("OnJobTaskFail:: OnJobFinished. %s" % job.ID)
        
        error_tag = ''
        job_id = job.JobId
        job_name = job.JobName

        errorMessage = report.ReportError
        errorMessage = errorMessage.strip()


        # tag timeouts:
        if 'The Slave did not complete the task before the Regular' in errorMessage:
            error_tag = 'SLAVE_TIMEOUT'
            # do something with the job_id and job_name

We forward the job fails off to a MongoDB, where we store each task under the same document ID that the Deadline db uses for tracking tasks. However, this could just as well be a text file that you append job IDs to, as in the sketch below.
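To illustrate the simpler text-file variant, here is a minimal sketch, assuming a shared path that every Worker can write to (the path and the append_to_requeue_list helper are made up for illustration and not part of the Deadline API). You would call it from the OnJobTaskFail handler above in place of the MongoDB write:

import os

REQUEUE_LIST = "//server/deadline/requeue_list.txt"  # hypothetical shared path

def append_to_requeue_list(job_id, task_id, error_tag):
    # One line per failed task: "<job id> <task id> <tag>".
    with open(REQUEUE_LIST, "a") as f:
        f.write("%s %s %s\n" % (job_id, task_id, error_tag))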

The idea behind the error_tag variable set in our event is that it lets us aggregate errors very quickly, and hence be able to requeue tasks or jobs quickly.
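For the requeue side, a rough sketch of a small script that reads the collected IDs back and requeues the matching jobs could look like the following. It assumes the RepositoryUtils.GetJob and RepositoryUtils.RequeueJob calls from the Deadline scripting API (check the scripting docs for your version) and reuses the hypothetical text file from above; requeueing individual tasks rather than whole jobs would need the task-level equivalents instead:

from Deadline.Scripting import RepositoryUtils

REQUEUE_LIST = "//server/deadline/requeue_list.txt"  # same hypothetical file

def __main__(*args):
    # Collect the unique job IDs written out by the event handler.
    with open(REQUEUE_LIST) as f:
        job_ids = {line.split()[0] for line in f if line.strip()}

    for job_id in job_ids:
        job = RepositoryUtils.GetJob(job_id, True)  # True = don't use the cached copy
        if job is not None:
            RepositoryUtils.RequeueJob(job)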

Happy to elaborate further on our setup, but I think this should get you to where you want to go.

Hope this helps.
Kym
