AWS Thinkbox Discussion Forums

Wishlist blacklisting tools

We have some ongoing problems with our IT department and permissions that cause intermittent failures of Maya renders. I am finding that there are some features that would be really useful to have within Deadline to manage such situations:

  1. It would be useful to have the ID of a failed node remain visible under the “Slave” column, in the same way that currently rendering tasks show it, until another node picks up the task. The only way to ID the failed machine I’m aware of is to open the task report. I’m constantly being asked to ID failed nodes, and this would make it much more transparent.

  2. The easy ability to blacklist failed nodes from the task list. It seems like the only way to blacklist currently is to View Task Report for each and every packet, and RMB blacklist failed nodes from there. It would be much easier if this could be done from the task list directly.

  3. Is there no feature to simply blacklist all failed nodes on any job while it is still running?

  4. Is there no feature to list all failed nodes in a job until after the job is completed and you can look at history?

If any of these features already exist I’d love to know about them.

Sounds like you’re looking for this? docs.thinkboxsoftware.com/produ … ction.html

Yarp, that’s the one I was going to recommend. :slight_smile:

That addresses some of the points I’ve made, but not all of them. Slave failure detection is of course all well and good, but still requires that individual slaves fail before they are blacklisted.

This also doesn’t provide the immediate transparency of failed slaves I mentioned, which it would be useful to see in the Task List.

Also, since we are using Maya and the Maya submission script, there is no easy way to set the slave failure count without first launching the job (that I’m aware of).

And we are using multiple applications here (Nuke, Maya, AE, Cinema) and keeping track of which failures pertain to which applications is also difficult. I’m only interested in failures on my specific jobs, so having feedback that is a little more job-centric would be helpful.

IMHO the menu set within Deadline could be streamlined a little more for direct use by artists, since it’s frequently incumbent on us to provide IT with detailed information about failures and problems. The assumption seems to be that Deadline is used by render wranglers or those who have time to do deep dives into the UI to figure out where all these settings and feedback live.

Hello,

If you want immediate feedback when errors occur, and you want to forward them as email, you could take advantage of the on task fail event:

###############################################################
# Imports
###############################################################
from System import *

from Deadline.Events import *
from Deadline.Scripting import *

import sys
import re

import datetime

import os

# these may not be available in your path; there might be better libs to use
import smtplib
from email.mime.text import MIMEText  # needed for MIMEText() below



##################################################################################################
# This is the function called by Deadline to get an instance of the event listener.
##################################################################################################
def GetDeadlineEventListener():
    return on_task_fail()


def CleanupDeadlineEventListener(eventListener):
    eventListener.Cleanup()


###############################################################
# The event listener class.
###############################################################
class on_task_fail(DeadlineEventListener):
    '''
        When a task fails, log the failure details, optionally disable the slave,
        and email the details to the addresses in email_to.
    '''

    def __init__(self, ):
        self.OnJobErrorCallback += self.OnJobTaskFail

    def Cleanup(self):
        del self.OnJobErrorCallback

    # wrappers
    def OnJobTaskFail(self, job, task, report):
        # self.LogInfo("OnJobTaskFail:: OnJobFinished. %s" % job.ID)
        #
        self.LogInfo("OnJobTaskFail:: print out all the details we have access to")
        


        cur_time = datetime.datetime.now()
        extra_data = {'current_time': cur_time.strftime("%Y-%m-%d %H:%M:%S")}
        error_tag = 'FAIL'
        slave_name = task.TaskSlaveMachineName  # good
        errorMessage = report.ReportError
        errorMessage = errorMessage.strip()
        plugin_name = job.JobPlugin
        task_id = task.TaskId

        job_id = job.JobId
        job_name = job.JobName
        
        
        if errorMessage == '':
            # we have a method that will lookup the task log via rest/webservices, and get the last log line, if the event cant access it
            #errorMessage = .retrieve_last_error_from_rest(job_id, task_id)

            pass

        ####### remove the date stamp at the beginning of the error line:
        if errorMessage:
            errorMessage = errorMessage.strip()
            if '[' in errorMessage:
                match = re.search(r"\[.*?\]", errorMessage)
                # re.search returns None when there is no closing bracket, so check match first
                if match and match.group().startswith('[20'):
                    errorMessage = errorMessage.replace(match.group(), '').strip()



        self.LogInfo("OnJobTaskFail:: check that we have access to the information we need:")
        self.LogInfo("slave_name: %s " % slave_name)
        self.LogInfo("errorMessage: %s " % errorMessage)
        self.LogInfo("plugin_name: %s " % plugin_name)
        self.LogInfo("task: %s " % task_id)
        self.LogInfo("job_id: %s " % job_id)
        self.LogInfo("job_name: %s " % job_name)
        self.LogInfo("OnJobTaskFail:: finished running")
        # license error:
        if 'error: Could not obtain a license' in errorMessage:
            # hacky for now. need to make regex but... brain no work.
            bits = errorMessage.split('error: Could ')
            errorMessage = 'error: Could %s' % bits[-1]
            error_tag = 'LICENSE'

        if 'not found in the semicolon separated lis' in errorMessage:
            # maya is missing from the machine.
            error_tag = 'MAYA_MISSING'


        disabled = False
        # disable slave criteria, i.e. if you want certain error types to cause the slave to be disabled until they are fixed.
        if error_tag in ['MAYA_MISSING']:
            disabled = True

            slaveSettings = RepositoryUtils.GetSlaveSettings(slave_name, True)
            slaveSettings.SlaveEnabled = False
            # while not required, you may want to add a comment to the comment field to say why the slave was disabled, like:
            slaveSettings.SlaveComment = "Disabled till IT fix errors"
            RepositoryUtils.SaveSlaveSettings(slaveSettings)



        ##########################################################################################################
        
        email_to = ['theit_dept@mycompany.com']
        email_from = "Thefarm@mycompany" # this doesn't need to exist, but is useful if you want to filter on it.
        mail_subject  = "Slave: %s Error: %s" % (slave_name,error_tag)

        mail_msg = 'slave name: %s \n' % slave_name
        mail_msg = mail_msg + 'plugin_name: %s \n' % plugin_name
        mail_msg = mail_msg + 'job_id: %s \n' % job_id
        mail_msg = mail_msg + 'task_id: %s \n' % task_id
        mail_msg = mail_msg + 'job_name: %s \n' % job_name
        mail_msg = mail_msg + 'error_tag: %s \n' % error_tag
        mail_msg = mail_msg + 'raw error msg: %s \n' % errorMessage
        



        msg = MIMEText(mail_msg)
        msg['Subject'] = mail_subject
        msg['From'] = email_from
        msg['To'] = ', '.join(email_to)

        #print msg

        try:
            smtpObj = smtplib.SMTP(" < your smtp mail server> ", timeout=120)
            smtpObj.sendmail(email_from, email_to, msg.as_string())
            smtpObj.quit()
            print("Successfully sent email")
            return True
        except Exception:
            # catch Exception rather than using a bare except, so Ctrl-C and exits still propagate
            print("Error: unable to send email")
            return False

In a nutshell, this will run whenever a task fails and send an email with info to the folks in the <email_to> list.
This comes from an event that we rely on to aggregate failure information on our farm. I have removed the proprietary stuff and replaced it with an emailer, so you can spam your IT dept with as much info as you need :wink:
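The tagging logic in the event above can be tried outside Deadline. The following is only a minimal sketch of that part of the script, pulled into two plain functions so it can be tested on sample strings. The function names `strip_timestamp` and `classify_error` are made up for illustration; the match strings and tag names are the ones used in the script.

```python
import re

# Match strings and tag names copied from the event script; adjust to your own errors.
LICENSE_MARKER = 'error: Could not obtain a license'
MAYA_MISSING_MARKER = 'not found in the semicolon separated lis'


def strip_timestamp(error_message):
    """Remove the first bracketed token if it looks like a '[20xx...]' date stamp."""
    error_message = error_message.strip()
    match = re.search(r"\[.*?\]", error_message)
    # re.search returns None when no bracketed token exists
    if match and match.group().startswith('[20'):
        error_message = error_message.replace(match.group(), '').strip()
    return error_message


def classify_error(error_message):
    """Return (tag, cleaned_message) for a raw task error string."""
    message = strip_timestamp(error_message)
    tag = 'FAIL'
    if LICENSE_MARKER in message:
        # Keep only the text from the last 'error: Could ' onward.
        message = 'error: Could ' + message.split('error: Could ')[-1]
        tag = 'LICENSE'
    if MAYA_MISSING_MARKER in message:
        tag = 'MAYA_MISSING'
    return tag, message
```

Keeping this separate from the listener class means new error types can be added and checked against sample log lines before the event is deployed to the farm.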

This is a little more advanced than what you're after, but this kind of customizable plugin and event is why Deadline as a package is so amazing to use.
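As a rough illustration of the failure aggregation mentioned above, here is a small sketch that groups failure records by job and counts failures per slave. It runs on plain Python dicts; the record shape (`job_id`, `slave_name`) and the function name `failed_slaves_by_job` are assumptions for the example, not Deadline data structures. In practice the records would come from whatever the event writes out.

```python
from collections import defaultdict


def failed_slaves_by_job(failures):
    """Group failure records into {job_id: {slave_name: failure_count}}.

    `failures` is an iterable of dicts with 'job_id' and 'slave_name' keys
    (a hypothetical record shape used only for this sketch).
    """
    summary = defaultdict(lambda: defaultdict(int))
    for record in failures:
        summary[record['job_id']][record['slave_name']] += 1
    # Convert to plain dicts so the result prints and compares cleanly.
    return {job: dict(slaves) for job, slaves in summary.items()}


# Example records, as the event might collect them (made-up values).
sample_failures = [
    {'job_id': 'jobA', 'slave_name': 'render01'},
    {'job_id': 'jobA', 'slave_name': 'render01'},
    {'job_id': 'jobA', 'slave_name': 'render02'},
    {'job_id': 'jobB', 'slave_name': 'render01'},
]
summary = failed_slaves_by_job(sample_failures)
```

Looking up `summary['jobA']` then gives the failed slaves for that one job, which is the job-centric view asked for earlier in the thread.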

Hope this helps.

Cheers
Kym

Nice one Kym! I would just add on the email front: if you want to keep it simple, you can use this DeadlineCommand to send an email (which hooks into your already configured Deadline Repo email settings):

DeadlineCommand10 --help SendEmail

SendEmail
  Sends an email.
    [to <Email>]             TO email address
    [subject <Subject>]      The subject
    [message <Message>]      The message, or the path to the file that contains the message
    [cc <Email>]             CC email address (optional)
    [attach <Attachment>]    Attachment file (optional)

We ship an example of the above command in the “JobTransfer.py” plugin: “…/DeadlineRepository10/plugins/JobTransfer/JobTransfer.py”
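For calling SendEmail from a Python script, something like the sketch below might work. It is an assumption-laden example, not the code from JobTransfer.py: the executable name `deadlinecommand`, the dash-prefixed option form (`-SendEmail -to ... -subject ...`), and the helper names `build_send_email_args` and `send_email` are all assumptions, so check them against the help output on your own install.

```python
import subprocess


def build_send_email_args(to, subject, message, cc=None, attach=None,
                          executable="deadlinecommand"):
    """Build the argument list for a SendEmail call.

    The executable name and the '-option value' form are assumptions;
    the option names themselves come from the SendEmail help text.
    """
    args = [executable, "-SendEmail",
            "-to", to, "-subject", subject, "-message", message]
    if cc:
        args += ["-cc", cc]
    if attach:
        args += ["-attach", attach]
    return args


def send_email(to, subject, message, cc=None, attach=None):
    """Run the command and return its exit code (0 usually means success)."""
    return subprocess.call(build_send_email_args(to, subject, message, cc, attach))
```

A call such as `send_email("theit_dept@mycompany.com", "Slave: render01 Error: MAYA_MISSING", "details here")` could then replace the smtplib block in the event, so the mail server settings live only in the Repository configuration.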

Thanks! I will definitely forward this on to our IT guys.
