Problem with event scripts for requeue in Deadline Monitor. Is it possible?

yoinkobane · July 8, 2024, 3:37pm

Hello, I’ll explain my problem.

I know that to create event scripts in Deadline Monitor, you need two files: the .py and the .param files.

On my render farm, when a job (with its task IDs and selected frames) is sent to “test_pool” and renders for more than 15 minutes, I need it to be requeued automatically to avoid stalls, hangs, and similar issues. I am not sure this is feasible: I have tried several things, and although the event is detected, it doesn’t fully work.

I’ve been trying with things similar to this:

.py:

from Deadline.Events import DeadlineEventListener
from Deadline.Scripting import RepositoryUtils

def GetDeadlineEventListener():
    return CustomRetryEvent()

def CleanupDeadlineEventListener(deadlinePlugin):
    deadlinePlugin.Cleanup()

class CustomRetryEvent(DeadlineEventListener):
    def __init__(self):
        self.OnJobRunningCallback += self.OnJobRunning

    def Cleanup(self):
        del self.OnJobRunningCallback

    def OnJobRunning(self, job):
        # Specific nodes
        specific_nodes = ["render09"]
        # Specific pool
        specific_pool = "test_2d"
        # Time limit in seconds (45 minutes)
        time_limit = 45 * 60

        if job.JobPool.lower() == specific_pool:
            tasks = RepositoryUtils.GetJobTasks(job, True)
            for task in tasks:
                # Check whether the task is running on one of the specific nodes
                if task.TaskStatus == "Rendering" and task.TaskSlaveName.lower() in specific_nodes:
                    if task.TaskRenderTime.TotalSeconds > time_limit:
                        # Requeue the task
                        RepositoryUtils.RequeueTasks(job, [task])

.param:

EventName=CustomRetryEvent
ScriptFile=CustomRetryEvent.py
Enabled=True

Derek_E_Zavada · July 10, 2024, 4:47pm

Hi
I’ve reviewed your code and the documentation, but I couldn’t find an OnJobRunningCallback mentioned anywhere. In the past, I faced a similar situation and ended up writing an external watchdog program, which worked fine for me. However, I agree that running it internally would be easier and cleaner. You might want to try using OnJobStartedCallback or OnSlaveStartedCallback instead.

https://docs.thinkboxsoftware.com/products/deadline/10.3/1_User%20Manual/manual/event-plugins.html
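
As a rough illustration of the external watchdog idea, the selection logic could look like the sketch below. It is plain Python and does not call the Deadline API: `TaskInfo` and `tasks_to_requeue` are hypothetical names, and a real watchdog would have to fill these records from the farm on a timer and then requeue whatever the function returns.

```python
from dataclasses import dataclass


@dataclass
class TaskInfo:
    """Hypothetical stand-in for the per-task data a watchdog would fetch."""
    task_id: str
    status: str
    worker: str
    render_seconds: float


def tasks_to_requeue(tasks, limit_seconds, workers=None):
    """Return the tasks that are rendering and have exceeded the time limit.

    If `workers` is given, only tasks on those machines are considered
    (names are compared case-insensitively).
    """
    wanted = {w.lower() for w in workers} if workers else None
    return [
        t for t in tasks
        if t.status == "Rendering"
        and t.render_seconds > limit_seconds
        and (wanted is None or t.worker.lower() in wanted)
    ]
```

The important part is that the check runs repeatedly while tasks are rendering, since a task can only exceed the limit some time after it has started.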

yoinkobane · July 11, 2024, 9:50am

Hi, thank you for answering.

I changed it to OnJobStartedCallback, but it still doesn’t respond. I don’t know if the problem is in the .param file.

from Deadline.Events import DeadlineEventListener
from Deadline.Scripting import RepositoryUtils

def GetDeadlineEventListener():
    return CustomRetryEvent()

def CleanupDeadlineEventListener(deadlinePlugin):
    deadlinePlugin.Cleanup()

class CustomRetryEvent(DeadlineEventListener):
    def __init__(self):
        self.OnJobStartedCallback += self.OnJobStarted

    def Cleanup(self):
        del self.OnJobStartedCallback

    def OnJobStarted(self, job):
        # Specific pool
        specific_pool = "test_2d"
        # Time limit in seconds (5 minutes as written; the goal is 15)
        time_limit = 5 * 60

        if job.JobPool.lower() == specific_pool:
            tasks = RepositoryUtils.GetJobTasks(job, True)
            for task in tasks:
                # Check whether the task is running and has exceeded the time limit
                if task.TaskStatus == "Rendering":
                    if task.TaskRenderTime.TotalSeconds > time_limit:
                        # Requeue the task
                        RepositoryUtils.RequeueTasks(job, [task.TaskId])

.param:

EventName=CustomRetryEvent
ScriptFile=CustomRetryEvent.py
Enabled=True

Justin_B · July 11, 2024, 1:45pm

Would setting a job timeout automatically work for you?

I’d do it in an OnJobSubmitted event, and use this event as an example. The only issue is I’m not finding a way in the API to set the “OnTaskTimeout=Requeue” property on an already created job.

As for troubleshooting your existing event: your .param isn’t correct. Check out this page for an example and a list of valid options.

yoinkobane · July 11, 2024, 2:15pm

Hi Justin, thanks for the help.

The problem with job timeout is that it applies to everything, and I only want to apply it to a specific pool.

Justin_B · July 11, 2024, 2:20pm

Oh, you’re thinking of automatic job timeout, that’ll be set farm-wide. But the other one I posted can be set on a job-by-job basis.

You could do it manually in the Monitor by double-clicking the job, going to ‘timeouts’, and adjusting the settings there if you’d like to test first without having to write code.
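
If the timeout can be set when the job is submitted rather than afterwards, the per-pool rule might be sketched as below. This is only an illustration under assumptions: `job_info_lines` is a hypothetical helper, `Pool` and `TaskTimeoutMinutes` are assumed to be the job info key names used by manual job submission, and `OnTaskTimeout=Requeue` is taken from the post above. Check each against the documentation for your Deadline version.

```python
def job_info_lines(pool, timeout_pools, timeout_minutes=15):
    """Build job info lines, adding a task timeout only for selected pools.

    Key names are assumptions to verify against the Deadline docs.
    """
    lines = [f"Pool={pool}"]
    if pool.lower() in {p.lower() for p in timeout_pools}:
        # Only jobs in the listed pools get the per-task timeout.
        lines.append(f"TaskTimeoutMinutes={timeout_minutes}")
        lines.append("OnTaskTimeout=Requeue")
    return lines


# Example: only "test_2d" jobs receive the timeout settings.
for line in job_info_lines("test_2d", ["test_2d"]):
    print(line)
```

Jobs submitted to any other pool keep only the `Pool=` line, so the timeout stays scoped to the one pool instead of applying farm-wide.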

Derek_E_Zavada · July 11, 2024, 4:08pm

I would also add some logging so you can see what’s happening in the console.

def OnJobStarted(self, job):
    self.LogInfo(f"OnJobStarted {job.JobId}")
    # Specific pool
    specific_pool = "test_2d"
    # Time limit in seconds (5 minutes as written; the goal is 15)
    time_limit = 5 * 60

Sample .param file

[State]
Type=Enum
Items=Global Enabled;Opt-In;Disabled
Label=State
Default=Disabled
Description=Time out