Lowering a job's priority when it reaches the timeout limit. Is that possible?

nomte · November 13, 2019, 11:52am

Hi there!
I have been trying to find a solution for this. I want to lower the priority of jobs that hit the timeout limit, so that they still get done, but only after all regular jobs.
Any advice on this?

Thanks in advance,
Alo

nomte · November 19, 2019, 8:29am

If someone knows for certain that it is not possible, that would also be good to know.

Thanks!

Justin_B · November 19, 2019, 11:15pm

There isn’t a specific event for a job timing out, based on the list here. But there is a listener for failed jobs, which might be something you could use to build off of.

Otherwise there’s nothing jumping out at me. I’ll post back in here if something occurs to me.

panze · November 21, 2019, 7:11am

We use “OnHouseCleaningCallback” to check for stalled frames (frames at 0% progress that have run longer than the average frame time of the last 10 finished frames near them) and re-queue them. The same process also sends a warning message to our dedicated Slack channel if it detects frame times that exceed our set threshold (only once per job). You could use “OnHouseCleaningCallback” to dynamically change the job priority.
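
The stalled-frame check described above could look roughly like the following. This is only a minimal sketch of the comparison itself, not panze's actual script: the function name `is_stalled`, its parameters, and the sample frame times are all hypothetical, and fetching the real task data from Deadline is left out.

```python
from statistics import mean


def is_stalled(elapsed_seconds, progress_percent, recent_frame_seconds, window=10):
    """Return True if a task at 0% progress has already run longer than the
    average of the most recent finished frame times (hypothetical helper)."""
    recent = recent_frame_seconds[-window:]
    # Without any finished frames there is no average to compare against,
    # and a task that reports progress is not considered stalled.
    if not recent or progress_percent > 0:
        return False
    return elapsed_seconds > mean(recent)


# Made-up render times (in seconds) for the last 10 finished frames.
finished = [310, 295, 330, 305, 320, 300, 315, 290, 325, 310]

slow_task = is_stalled(900, 0, finished)   # far above the 310 s average
fresh_task = is_stalled(200, 0, finished)  # still below the average
print(slow_task, fresh_task)
```

A multiplier on the average (for example 2x) would make the check less trigger-happy; that is a tuning choice, not something stated in the post.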

nomte · November 22, 2019, 8:36am

Thank you Justin_B and panze! We are quite busy around here, but I’ll try both ideas ASAP and be back with the results. The error listener could be the solution.
To add some detail, the 0% progress cleaning is a good solution too, but on our farm we are running V-Ray and Corona. The problem is that Corona can’t report the progress of its tasks to Deadline, so all tasks look like they are at 0% until they are suddenly 100% done.

I’ll be back with results (or more questions hehe)

nomte · November 22, 2019, 5:13pm

Following the documentation and some samples from GitHub, I managed to make this script (you can laugh, coding is not my thing yet):

This is ReQueueMe.py at Repository\custom\events\ReQueueMe

from Deadline.Events import *

################################
#This is the function that Deadline calls to get an instance of the main DeadlineEventListener class.
################################

def GetDeadlineEventListener():
    return ReQueueMe()

################################
#This is the function that Deadline calls when the event plugin is no longer in use so that it can get cleaned up.
################################

def CleanupDeadlineEventListener( deadlinePlugin ):
    deadlinePlugin.Cleanup()

################################
#This is the main DeadlineEventListener class for ReQueueMe.
################################

class ReQueueMe(DeadlineEventListener):
    def __init__(self):
        # Set up the event callbacks here
        self.OnJobErrorCallback += self.OnJobError

    def Cleanup(self):
        del self.OnJobErrorCallback

    def OnJobError(self, job):
        # TODO: Connect to pipeline site to notify it that a job has been submitted
        # for a particular shot or task.
        new_TimeOut = self.GetConfigEntry("NewTimeout")
        new_Priority = self.GetConfigEntry("NewPriority")
        self.LogInfo( "ReQueueMe triggered" )
        job.JobTaskTimeoutSeconds = new_TimeOut * 60
        job.JobPriority = new_Priority
        RepositoryUtils.SaveJob(job)

This is ReQueueMe.param

[State]
Category=Options
CategoryOrder=0
CategoryIndex=1
Type=Enum
Items=Global Enabled;Opt-In;Disabled
Label=State
Default=Global Enabled
Description=How this event plug-in should respond to events. If Global, all jobs and Slaves will trigger the events for this plugin. If Opt-In, jobs and Slaves can choose to trigger the events for this plugin. If Disabled, no events are triggered for this plugin.

[NewTimeout]
Type=integer
Category=Options
CategoryOrder=0
CategoryIndex=2
Label=New Timeout
Default=0
Description=New time limit (in minutes) assigned to previously TimedOut jobs. 0 = No limit (default)
Maximum=50000
Minimum=0

[NewPriority]
Type=integer
Category=Options
CategoryOrder=0
CategoryIndex=3
Label=New priority
Default=1
Description=New priority assigned to the TimedOut jobs
Maximum=50
Minimum=0

However I have the following error:

2019-11-22 17:55:00: An error occurred in the “OnJobError” function in events plugin ‘ReQueueMe’: TypeError : OnJobError() takes exactly 2 arguments (4 given) (Python.Runtime.PythonException)
2019-11-22 17:55:00: (Deadline.Events.DeadlineEventPluginException)
2019-11-22 17:55:00: bei Deadline.Events.DeadlineEventPlugin.b(String aon, Exception aoo)
2019-11-22 17:55:00: bei Deadline.Events.DeadlineEventPlugin.OnJobError(Job job, Task task, Report errorReport)
2019-11-22 17:55:00: bei Deadline.Events.DeadlineEventManager.OnJobError(Job job, Task task, Report errorReport, DataController dataController)
2019-11-22 17:55:00: ---------- Inner Stack Trace (Python.Runtime.PythonException) ----------

I tried to find a solution, but with no success. Any idea what it could be?
Thank you very much in advance
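
Separately from the error above, the line `new_TimeOut * 60` in the script may be worth a second look. The sketch below assumes that `GetConfigEntry` hands the value back as text, which this thread does not confirm. If it does, multiplying by 60 repeats the string instead of converting minutes to seconds. The helper name `minutes_to_seconds` is hypothetical.

```python
def minutes_to_seconds(raw_minutes):
    """Convert a config value given in minutes (possibly as text) to seconds."""
    return int(raw_minutes) * 60


raw = "5"  # assumption: the config entry arrives as a string

repeated = raw * 60                # string repetition: sixty '5' characters
seconds = minutes_to_seconds(raw)  # numeric conversion: 300

print(len(repeated), seconds)
```

The same explicit `int(...)` conversion would apply to the priority value before it is assigned to the job.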

Justin_B · November 25, 2019, 4:09pm

Python’s complaining that your OnJobError only accepts 2 arguments, but is being given 4. Hence it blows up on you.

If you check the scripting reference for OnJobError you’ll see that it’s using 3 parameters - Job, Task, and errorReport. So once you add in the self that each Python function gets passed (for reasons) you’ve got your 4!

So you need to change your def OnJobError(self, job): into def OnJobError(self, job, task, errorReport):. Now I didn’t know you got the error report in an OnJobError callback which gives us some extra potential. I haven’t tried it, but I bet a person could parse the error report to see what the cause of the error was, and act accordingly.

Otherwise great work! There’s a lot of power at your fingertips here!
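
For anyone who wants to see the argument-count mismatch in isolation, here is a minimal sketch in plain Python. `EventDispatcher` is a hypothetical stand-in for whatever Deadline does internally when it fires the event; it is not Deadline's API. Only the two handler signatures come from this thread.

```python
class EventDispatcher:
    """Hypothetical stand-in for the code that fires the job error event."""

    def __init__(self):
        self.handlers = []

    def fire(self, job, task, error_report):
        # Every registered handler is called with three arguments.
        return [handler(job, task, error_report) for handler in self.handlers]


class BrokenListener:
    def OnJobError(self, job):  # accepts only self + 1 argument
        return job


class FixedListener:
    def OnJobError(self, job, task, errorReport):  # self + 3 arguments
        return (job, task, errorReport)


def try_fire(listener):
    """Register the listener, fire one event, and report what happened."""
    dispatcher = EventDispatcher()
    dispatcher.handlers.append(listener.OnJobError)
    try:
        return dispatcher.fire("job", "task", "report")
    except TypeError as exc:
        return "TypeError: %s" % exc


broken_result = try_fire(BrokenListener())  # a TypeError message
fixed_result = try_fire(FixedListener())    # the three values, passed through
print(broken_result)
print(fixed_result)
```

The exact wording of the TypeError differs between Python versions, but the cause is the same: the bound method counts `self` as one of its arguments.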

nomte · November 26, 2019, 5:12pm

Thank you very much for the detailed explanation, Justin!
I will try it ASAP. The fact that we also get the error report is great, since I only want it to react to the “TaskTimeout” error.
Looking forward to testing it.

Justin_B · November 26, 2019, 6:41pm

Let us know how it goes! And if you’re comfortable you could even share what you’ve made. I’m sure someone in the future would be very happy to use your stuff as a jumping-off point if you get it working.

nomte · November 28, 2019, 9:18am

Hi again,

It is working like a charm! (Not on my first try, hehe.) The ReQueueMe event options allow me to automatically requeue an errored job with a new priority, a new timeout (if any) and an assignment to a new pool (we use a pool to prioritize critical jobs). Attached are the py and the param files.
ReQueueMe.zip (1.3 KB)

What I am trying to do now is to trigger the event only with the OnTaskTimeout error. I was trying to parse the errorReport in order to filter the execution, but had no success and I was unable to find specific info about it.

If I add the line:

self.LogInfo(errorReport)

I get the following error:

Event Error (OnJobError): TypeError : No method matches given arguments

Any clue on how to proceed?
Thank you very much for your help!

Alo

Justin_B · November 28, 2019, 3:44pm

So that error is Python saying some method didn’t know what to do with what it was given. self.LogInfo expects to get some text to print to the log.

Something I had neglected to think about is that errorReport is an object with properties! If you check out the errorReport docs you’ll be able to see all the properties we’re able to use on that errorReport object.

So you’ll probably need to replace self.LogInfo(errorReport) with self.LogInfo(errorReport.ReportMessage).

For a little bonus info, LogInfo() expects a string as input. So be aware of your output’s type, there might be a little massaging to do before it’s happy with what you’re giving it.

Give that all a shot and let us know how it goes for you!
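
A small sketch of the difference, using stand-ins only: `FakeReport` and `log_info` below are hypothetical and merely imitate an object with a `ReportMessage` property and a logger that accepts text only. They are not Deadline's classes.

```python
class FakeReport:
    """Hypothetical stand-in for the errorReport object."""

    def __init__(self, message):
        self.ReportMessage = message


def log_info(text):
    """Stand-in for a logger that, like LogInfo, only accepts a string."""
    if not isinstance(text, str):
        raise TypeError("No method matches given arguments")
    return "INFO: " + text


report = FakeReport("example error text")

# Passing the whole object is rejected...
try:
    log_info(report)
    rejected = False
except TypeError:
    rejected = True

# ...while passing one of its text properties works. Wrapping the value in
# str() is a cheap safeguard in case a property is not already a string.
line = log_info(str(report.ReportMessage))
print(rejected, line)
```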

nomte · December 2, 2019, 10:12am

Finally done! It only reacts to a timeout error. It works and I am happy. However, something I am not really proud of is the fact that it reacts to the error message as a string. I would prefer reacting to a constant error ID* or something more consistent. The day that message is modified, the event will not trigger anymore. But meanwhile, we are good.

In case it is useful for anyone, here are the files:
ReQueueMe_v02.zip (1.4 KB)

Thank you very much Justin for the help and support

Best,
Alo

*I can retrieve the errorReport.ReportID property, but it seems to be a one-time unique ID related to the JobID.
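
A minimal sketch of the string-based filter described in this post, written so that the fragile part lives in one place. The attached script is not reproduced in this thread, so this is not its actual code. The marker text and the sample messages are hypothetical: the real wording of Deadline's timeout message is not quoted here and has to be taken from an actual error report.

```python
# Hypothetical marker; replace it with text copied from a real timeout report.
TIMEOUT_MARKER = "timeout"


def is_timeout_error(report_message, marker=TIMEOUT_MARKER):
    """Return True if the report message looks like a timeout error.

    The comparison ignores case so that small changes in capitalization
    do not silently disable the event.
    """
    return marker.lower() in (report_message or "").lower()


# Made-up messages, for illustration only.
timeout_hit = is_timeout_error("Task exceeded its Timeout limit")
other_error = is_timeout_error("Renderer exited with a non-zero code")
empty_report = is_timeout_error(None)
print(timeout_hit, other_error, empty_report)
```

Keeping the marker in a single constant (or in a .param option) means that a future change in the message wording needs only one edit.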

Justin_B · December 2, 2019, 3:43pm

While we’re thinking about strings changing, we’re working on going from ‘Slave’ to ‘Worker’. So there’s a good chance that’ll change at some point. Keep an eye out for it during the 10.1 release(s).

I’d like to have a unique ErrorID, but I can’t think of a way for us to implement it without indexing all potential errors. Which would get out of hand very quickly.

Otherwise, great work! I hope you’ve got some more ideas for automating stuff in Deadline!