
Delay between task retries?

Is there any way to introduce a delay between retries of a task? I haven’t found anything in the repository options or the list of possible JobInfo keys.

There isn’t. What would the delay be used for? Would you just want the slave to “idle” during the delay?

No, I would just want to prevent the task from starting up again right away; the slave would be free to try something else. This gives people a chance to catch degenerate error cases without burning excess slave time hammering away at the same doomed task.

An example would be one of Maya’s many bugs writing its own .ma scene files: sometimes it just straight-up forgets to escape quotes in attribute values, causing an error when loading the scene. If this scene is referenced several levels deep into a heavy hierarchy, the load time for the scene could be ~10 minutes or more. If the task ends up failing after that interval, and I have my error-count-to-fail-a-task set to 3, I’m potentially looking at 30 minutes of wasted time if no one is keeping a close eye on the farm.

If a delay feature were added at some point, it seems like it would make the most sense as a JobInfo key.
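For example, since a job info file is just plain key=value text, I’d picture it sitting alongside the usual submission keys, something like the snippet below. (TaskRetryDelaySeconds is a made-up name for the hypothetical key; the other keys are ordinary submission keys.)

    Plugin=MayaBatch
    Name=shot_010_lighting
    Frames=1-100
    Priority=50
    TaskRetryDelaySeconds=600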

Interesting… would the Bad Slave Detection or Error Warning Message be of use?
thinkboxsoftware.com/deadlin … /#Overview

I guess the problem is that you don’t want an error to go undetected for too long, so if the job’s user was notified after 5 errors, action could be taken before the job fails.

The bad slave detection prevents slaves from getting hung up on problematic jobs, but a slave will only go back to the job once the job’s bad slave list is cleared (or there are no other “good” jobs left). However, that might work too: if the problem gets resolved, the job’s bad slave list could be cleared so that the job gets picked up again.
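Roughly speaking, the job selection with bad slave detection behaves something like the following illustrative Python sketch (the names are invented for the example; this is not the actual Deadline scheduler code):

    def pick_job(slave_name, jobs):
        """Return the highest-priority job this slave may work on.

        Jobs whose bad slave list contains this slave are skipped, unless
        no "good" jobs remain, in which case the slave falls back to them
        rather than sitting idle.
        """
        good = [job for job in jobs if slave_name not in job["bad_slaves"]]
        candidates = good if good else jobs  # fall back once no "good" jobs are left
        return max(candidates, key=lambda job: job["priority"], default=None)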

Cheers,
Ryan

The warning message facility might be of use, but the real advantage of having a delay between retries would be avoiding as much wasted slave time as possible without changing the scheduling parameters of the job. Basically, if a task fails, another task could get a chance to run before the problematic task is retried (and potentially wastes more time). That would also give the job’s owner, or someone keeping an eye on the farm, a window of time to notice the errors and possibly suspend the job to prevent further wasted time.

The bad slave detection is useful as well, but it’s not as subtle as a task retry interval (which I guess is more of a temporary task blacklist, rather than a slave blacklist).

But in the example you listed, wouldn’t marking the slave as bad for the job after 1 or 2 errors be beneficial? Temporarily delaying the task won’t fix the problem, so if it happens overnight, for example, it might still go undetected for a while. If bad slave detection is enabled, then slaves will never waste tons of CPU cycles trying to process problematic jobs when no one is around to intervene.

Also, in your example, the user will probably have to resubmit a new job with the fixed Maya file. However, if the problem can be solved without resubmission, then the user can simply clear the job’s bad slave list so that it gets picked up again.

Cheers,
Ryan

Yes, and bad slave detection will definitely be enabled. However, if I want to set its error threshold to 3 instead of 1 (in case there are, say, one or two frames in the Maya scene that crash), a task retry delay would be the difference between the same slave spending 10 minutes and spending half an hour erroring before Deadline marks it as bad (assuming the slave immediately retried the same task it had just errored on).

Yes, this is true, but the delay wouldn’t be there to try and fix the task; it would simply be a way to try and keep the slaves busy doing useful work in case someone isn’t watching the farm for a period of time.

For what it’s worth, I realize this is kind of a nit-picky feature, and if it seems like something that would be difficult to implement, it may not be worth considering at this point. However, it’s the kind of thing that can help squeeze a bit of extra efficiency out of a crammed farm during crunch time. Either way, I appreciate the discussion; it’s good to get push-back on thoughts like this. :)

Implementation probably isn’t straightforward, but part of my concern with this feature is the “voodoo” that could be perceived when a slave isn’t picking up the job with the top priority due to the delay. I guess we would need a “delayed” slave list in addition to the bad slave list so that it can help explain that a delay has been imposed. I think it’s something that we can add to the wish list for future consideration, but it would probably remain low priority for a while. :)

Cheers,
Ryan

Sure, that’s understandable, and thanks for at least keeping it in mind.

Just to reiterate, in the scenario I’m picturing, the actual delay would occur at the task level, but the duration would be specified as a JobInfo key (rather than a repository setting). Thus, for a high-priority job, every task would get at least one chance to run before the retry delay was imposed (if one were set at all). The delay feature could also set a task status to indicate that an error occurred and that retries are temporarily delayed.
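To make that concrete, the task selection I’m imagining would behave roughly like the Python sketch below. This is purely illustrative on my part: the field names and the per-job retry_delay_seconds value are made up for the example and aren’t existing Deadline behaviour.

    import time

    def next_task(tasks, retry_delay_seconds, now=None):
        """Pick the next queued task, skipping any that errored recently.

        A task that failed within the last retry_delay_seconds stays queued
        but is passed over, so the slave spends that window on other work
        instead of immediately re-running a task that is likely to fail
        again. Such a task could carry a status like "Errored (retry delayed)".
        """
        now = time.time() if now is None else now
        for task in tasks:                            # assumed ordered by priority
            last_error = task.get("last_error_time")  # None if the task never errored
            if last_error is None or now - last_error >= retry_delay_seconds:
                return task
        return None  # every remaining task is still inside its delay window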
