
Task re-queue weirdness

I have a job in which I manually failed most of the tasks at the start to prevent them from executing. I’ve since been queueing them up one at a time to test things, and I’ve run into some weird behavior. Here are the steps to reproduce:

  1. Fail a task manually (automatic failure may work as well, but I don’t know).
  2. Resume the failed task and let it complete.
  3. Now re-queue the completed task. For some reason, it goes straight into the Failed state.

Is the job in the Failed state when you requeue the completed task? If so, then this behavior is currently by design, because when a task is requeued, its state is based on the job's state. For example, if the job was suspended and you requeued a completed task, it would end up in the suspended state.
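
In rough pseudo-code, the current re-queue behavior is something like this (purely illustrative; these names don't correspond to the actual implementation):

```python
# Illustrative sketch only; not the actual Deadline implementation.
def requeue_task(job, task):
    if job.state in ("Suspended", "Failed"):
        # The re-queued task inherits the job's state, so a completed task
        # in a failed job ends up Failed, and in a suspended job, Suspended.
        task.state = job.state
    else:
        task.state = "Queued"
```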

Cheers,
Ryan

I see. I have to say, I don’t really understand this behavior. If I need to retry a task in a job that has failed (e.g. for debugging, or just because Maya is an unstable joke), I essentially have to “re-queue” it twice.

Maybe it’s because I’m used to using other queue managers, but I expect a job’s state to be dictated by a combination of the states of all of its tasks, with some exceptions for manual user edits.

  • A job is queued until one of its tasks is picked up. At that point, it is active.
  • A job is complete if all of its tasks have completed successfully.
  • A job is failed if none of its tasks are queued or active and at least one has failed.

If I re-queue a failed task, that means I want it to run again; there’s no other reason I would do that. Therefore, at this point, I would expect the job to return to the “queued” or “active” state (I don’t really know what the difference is in Deadline terms). Basically, a task’s state should never be dictated by the state of the job.
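
To make that concrete, here's a rough sketch of the model I have in mind. The enums and the derivation function are purely hypothetical, just to illustrate the idea; they aren't Deadline's API:

```python
from enum import Enum

# Hypothetical states, for illustration only.
class TaskState(Enum):
    QUEUED = "queued"
    ACTIVE = "active"
    COMPLETED = "completed"
    FAILED = "failed"

class JobState(Enum):
    QUEUED = "queued"
    ACTIVE = "active"
    COMPLETED = "completed"
    FAILED = "failed"

def derive_job_state(task_states):
    """Derive a job's state purely from its tasks (ignoring manual overrides like suspend)."""
    states = set(task_states)
    if TaskState.ACTIVE in states:
        return JobState.ACTIVE                 # at least one task has been picked up
    if states and states <= {TaskState.COMPLETED}:
        return JobState.COMPLETED              # every task completed successfully
    if TaskState.QUEUED in states:
        return JobState.QUEUED                 # work remains, nothing currently running
    if TaskState.FAILED in states:
        return JobState.FAILED                 # nothing queued or active, at least one failed
    return JobState.QUEUED

# Re-queuing a single failed task immediately takes the job back out of FAILED:
tasks = [TaskState.FAILED, TaskState.FAILED, TaskState.COMPLETED]
print(derive_job_state(tasks))                 # JobState.FAILED
tasks[0] = TaskState.QUEUED                    # re-queue one task
print(derive_job_state(tasks))                 # JobState.QUEUED
```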

If I suspend a job, I would expect queued tasks to remain queued, active tasks to remain active (unless I specify a “kill” option when suspending), etc. Tasks in a suspended job can still be “queued”, but they are not qualified to be picked up until the job is taken out of the “suspended” state. It seems intuitive that the slaves would just use the job state as a “fast-exit” test to decide whether they can exclude any of its tasks, but that the state of the tasks is not directly tied to the state of the job.
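
In other words, something along these lines, with the job state acting only as a gate (again, the names are hypothetical):

```python
# Hypothetical eligibility check; names are illustrative only.
def slave_can_pick_up(job, task):
    # Fast-exit on the job: a suspended, failed, or completed job is skipped
    # wholesale without inspecting its tasks...
    if job.state not in (JobState.QUEUED, JobState.ACTIVE):
        return False
    # ...but the task keeps its own state the whole time; a queued task in a
    # suspended job is still "queued", it just isn't eligible until the job is resumed.
    return task.state == TaskState.QUEUED
```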

Anyway, I realize this is all tinted by my own experiences, but I’d be interested to hear your thoughts on this.

Thanks

It is in most cases. It's just that when a task is re-queued, the state it should end up in could be considered somewhat arbitrary, so it's based on the state of the job. I guess in this specific case, maybe it's more common to resume the entire failed job after fixing the problem so that all the failed tasks render again. It also sounds like being able to suspend individual tasks in version 7 will work better for this type of test scenario.

I’m not sure everyone would expect this behavior, but regardless, being able to suspend individual tasks in version 7 will also cover this case, and we’ve already added the option to suspend all tasks or all non-rendering tasks when suspending a job.

Cheers,
Ryan

I don’t quite follow… how would a re-queue operation put a task into an arbitrary state? If I re-queue a task, I don’t care what state it is or has been in; I want it to be queued back up for execution. This is an instance where the job state would ideally be dictated by a combination of its task states: If all tasks have failed, the job ends up in the Failed state. If I then re-queue one task, I would expect the job to go back into the “active” or “queued” state, since it was never manually suspended.

It probably is more common, but again, that’s a job-level operation, and it feels like it should essentially be a shortcut command that really means “re-queue all failed tasks.” However, task-level control is important; if I’m debugging something, I’m not interested in getting 80 failed tasks every time I iterate on something; I just want to retry one over and over until I can safely re-queue the rest, and right now, I have to re-queue it twice every time I want it to run one more time.
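
For the record, the semantics I'd expect are roughly this, reusing the hypothetical TaskState / JobState / derive_job_state from my earlier sketch:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Task:
    state: TaskState

@dataclass
class Job:
    tasks: List[Task]
    state: JobState = JobState.QUEUED

def requeue_task(job: Job, task: Task) -> None:
    """Task-level re-queue: back to QUEUED no matter what state it was in,
    with the job state recomputed from its tasks."""
    task.state = TaskState.QUEUED
    job.state = derive_job_state(t.state for t in job.tasks)

def resume_failed_job(job: Job) -> None:
    """Job-level 'resume failed job', which is really just a shortcut for
    re-queuing every failed task."""
    for t in job.tasks:
        if t.state == TaskState.FAILED:
            t.state = TaskState.QUEUED
    job.state = derive_job_state(t.state for t in job.tasks)
```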

Sorry, I didn’t really explain myself well there. When I said arbitrary, I was more referring to what the expected state would be. In your case, you expect it to be queued, but that might not be everyone’s expectation. Deadline has always operated in the way that it works now, and you’re the first person that I can recall to bring this up as an issue, which makes us hesitant to want to change the way it works.

In Deadline 7, you can just suspend all the tasks except the one you’re testing, so you should be able to get the behavior you want here.

Yes, that’s true, but it could also be that I’m the first person to stumble across the current behavior, or to really question any of the scheduling logic or behavior; a lack of noise does not indicate complete knowledge of (and satisfaction with) the way things work now.

For the life of me, I can’t think of a reason why anyone would think, “Here’s a completed task. I want to mark it as failed, but I don’t just want to say ‘Mark this task as failed’. What I really want to do is pretend to re-queue it and have it end up marked as failed through the use of subtle trickery.” I’m guessing the vast majority of users just don’t know this behavior exists, and I would challenge anyone to provide a use-case for it.

In the bigger picture, I can understand not wanting to change application behaviors to a reasonable degree. However, at some point I think there has to be a willingness to depart from old ways in the interest of a more intuitive and efficient end result. I’m hopeful that core components and behaviors in Deadline can continue to evolve in this way without falling into The Autodesk Trap: “Well, we released it this way once, and we won’t consider changing it in a future release in case there’s someone out there who has come to rely on this behavior.”

How about we do this: We’ll log it as a wish list item, and if it’s something that starts to gain traction, we will definitely consider making this change.

Once you get your hands on Deadline 7, and compare it to Deadline 5 (or even version 6 for that matter), I think it’s safe to say that we haven’t fallen into that trap. :slight_smile:

I should also note that we’ve been bitten in the past by changing how a feature behaved, only to have long-time users get upset that something they’ve come to expect to work one way now works a different way. So it’s not a case of us being set in our ways and unwilling to change things, it just means that we’re more cautious about it.

I just wrote up a long response and then realized I disagreed with myself… Take 2.

I agree with Nathan. If a job breaches its fail limit, then all of the remaining (non-rendering) tasks should be marked failed. That way, you don’t lose tasks 0-100, which have all been rendering for 2 hours, just because tasks 100-200 failed. As the rendering tasks complete or error, they should then be marked COMPLETE or FAILED respectively (since the job is in zero-tolerance mode once ERROR COUNT > FAIL JOB ERRORS).

If you re-queue a FAILED task, then it should attempt to re-render. But unless you clear the error logs or task error logs, it will still be in zero-tolerance mode.

At the end of the job, you might have a mix of failed and non-failed tasks, depending on TASK FAILURE detection. You could then re-queue those tasks and the job would return to RENDERING mode.

This problem: viewtopic.php?f=156&t=11956 would be fixed with this as well.

It would just be a case of changing the FAILED JOB detection to either FAIL rendering tasks or wait for them to error.

[100] Failed Job Detection Error Count
[X] Fail rendering tasks on Failed Job Detection
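
Roughly speaking, something like this, borrowing Nathan’s hypothetical TaskState enum from the sketch above (none of these names are actual Deadline settings):

```python
# Illustrative sketch of the proposed behavior; assumes the job tracks an error
# count and a task list. The names and thresholds are hypothetical.
FAILED_JOB_ERROR_COUNT = 100
FAIL_RENDERING_TASKS = False   # the proposed checkbox

def on_task_error(job):
    job.error_count += 1
    if job.error_count <= FAILED_JOB_ERROR_COUNT:
        return
    # Fail limit breached: fail the queued tasks, but leave rendering tasks
    # alone (unless the checkbox is ticked) so in-progress work isn't thrown away.
    for task in job.tasks:
        if task.state == TaskState.QUEUED:
            task.state = TaskState.FAILED
        elif task.state == TaskState.ACTIVE and FAIL_RENDERING_TASKS:
            task.state = TaskState.FAILED
    # Any task still rendering finishes as COMPLETED or FAILED on its own,
    # since the job is now in zero-tolerance mode (one more error fails the task).
```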

Fair enough, although I still have a hard time imagining who would find this particular change surprising.

I understand that, and I wasn’t implying that you guys were already in Autodesk territory in that regard. Obviously there are a lot of big changes happening between releases, but I also know you guys have a lot of long-term customers. Now, obviously you don’t want to lose them, but every now and then, even the most venerated users have to consider that there may actually be room for improvement.

This might be something we could look at changing in version 8. We’re looking at a bunch of ideas to improve our scheduling system in Deadline, and job states more or less fall under that umbrella.

Thanks again for your feedback!
Ryan

I agree, it’s odd behavior to re-queue something and have it end up failed…
