It would be really nice if, in 7.0, a failed job didn’t fail the tasks that are happily rendering; those tasks would finish, and the remaining tasks simply wouldn’t render later. For instance, if a single server racks up 100 license errors in a row, it wouldn’t fail the other 5 machines that have been successfully rendering for 30 minutes.
Hmm, I’m not sure if this would be the desired behaviour in all cases. What if, for example, a task would always throw an error 1 hour into a render, and when it reaches 100 errors, there are still 30 machines currently rendering? That could result in a lot of wasted render time waiting for those 30 slaves to get to the point of failure. It’s one of those things where making a change could help some cases, but could make others worse, which makes us hesitant to do so.
Cheers,
Ryan
It sounds like you want to turn job failure detection off and use task failure detection instead.
I guess that’s a good first step. But it’s not usually a failed task; it’s just a license error caused by too many machines trying to render. So neither the job nor the task is at fault, nor even the machine. It’s a “state”. Even task failure detection wouldn’t be ideal, because the slave that can’t get a license would rapidly churn through hundreds of tasks while the working machines finish up theirs.
I think the best approach would be a little bit of logic which detects a ‘state fault’.
If an error is a startup error and other tasks have gotten past startup to the rendering stage, then it’s noted as a startup fault. If the startup faults seem to be random, the job isn’t failed and the errors are assumed to be transient. If, however, there seems to be a pattern, i.e. one slave has never successfully finished a task while other slaves have, then that slave can be labelled bad.
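Something like this, as a very rough sketch (none of these names are real Deadline API; it’s just the heuristic I have in mind):

```python
# Rough sketch of the 'state fault' idea. task_history stands in for whatever
# the queue already knows about a job: which slave ran each task, whether it
# got past startup, and whether it errored. The threshold is arbitrary.

STARTUP_FAULT_THRESHOLD = 5

def find_bad_slaves(task_history):
    """task_history: list of (slave, reached_render_stage, errored) tuples."""
    # Slaves that have proven the job itself is renderable.
    finished_ok = {s for s, reached, err in task_history if reached and not err}

    # Count startup errors (errored before ever reaching the render stage).
    startup_errors = {}
    for slave, reached, errored in task_history:
        if errored and not reached:
            startup_errors[slave] = startup_errors.get(slave, 0) + 1

    bad = []
    for slave, count in startup_errors.items():
        # Only label a slave bad if other slaves have finished tasks successfully
        # and this one has never managed to; otherwise treat the errors as
        # transient and don't fail the job.
        if finished_ok and slave not in finished_ok and count >= STARTUP_FAULT_THRESHOLD:
            bad.append(slave)
    return bad
```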
Are you using limits to control license allocation?
Yes, but some licenses get checked out outside of Deadline, e.g. Nuke -i licenses.
So, looking at the root cause of this issue, it’s the inability of, say, a limit group used as a software license ‘cap’ to query your license server, ask how many licenses of a certain type are currently available, and then decide whether any more may be checked out for use by the farm, or whether a certain number should be left available so users can still pull the “-i” licenses?
I’m sure The Foundry would be more than happy to sell you more render node licenses, or to advise you that “-i” interactive licenses are not really designed for network rendering. On the flip side, Nuke licenses are expensive, and when interactive licenses are not required by users it would be efficient to utilise them as additional render nodes. So the only way to find a middle ground is if the current license check-out/availability ‘state’ can be queried and the limit group adjusted accordingly on the fly, so you never hit this error situation.
Easier said than done, although I do have some ideas.
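For example, assuming a FlexLM-style license server that lmstat can talk to, the check-out state can be read and turned into an “available to the farm” count. Here’s a rough sketch; the feature name, server address, and the ‘headroom’ kept back for interactive users are all just examples:

```python
import re
import subprocess

def free_licenses(feature, license_server="4101@licserver", headroom=2):
    # Ask the FlexLM server for its current check-out state.
    out = subprocess.check_output(
        ["lmutil", "lmstat", "-a", "-c", license_server],
        universal_newlines=True,
    )
    # Typical lmstat line:
    # "Users of nuke_i:  (Total of 10 licenses issued;  Total of 4 licenses in use)"
    m = re.search(
        r"Users of %s:\s+\(Total of (\d+) licenses? issued;\s+"
        r"Total of (\d+) licenses? in use\)" % re.escape(feature),
        out,
    )
    if not m:
        return 0  # feature not found or server unreachable; be conservative
    issued, in_use = int(m.group(1)), int(m.group(2))
    # Leave some seats free for interactive users.
    return max(issued - in_use - headroom, 0)
```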
Ah OK, that makes sense. We don’t have that problem but we’re in a similar situation with some plugins, in that some license servers are partially or completely shared between two studios with two different farms. Defining two limits, each with half of the total license count, is a potential waste, but giving both limits the full license count is asking for unnecessary failures.
The simplest idea is to keep a cron running to update the limits in the repository based on the usage in the remote studio, which I’m considering for this particular problem. However, this would probably need to run every minute in order to be granular enough. Having the slaves query the license server every time they need a limit may not be ideal either, especially since some of the license servers we’re talking about are extremely primitive (in other words, you can’t actually query them).
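For the servers that can be queried, the per-minute cron script could be roughly this simple. The limit update call below is only a placeholder (the real mechanism would be whatever deadlinecommand or the scripting API exposes in your version), free_licenses() is the lmstat sketch from earlier saved as a hypothetical license_query.py, and the limit/feature names are examples:

```python
#!/usr/bin/env python
# Run from cron every minute: resize the local farm's Deadline limit to match
# what the license server says is actually available.

from license_query import free_licenses  # the lmstat sketch above

LIMIT_NAME = "nuke_license"  # example Deadline limit name
FEATURE = "nuke_i"           # example FlexLM feature name

def update_deadline_limit(limit_name, new_max):
    # Placeholder: swap in the real call for your Deadline version
    # (deadlinecommand or the repository scripting API).
    print("Would set limit '%s' to %d" % (limit_name, new_max))

def main():
    # Naive version: the local farm gets whatever is currently free.
    # The fiddly part (not shown) is working out which in-use seats already
    # belong to the local farm, e.g. by hostname in lmstat's per-user lines,
    # so the limit doesn't shrink below what our own slaves are holding.
    update_deadline_limit(LIMIT_NAME, free_licenses(FEATURE, headroom=0))

if __name__ == "__main__":
    main()
```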