
Adjacency weighting for time-based failure detection?

Right now, the auto task timeout feature uses (what I think is) the flat average of the completed tasks' render times, multiplied by a configured multiplier, to determine when task times are anomalous. (If Deadline already uses some kind of weighting scheme, let me know.) I was thinking about some possible extensions to this feature set, and wanted to run this one by everyone to see what the reaction was.
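For reference, here's a minimal Python sketch of how I assume the current check works (the function and variable names here are mine, not Deadline's):

def is_anomalous(task_time, completed_times, multiplier):
    # Flat average of the completed tasks' render times
    average = sum(completed_times) / len(completed_times)
    # The running task is flagged once it exceeds average * multiplier
    return task_time > average * multiplier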

The idea here would be to modify/extend the auto timeout calculation so that tasks adjacent to the one being checked are weighted more heavily. This could potentially allow the timeout multiplier to be lowered for jobs whose frame times fluctuate over the duration of the job (e.g. rendering an object that moves out of the frame over the course of the job).
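In sketch form, the change would just swap the flat average for a weighted one, with the weight function itself still to be decided (hypothetical names again):

def weighted_timeout(times, weights, multiplier):
    # Weighted mean: adjacent tasks dominate, distant tasks still contribute
    weighted_average = sum(w * t for w, t in zip(weights, times)) / sum(weights)
    return weighted_average * multiplier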

It would also be cool to have some control over the weighting, whether that's as simple as setting the number of pre/post frames that are weighted at 1.0, or something more complex.

Any thoughts on this idea?

Thanks

This sounds like a great idea, so we’re happy to discuss it more.

One question that initially comes to mind: should the calculation consider whether the adjacent tasks are completed or not? Because it’s a flat average right now, the average can be calculated as long as X tasks are finished. With this new system, would you want X adjacent tasks to be finished instead?

That’s a good question.

I think you probably understand what I’m talking about pretty well, but in case it helps anyone else, here’s a “diagram” of sorts. I don’t necessarily know what the ideal falloff curve would be, but this example uses a linear falloff (a constant step per task).

This assumes the following parameters:

  • Task 12 is the task we’re checking for auto-timeout
  • The “adjacency weighting threshold” parameter (for lack of a better term) is set to 5 (tasks).
  • The “minimum weight” is set to 0.2
  • The “adjacency falloff” is set to 5 (tasks)

(All of these parameters would be tunable.)

Top row is task IDs, bottom row is weights

Task:       0     1     2     3     4     5     6     7     8     9    10    11  [12]    13    14    15    16    17    18    19    20    21    22    23    24
Weight:  0.20  0.20  0.20  0.36  0.52  0.68  0.84  1.00  1.00  1.00  1.00  1.00    --  1.00  1.00  1.00  1.00  1.00  0.84  0.68  0.52  0.36  0.20  0.20  0.20
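Expressed as code, the weighting in that diagram would look something like this (the parameter names are invented, and this is just one possible falloff shape):

def task_weight(task_id, current_id, threshold=5, falloff=5, min_weight=0.2):
    # Distance from the task being checked for auto-timeout
    distance = abs(task_id - current_id)
    # Tasks within the adjacency threshold get full weight
    if distance <= threshold:
        return 1.0
    # Past the threshold, the weight steps down linearly over `falloff`
    # tasks, then clamps at min_weight
    steps_past = distance - threshold
    return max(1.0 - steps_past * (1.0 - min_weight) / falloff, min_weight)

For current_id=12, that reproduces the bottom row of the diagram exactly.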

So then there’s the question of whether the task list should be pre-filtered down to only completed tasks before the weighting is applied, or whether the weighting should be applied to the full task list without pre-filtering (with a sanity check up front for cases where fewer than X tasks, or Y% of tasks, have completed). My gut reaction is the latter (no pre-filtering), to prevent distant tasks from gaining increased weight just because adjacent tasks aren’t complete yet, but there may be negative side effects to that that I can’t think of immediately. Or maybe it would require a certain number of completed tasks within the adjacency threshold before switching from a flat average to the weighted one…
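To make the no-pre-filtering option concrete: the weight would be computed purely from task distance, and incomplete tasks would simply contribute nothing to either sum (hypothetical sketch, reusing the task_weight function from above):

def weighted_average_time(tasks, current_id):
    # tasks: list of (task_id, render_time); render_time is None until complete
    total = total_weight = 0.0
    for task_id, render_time in tasks:
        if render_time is None:
            # Skipped entirely -- crucially, the surviving tasks' weights are
            # NOT renormalized, so a distant task never gains influence just
            # because a nearby one hasn't finished yet
            continue
        weight = task_weight(task_id, current_id)
        total += weight * render_time
        total_weight += weight
    return total / total_weight if total_weight else None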

Anyway, definitely some things to think about. Obviously we don’t want the timeout test to become too expensive in terms of database traffic or computation.

Thanks for the additional info (and the diagram)! I think I agree with you that no pre-filtering should be done; it could produce unintended results. For example, if there was an object flying by the camera, those frames are obviously going to take longer, and with pre-filtering, odds are the initial timeouts for those longer frames would be based on the quicker frames that had already completed, which defeats the whole purpose of having a weighted system.

I’m logging all of this as part of a feature request so that we can continue to think about it. If you have any additional comments to add, just let us know!

Cheers,
Ryan

Will do. Thanks Ryan.
