
Task Timeout and Weighted Priority

Hi,

We’re just doing some experiments to optimise the way our render farm works. We want a more fluid priority system, so that jobs get dropped down the list if they accumulate errors, but also so that the render farm doesn’t pick up 20 jobs at once. So we’ve switched over to “Pool, Weighted, First-in, First-out” and this seems to be working for us. We’ve just got a couple of challenges I’m hoping to resolve:

We’ve had to give a lot of weight to the priority (1000) because we can’t put a small enough number in the ‘Submission Time Weight’, as it’s limited to 4 decimal places. Can this be increased?

At the moment the task timeout works as a hard limit. Unfortunately this means that a task can get to 95% complete and then crash out. Is there a way to add more logic to this? E.g. if it’s at 95% complete, allow an extra 10–20% of render time? This would save us lost render time. The other suggestion was to have a warning timeout, almost like the data warnings you can set on your phone, where you have one time that warns the user/system and another that errors the task. If we could then feed this back into the weight as well, we could reduce the priority of the job and lessen the impact of unexpectedly slow jobs.

Alternatively, you might have suggestions for how we can make this work within the current setup. I have suggested that the artists submit preview jobs to see how long their renders take, but this doesn’t always happen, and it only takes one job to slow down the whole system!

Thanks

Nick

For the weighted algorithm, I’ll have to ask around.

As for the timeouts, is that to work around a particular problem? We’d probably want to try resolving the underlying issue, but failing that, we should be introducing an event that can run at regular intervals on all Slaves, which should give you the hooks you’d need to implement your own timeouts. Something like this:

  1. Is the Slave rendering?
  2. Is the job of type X?
  3. Has the Slave state been “Rendering” for Y minutes?
  4. Is the task at < Z% complete?
  5. If all of the above are true, have the Slave fail its current task.

You can even go one more step and check CPU usage. I’m excited to see all the interesting events people will be able to build. It should be implemented for Deadline 8.1’s release.
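
In rough Python terms, the check could look something like this (the periodic hook and the Slave/job/task properties here are hypothetical placeholders, since the event doesn’t exist yet):

```python
# Hypothetical periodic "heartbeat" handler. The hook itself and the
# slave/job/task properties are placeholders for illustration only.

TARGET_PLUGIN = "MayaBatch"   # job type X (example value)
MAX_RENDER_MINUTES = 120      # Y: how long a task may sit in "Rendering"
MIN_PROGRESS_PERCENT = 90.0   # Z: below this, the task is considered stuck

def on_heartbeat(slave):
    # 1. Is the Slave rendering?
    if slave.status != "Rendering":
        return

    # 2. Is the job of type X?
    if slave.current_job.plugin_name != TARGET_PLUGIN:
        return

    # 3. Has the Slave been in the "Rendering" state for Y minutes?
    if slave.minutes_in_current_state < MAX_RENDER_MINUTES:
        return

    # 4. Is the task less than Z% complete?
    if slave.current_task.progress_percent >= MIN_PROGRESS_PERCENT:
        return

    # 5. All of the above are true: fail the current task
    #    (a CPU-usage check could slot in here as an extra guard).
    slave.fail_current_task("Custom timeout: long render with low progress")
```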

Hi Edwin,

With the weighted algorithm, it’s just that I can’t put a small enough value in the box because it’s limited to 4 decimal places. It doesn’t really cause a problem, apart from the fact that my weighted values have to be massive to compensate.

The new event might be part of the solution, but we’d need some way to feed that back into the priority of the job. We basically run a load of technical films through the studio; generally we can expect a certain frame time, but sometimes a material or object will cause the file to slow right down, or it will just creep over the task timeout and fail. Having a way to reduce priority on these jobs rather than failing them would mean problem jobs can be pushed to the back while the faster jobs run through. They would then pick up once the farm is clear, and even if they’re slow we might still get the renders.
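
Just to sketch the sort of behaviour I mean (the hook and the job properties here are imaginary, purely to show the idea):

```python
# Imaginary hook: instead of failing a job that's running long, push it
# down the queue so the faster jobs overtake it.
PRIORITY_STEP = 10
MIN_PRIORITY = 1

def on_slow_job_detected(job):
    job.priority = max(MIN_PRIORITY, job.priority - PRIORITY_STEP)
    job.save()  # persist so the scheduler sees the new priority
```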

Nick

It sounds like you could definitely use some sort of hybrid approach to this… something like balanced scheduling (so each job gets a share of the farm), but where the share is dictated by the priority and how fast the job is. I’m hoping we open things up enough in Deadline 9 to allow for custom queue scripting (it’s definitely the plan)… I think even a generic formula with more variables would work great.
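
Just to illustrate what I mean by a “generic formula with more variables”, something along these lines (the coefficients and job fields are made up, and this isn’t the actual weighted calculation Deadline uses today):

```python
def job_weight(job,
               priority_weight=1000.0,    # same idea as the current Priority Weight
               wait_time_weight=0.0001,   # Submission Time Weight equivalent
               error_weight=-50.0,        # errors push a job down the queue
               overrun_weight=-200.0):    # running over the projected frame time hurts too
    """Illustrative scheduling weight: a higher weight renders sooner."""
    projected = max(job.projected_frame_minutes, 1e-6)  # avoid divide-by-zero
    overrun = max(0.0, job.average_frame_minutes / projected - 1.0)
    return (priority_weight * job.priority
            + wait_time_weight * job.minutes_since_submission
            + error_weight * job.error_count
            + overrun_weight * overrun)
```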

I think we should play around with some ideas here and see what can be done once the heartbeat event is in place.

Up until now we’ve always relied on the pools and priorities to manage the jobs, but we do lose render time when we hit the task timeout or have jobs that start to error. I’m already liking the way weighting works, although it’s taken a little playing to work out (it doesn’t seem to update in my Monitor when I change the repository settings; I have to restart the Monitor. Not sure if this might be a bug?). Recently we’ve been trying to get the artists to submit everything at priority 50, and then I balance the jobs depending on the deadlines. Unfortunately, if I’m not on top of this we can sometimes end up with the render farm split across a large number of jobs, only partially finishing all of them instead of prioritising a few.

My hope is that with the weighting I can have lots of jobs on the same priority but balance them so that only 4–6 render at once, and should one start to error it gets pushed down the queue. Being able to set a projected render time on a job that is then taken into account in the priority could also be very useful, I think. Similarly, this could be useful on the slaves: if a slave is struggling on a job or has maxed out its RAM, being able to push that slave onto other jobs would help. It could then try again once there’s nothing else for it to do.

The other thing I’ve been thinking could be useful is interruptible pools. We have a number of machines that render After Effects jobs. At the moment I’m leaving those out of any other pool, so that when a job is submitted there’s no wait for it to start. Being able to have those machines drop anything else for the After Effects pool would be useful. I know we can do this per job, but that’s not really what I’m after.

Thanks

Nick

It’s funny you mention “interruptible by pool”. I made a note for myself to create an issue for that about two days ago, based on your last post. I’ve now created it properly.

In the meantime, we could create an OnJobSubmitted event to enable job interruption if the job is submitted to a certain pool, but that’s not a dynamic solution.
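
Something along these lines, as a sketch (the pool name is just an example, and you’d want to check the property names against the scripting reference for your Deadline version):

```python
# Sketch of an OnJobSubmitted event plugin that flags jobs submitted to a
# given pool as interruptible. Pool name is an example value only.
from Deadline.Events import DeadlineEventListener
from Deadline.Scripting import RepositoryUtils

def GetDeadlineEventListener():
    return InterruptibleByPoolListener()

def CleanupDeadlineEventListener(eventListener):
    eventListener.Cleanup()

class InterruptibleByPoolListener(DeadlineEventListener):
    def __init__(self):
        self.OnJobSubmittedCallback += self.OnJobSubmitted

    def Cleanup(self):
        del self.OnJobSubmittedCallback

    def OnJobSubmitted(self, job):
        # Jobs submitted to the general pool can be preempted by
        # higher-priority work (e.g. the After Effects jobs).
        if job.JobPool == "general":
            job.JobInterruptible = True
            RepositoryUtils.SaveJob(job)
```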

Yeah, it’s something that I’ve thought would be useful for a while. Sometimes you just get a slow job that you want to be able to break into if a specific type of job is submitted (e.g. a still render or comp job). Ideally we’d want some intelligence to it though, to prevent it from booting a frame that’s 95% complete. Normally I would manually kick a slave if an artist asks, or tell them how long it will be before a slave jumps on. So I’d want to be able to say something like “if no slave is set to complete within ‘n’ minutes, or task progress is < 10%, boot the slave”.

Nick

Was there ever any progress or discussion about the ‘Timeout Warning’ setting?

We’re seeing a lot of wasted time with renders failing with only a short time to go - so this warning threshold would be a benefit to us as well.

Our current solution is a custom event plugin that catches the timeout error callbacks and, if the job is progressing, calculates a new timeout based on the existing render time and progress. However, this is blunted by the automatic requeuing of jobs that time out: we still lose all the progress so far without a chance to do anything about it. A ‘Timeout Warning’ callback would give us the chance to recalculate the timeout settings before the job errors and fails/requeues itself.
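
For what it’s worth, the recalculation itself is only a few lines; roughly this (a simplified version of the idea, not our exact plugin code):

```python
def recalculate_timeout(elapsed_minutes, progress_percent, headroom=1.2, min_extension=5):
    """Project a new task timeout from the render time and progress so far.

    Returns None when there's no progress info, so the caller can leave
    the existing timeout untouched.
    """
    if progress_percent <= 0:
        return None
    projected_total = elapsed_minutes * 100.0 / progress_percent
    return max(projected_total * headroom, elapsed_minutes + min_extension)
```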

Alternatively, the option to have a job that has hit its Timeout limit send a notification/error but continue to render could be a solution, but then the Timeout effectively becomes just a ‘warning’ anyway. This is close to the “Notify” setting, but as far as I can tell there is no callback triggered here, and we’d ideally like to avoid sending emails every time we recalculate the Timeout settings.

I’ve reached out to a developer about these. I just want to hash out and understand what it would take to have a soft timeout that just notifies.

We’ve talked about implementing Notification plugins before, which would provide hooks for this. We’re also looking at workshopping the design of Plugins a bit, since this is an area that definitely needs modernizing. All in all, we’re definitely looking to give end users more control over plugin behaviour without necessarily requiring them to modify the stock plugins themselves.

So yeah, if you’ve got any other specific feedback on what your ideal system would look like, I’d love to factor it into any Plugin re-design we do!

Ideally a setting that ties in with the Timeout value:

  • User can specify a Timeout setting as usual
  • User can enable the ‘Timeout Warning’ feature if required
  • User enters a value, e.g. 5 minutes, which denotes how far in advance of the timeout the ‘Timeout Warning’ triggers
  • User selects the behaviour: notification, error message/callback, or condition-based

Condition-based: set the behaviour based on the task status, i.e.

  • if the job is > X% complete: extend the timeout by Y minutes
  • if the job is < Z% complete: recalculate the timeout and drop the priority by 10

This is what I already do with the timeout error callback; however, it’s blunted by the automatic re-queuing.
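
As a sketch, the condition-based branch of a ‘Timeout Warning’ callback could be as simple as this (the callback and the job/task fields are hypothetical, and the thresholds are just examples):

```python
# Hypothetical 'Timeout Warning' callback, fired a few minutes before the
# hard timeout while the task is still rendering.
EXTEND_THRESHOLD = 80.0   # X%: nearly finished, just give it more time
RECALC_THRESHOLD = 30.0   # Z%: barely started, re-plan and deprioritise
EXTEND_MINUTES = 20       # Y minutes
PRIORITY_DROP = 10

def on_timeout_warning(job, task):
    progress = task.progress_percent
    if progress > EXTEND_THRESHOLD:
        job.task_timeout_minutes += EXTEND_MINUTES
    elif progress < RECALC_THRESHOLD:
        # Project a new timeout from the progress so far, with 20% headroom.
        job.task_timeout_minutes = int(task.elapsed_minutes * 100.0 / max(progress, 1.0) * 1.2)
        job.priority = max(1, job.priority - PRIORITY_DROP)
    job.save()
```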

My two cents:

I think we could do it this way:

Have a ‘soft-timeout’ event that can trigger before the real timeout. Allow changing of the soft/hard timeouts from the event in such a way that the Slave notices the change to the job (may be there already). Then, provide some event script with configurable options that can take into account any needed rules.

That should allow for different business cases like “If my priority is X or my pool is Y, adjust the timeout”.
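
As a strawman, the “configurable options” could be little more than a rule table the event script walks, along these lines (every name and field here is made up for illustration):

```python
# Strawman rule table for a soft-timeout event script. The job methods and
# fields are illustrative, not existing Deadline settings.
RULES = [
    # (predicate on the job,                  action when the soft timeout fires)
    (lambda job: job.pool == "aftereffects",  lambda job: job.extend_task_timeout(30)),
    (lambda job: job.priority >= 90,          lambda job: job.extend_task_timeout(15)),
    (lambda job: True,                        lambda job: job.drop_priority(10)),  # fallback
]

def on_soft_timeout(job):
    for predicate, action in RULES:
        if predicate(job):
            action(job)
            break
```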
