AWS Thinkbox Discussion Forums

Memory limit for jobs?

Is there a way to kill a task if it reaches a certain memory threshold? I’ve been having problems with artists sending memory-heavy jobs to the farm that take down nodes. This isn’t much of a problem with our server-grade machines, but it is really disruptive when our user machines hit the farm.

I don’t think there is any built-in functionality to monitor the percentage of memory usage and act upon it. Maybe it could be scripted as a house cleaning script that scans all Workers periodically and, if a Worker’s memory usage is above a % threshold, automatically adds it to the Deny list (formerly known as Blacklist) of the Job and requeues the Task so another Worker can take over, and the Worker in question can move on to something less demanding…
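As a rough illustration of the decision logic such a script would need, here is a minimal Python sketch. It deliberately does not call the Deadline scripting API: the `WorkerSnapshot` type, its fields, and the `find_offenders` function are hypothetical names made up for this example, and the 90% threshold is an arbitrary assumption. A real house cleaning script would have to fill the snapshots from whatever Worker information the repository exposes, and then do the Deny list and requeue steps with the API's own calls.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class WorkerSnapshot:
    """Hypothetical view of one Worker at scan time (not a Deadline type)."""
    name: str
    memory_used_bytes: int
    memory_total_bytes: int
    current_job_id: Optional[str]  # None when the Worker is idle


def memory_percent(snapshot: WorkerSnapshot) -> float:
    """Return used memory as a percentage of total, or 0.0 if total is unknown."""
    if snapshot.memory_total_bytes <= 0:
        return 0.0
    return 100.0 * snapshot.memory_used_bytes / snapshot.memory_total_bytes


def find_offenders(
    snapshots: Iterable[WorkerSnapshot],
    threshold_percent: float = 90.0,  # assumed threshold, tune per farm
) -> List[Tuple[str, str]]:
    """Return (worker_name, job_id) pairs for busy Workers above the threshold.

    Each pair is a candidate for "add this Worker to the Job's Deny list
    and requeue its Task"; performing those actions is left to the caller.
    """
    offenders = []
    for snap in snapshots:
        if snap.current_job_id is None:
            continue  # idle Workers have no Job to be denied from
        if memory_percent(snap) > threshold_percent:
            offenders.append((snap.name, snap.current_job_id))
    return offenders
```

Keeping the check as a pure function makes it easy to try different thresholds against recorded numbers before letting a script requeue anything on the live farm. Note that a periodic scan can only react between scans, so a Job that exhausts memory quickly may still take the machine down before the next pass.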

The native behavior of Deadline is that if a Worker fails Tasks of the same Job several times for whatever reason, it adds the Job to its own ignore list to avoid failing for the same reason again and again (which would be equivalent to madness according to Einstein). However, if running out of memory crashes the Worker (as you said your nodes are going down), then this wouldn’t help you much. You might want to clarify what the effect is on your Workers - do they crash/stall?

The manual way to deal with this would be to create two Deadline Groups: one for the high-memory machines, and one for the rest of the office machines as you described. Then make sure the artists always submit their memory-heavy Jobs to the High Memory group, and assign Jobs with low memory requirements to the second group. This way, you are giving Deadline a clear direction about which machines should work on which Jobs based on the hardware requirements. The same can be implemented using Limits and their own Deny list (formerly Blacklist) in case you are already using Groups for something else.
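To make the routing idea concrete, here is a small Python sketch of the rule an artist or a submission script would be applying. Everything in it is an assumption for illustration: the group names `high_memory` and `office`, the machine names, the 24 GiB cutoff, and the idea that someone can estimate a Job's peak memory up front. It does not read from or write to a Deadline repository.

```python
from typing import Dict, List, Optional

# Hypothetical group membership, typed in by hand for the example.
GROUPS: Dict[str, List[str]] = {
    "high_memory": ["srv-01", "srv-02"],
    "office": ["ws-01", "ws-02", "ws-03"],
}


def pick_group(estimated_peak_gib: float, cutoff_gib: float = 24.0) -> str:
    """Choose the group a Job should be submitted to.

    Jobs expected to need at least `cutoff_gib` of memory go to the
    high-memory machines; everything else goes to the office machines.
    """
    return "high_memory" if estimated_peak_gib >= cutoff_gib else "office"


def eligible_workers(
    group: str, groups: Optional[Dict[str, List[str]]] = None
) -> List[str]:
    """Return the machines allowed to pick up a Job in the given group."""
    table = GROUPS if groups is None else groups
    return list(table.get(group, []))
```

For example, `eligible_workers(pick_group(48.0))` yields only the server names, so a heavy Job never lands on an artist's workstation. The weak point is the estimate itself: the scheme only helps if memory-heavy Jobs are actually submitted to the right group.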

Thanks for the ideas. I already have the user machines segregated from the server-grade machines. During normal times, we would put user machines on the farm at night. If the machines crashed, it was no big deal as we would fix it in the morning. But with everyone WFH, it’s a real hassle to restart these machines when they crash, which is why I’m looking for a way to prevent jobs from using all the memory. I’m just trying to get back that extra horsepower at night without depriving the artists of their machines in the morning.
