AWS Thinkbox Discussion Forums

Feature Request: Strict RAM restraints, cancel/throw error

We’ve been battling against jobs on our farm that consume epic proportions of RAM and end up stalling out machines. We have our own tools that monitor RAM usage, kill processes, shut down slaves, etc. I was hoping Deadline could offer an option to handle this instead. What would be nice to see is a feature that lets the slave check the amount of RAM being used by the slave and whatever program it launches, be it 3ds Max, Maya, or anything else. If the RAM consumption goes above a threshold, the slave throws an error on the job it was rendering and cancels the render.

This would prevent the paging/swapping and eventually lead to a cleaner farm. The user gets feedback about the job via the job reports, and the farm stays healthy. (We’ve been battling machines stalling/locking due to epic page file usage.) In an ideal world these users wouldn’t be submitting jobs that try to use so much memory, but it’s happening, and we need countermeasures for that scenario.

So this would be analogous to Task Timeout, but instead of failing a Task because it exceeded a time limit, it would fail a Task because it exceeded a RAM limit. Is that correct?

It’s an interesting idea. One tricky part is that some programs tend to generate brief RAM spikes that then quickly drop back to a reasonable level. Over the life of the process there might be a few brief high spikes, but overall there would be very little swapping because the spikes are brief. These kinds of processes would be tripped up by a Task RAM constraint.

I think you would need two variables here:
a. RAM Level Timeout: measured as a percentage of RAM in use on the system, e.g. ~98%; default = 0, meaning disabled.
b. RAM Timeout Threshold: measured in seconds/minutes, i.e. the machine has to hold at x% RAM level for this period of time, e.g. ~2 minutes; default = 0, meaning disabled.
This should smooth out the RAM spikes that James mentioned and reduce false positives, while hopefully still killing a slave before it starts thrashing.
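To make that concrete, here’s a rough Python sketch of the two-variable check (the names, thresholds, and the use of the third-party psutil module are my own illustration, not anything Deadline ships with): a spike that drops back below the level before the timeout expires resets the clock, while sustained pressure kills the render.

```python
import time
import psutil  # assumed available; third-party module for system memory queries

RAM_LEVEL_PERCENT = 98.0    # "RAM Level Timeout": trip point, as % of system RAM in use
RAM_TIMEOUT_SECONDS = 120   # "RAM Timeout Threshold": how long it must hold above the trip point
POLL_SECONDS = 2            # sample interval; must be well below the threshold to be meaningful

def watch_and_kill(render_proc):
    """Kill render_proc (a psutil.Process) if system RAM stays above the trip
    point for longer than the timeout. Brief spikes that drop back reset the clock."""
    over_since = None
    while render_proc.is_running():
        used_percent = psutil.virtual_memory().percent
        if used_percent >= RAM_LEVEL_PERCENT:
            if over_since is None:
                over_since = time.time()                  # pressure just started
            elif time.time() - over_since >= RAM_TIMEOUT_SECONDS:
                render_proc.kill()                        # here Deadline would fail the task instead
                return True
        else:
            over_since = None                             # dropped back below the level; reset
        time.sleep(POLL_SECONDS)
    return False
```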

Currently, the slave only pulls the RAM usage of the rendering process at the same rate it updates its slave status (a setting that can be configured in the Performance Settings in the Repository Options). If a RAM timeout is set, we would probably have to increase how often the slave pulls this information.

What’s a reasonable minimum for the RAM timeout? 30 seconds? 1 minute? If we make it too low (e.g. 5 seconds), then the slave would have to be pulling this info at least every 5 seconds for it to matter.

That’s the tricky part… Some of the spikes can happen within a couple of seconds, and off goes your machine to eternal limbo. So the monitoring would have to run at least every couple of seconds.

Another (major) challenge is that Deadline is a managed application, and is at the mercy of all the application layers beneath it. Our monitoring tool is in a similar position: it’s running in Python (and currently checks every ~10 seconds, I believe). Yet when V-Ray starts loading those buffers up, both Deadline and our Python monitoring tool hang in a good portion of cases because resources dry up (the timers simply stop firing).

Deadline is the ultimate farm management software, and it would be great if it could reliably manage all aspects of its subprocesses in a way that’s not limited by implementation details (such as .NET’s memory/resource allocation methodology). Implementation details should not drive the actual functionality; it should be the other way around. Ideally, this piece of monitoring would be a simple C++ service with preallocated resources, running at realtime process priority. Thoughts?

I actually disagree. I think polling every minute or two (or whenever the slave info is updated) would be perfectly adequate. Having a render node swap for a short period of time isn’t worth the trouble of trying to suss out and kill. CPU time is much cheaper than person time.

We are currently polling every 10 seconds with our own tool, and it’s not adequate, hence this thread :slight_smile:

RAM usage can go up really fast and then stay there, so spiking is the wrong word here:

(Attached screenshot: Capture.PNG)

What we want to have is a “hard” limit: no swapping, ever. If a job needs to swap on a 128 GB machine, we are doing something wrong, and the render has to be kicked back to the artist.

The current behavior is that these jobs simply disable 1-200 machines every night, which is a considerable percentage of our machines. The farm has to be able to manage this and throw errors appropriately, instead of us having to send IT folk to our data center to hard reset these machines every morning… as you said, machine time is cheaper than person time.

Yes, but what does it matter if it’s kicked back 30 seconds earlier?

I’m not sure what you mean? They never get kicked back currently. The entire machine (Deadline slave, OS, etc.) is completely frozen due to the swapping until someone intervenes manually.

To clarify, the current behavior is:
a. A job goes over system RAM and the machine starts swapping. Managed apps, and even the OS, basically stop functioning.
b. The slaves freeze up completely (they never come back until manually reset).
c. Task timeouts are not handled by Deadline, since Deadline itself is also frozen.
d. The stalled-machine handling eventually finds them and requeues the tasks (after a long period of time).
e. The machines stay offline until someone finds them in the morning, so those machines have wasted thousands of dollars in CPU time.
Jobs like this can take out large chunks of the farm, not to mention that they take a much longer time to actually get marked as failed, since the stalled machine warning has to catch them instead of the normal error handling.

Ideally, what we would have instead is:
Tasks hit the hard RAM limit and fail instantly. The job accumulates enough errors, gets failed, the artist gets notified, and the farm stays healthy. No machines go offline or stall, and nobody has to be dispatched to hard-reset machines throughout the day.

Believe me, there is a reason this thread started. We have been fighting this issue for many months now, and it’s an increasingly unmanageable problem as we get into show crunch times, when people have less and less time to optimize their scenes. We currently (as I write this) have ~260 machines either offline or stalled, the majority caused by this issue. ALL of these machines have to be manually fixed, one by one. Tomorrow it will be a similar number. This means ~200 machines might as well not exist. That’s hundreds of thousands of dollars, wasted completely.

This also takes about 10-20 minutes out of a lot of artists’ work time every day, since when they come in in the morning, their workstations (used for rendering overnight) can also be frozen up. Instead of being able to get to work immediately, they have to notify IT, who find their box in the data center and hard-reset it.

Sorry, I see what you mean now. I’ve run into this same issue, and yeah, the only way to recover is to hard-reboot the node.

Example of this kind of thing (albeit Linux-specific, and written in Perl): https://github.com/pshved/timeout

I have a feeling this kind of thing would be easy to do with platform-native binaries (and it would probably also resolve a lot of weird edge-case issues we run into currently). On Linux, I think you could use fork -> setrlimit -> exec[ve] to actually cap the child process’s available address space, so that trying to allocate past that boundary fails outright (usually killing the process) instead of dragging the machine into swap. On Windows… I have no idea.
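For the Linux half, a minimal sketch of that fork -> setrlimit -> exec idea in Python might look like the following (the function name and the 120 GB figure are just illustrative; RLIMIT_AS caps the child’s virtual address space, so allocations beyond the cap fail inside the renderer rather than pushing the whole node into swap):

```python
import os
import resource
import sys

def spawn_with_address_space_cap(cmd_args, limit_bytes):
    """Fork, cap the child's virtual address space, then exec the render process.
    Allocations past limit_bytes fail in the child, so it errors out or dies
    instead of dragging the whole machine into swap."""
    pid = os.fork()
    if pid == 0:
        # Child: apply the cap before exec so the renderer inherits it.
        resource.setrlimit(resource.RLIMIT_AS, (limit_bytes, limit_bytes))
        os.execvp(cmd_args[0], cmd_args)
        os._exit(127)  # only reached if exec itself fails
    return pid

# Hypothetical usage: cap a render command at 120 GB on a 128 GB node.
if __name__ == "__main__":
    child = spawn_with_address_space_cap(sys.argv[1:], 120 * 1024 ** 3)
    _, status = os.waitpid(child, 0)
    print("render exited with status", status)
```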

So maybe what Deadline would need in the shorter term is a sort of platform-native “arena” process that actually exec’s the render process…

Yes, it’s as Laszlo said. Hundreds of machines die within a period of two days because scenes get submitted to the farm with outrageous RAM requirements. We have previously operated under the policy that absolutely zero RAM swapping is allowed on the farm. If a job requires over 128 GB of RAM, then there is obviously something wrong and the job should be rejected entirely. The problem with these jobs is that many of the machines they land on get completely locked up, and we just end up with a STALLED SLAVE REPORT, which to most people doesn’t offer enough information to know what the problem was. Like Laszlo said, this stalled slave report tends to come after a long period of rendering (sometimes over 12 hours before the machine finally stalls, having consumed its entire page file, and Windows dies).

It would be ideal for us to have very strict restrictions for this, with a verbose message to the user that explains the situation. What we would want is zero grace for swapping, period. This would ensure that Deadline, Windows, and everything else remains responsive at all times. Even though it might cancel some jobs, it would save everyone time and money for the job to be rejected rather than attempted, stalling out a bunch of machines. I think it could eventually lead to a cleaner and more productive workflow for our projects too if we had this very strict RAM checking built into Deadline. It would force people to really think about what they’re rendering and how, instead of throwing 10 versions of something at the farm and seeing what sticks.

Obviously, having the ability to change a couple of values for this would be great. I like the idea of being able to set both the RAM ceiling and the interval for the checks. We could then tailor it to suit our needs (whatever values leave us without stalled machines each morning).

You could always disable the page file on your render nodes… :wink:

Really though, I think a platform-native method for setting a RAM cap for child processes would be the best solution without going fully platform-native across all binaries. I may experiment with my own binary for this as a proof-of-concept.

It looks like it’s possible to limit RAM usage on Windows as well via Job Objects (which we use):
msdn.microsoft.com/en-us/librar … 56(v=vs.85.aspx
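For what it’s worth, here’s a rough ctypes sketch of how that could be driven from Python: create a Job Object with JOB_OBJECT_LIMIT_JOB_MEMORY, then assign the launched render process to it. The helper name is my own, the struct layouts follow the MSDN documentation, and real code would want to create the process suspended (and check return values) so it can’t allocate anything before it’s assigned to the job.

```python
import ctypes
import subprocess
from ctypes import wintypes

kernel32 = ctypes.windll.kernel32
kernel32.CreateJobObjectW.restype = wintypes.HANDLE

# Values from winnt.h, per the MSDN documentation.
JobObjectExtendedLimitInformation = 9
JOB_OBJECT_LIMIT_JOB_MEMORY = 0x00000200

class IO_COUNTERS(ctypes.Structure):
    _fields_ = [(name, ctypes.c_ulonglong) for name in (
        "ReadOperationCount", "WriteOperationCount", "OtherOperationCount",
        "ReadTransferCount", "WriteTransferCount", "OtherTransferCount")]

class JOBOBJECT_BASIC_LIMIT_INFORMATION(ctypes.Structure):
    _fields_ = [("PerProcessUserTimeLimit", ctypes.c_int64),
                ("PerJobUserTimeLimit", ctypes.c_int64),
                ("LimitFlags", wintypes.DWORD),
                ("MinimumWorkingSetSize", ctypes.c_size_t),
                ("MaximumWorkingSetSize", ctypes.c_size_t),
                ("ActiveProcessLimit", wintypes.DWORD),
                ("Affinity", ctypes.c_size_t),
                ("PriorityClass", wintypes.DWORD),
                ("SchedulingClass", wintypes.DWORD)]

class JOBOBJECT_EXTENDED_LIMIT_INFORMATION(ctypes.Structure):
    _fields_ = [("BasicLimitInformation", JOBOBJECT_BASIC_LIMIT_INFORMATION),
                ("IoInfo", IO_COUNTERS),
                ("ProcessMemoryLimit", ctypes.c_size_t),
                ("JobMemoryLimit", ctypes.c_size_t),
                ("PeakProcessMemoryUsed", ctypes.c_size_t),
                ("PeakJobMemoryUsed", ctypes.c_size_t)]

def run_with_job_memory_limit(cmd, limit_bytes):
    """Launch cmd inside a Job Object whose total committed memory is capped."""
    job = kernel32.CreateJobObjectW(None, None)
    info = JOBOBJECT_EXTENDED_LIMIT_INFORMATION()
    info.BasicLimitInformation.LimitFlags = JOB_OBJECT_LIMIT_JOB_MEMORY
    info.JobMemoryLimit = limit_bytes
    kernel32.SetInformationJobObject(job, JobObjectExtendedLimitInformation,
                                     ctypes.byref(info), ctypes.sizeof(info))
    proc = subprocess.Popen(cmd)
    # proc._handle is the Win32 process handle (a CPython implementation detail).
    kernel32.AssignProcessToJobObject(job, wintypes.HANDLE(int(proc._handle)))
    return proc
```

Once the job’s committed memory hits the cap, further allocations inside the render process fail, so it errors out instead of eating the page file.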

However, what about the case where something else on the machine is already using significant RAM (e.g. multiple slaves rendering, an artist leaving Max open, etc.)? It’s one thing to try and limit the RAM of the rendering process, but what about other processes that Deadline has no control over?

The alternative would be for the slave to just check the system’s RAM usage and requeue its task if it hits a certain threshold.

Hi Ryan,

I see that this Job Object lets you control the RAM usage of multiple subprocesses of the job. Could all slaves be spawned within one job context by the launcher? I’m not familiar with this concept.

I think this is what we are looking for, but in a way where the limit is guaranteed by the OS. Otherwise it would not be robust enough to deal with rapidly increasing RAM usage that starves the .NET slave application of resources, stopping it from doing its checks.

In theory, but it still doesn’t solve the problem of processes that aren’t part of this Job Object (e.g. an artist leaves Max running).

Unless there was some OS level setting to control this, I don’t think it would be possible to guarantee that the system’s memory usage never goes over a certain value.

That’s true. In our context it’s not so much an issue, since when our “farm mode” kicks in, we force shut down all Max/Maya/Nuke licenses. But in general, it probably would still cause issues for other clients.

What do you think about a C++ realtime-priority service, with all its RAM usage pre-allocated in advance? I’m not sure how those would react in such situations, but it’s the best I can think of. Of course, I had no idea about these Job Objects either, so who knows… there might be some WinAPI call out there for such issues.

“There’s only so much idiot-proofing you can do before you have to remove the idiot.”
