AWS Thinkbox Discussion Forums

Stalled Slaves

Can anyone give some insight into why slaves get marked as stalled?

I'm trying to dive into why our slaves stall out after certain rendering tasks. Vray DR jobs (submitted through the Deadline to Max submitter and marked as DBR jobs) tend to cause a lot of the stalling.

Thanks!

Here you go. :slight_smile:

docs.thinkboxsoftware.com/produ … ave-states

Thanks for that link. When our slaves are stalling out, if we enable Ping in the monitor, the ping shows as timed out. We can’t even ping the machine through the command prompt. However, if we send a restart slave command through the deadline monitor, the slave comes back online. I know we can set the option to restart stalled slaves, but I’d like to get to the root of them stalling out or timing out on the ping.

We also have slaves that are stalling out in the middle of a render, be it a DBR render or a tile render. Those are the two types of rendering we do, since we only render high-res stills: DBR or Jigsaw tile rendering. We try to throw the entire farm at a single render to get it out fast. We had one tile render where a single tile's machine stalled out and held on to the task, turning what should have been a 15-minute render into over an hour because we didn't catch the stall in time.

It’s like I have 22 teenagers with bad attitudes working on my rendering farm and I have to go kick them in the butt every so often to get them working again.

Teenagers with attitude you say?
tvtropes.org/pmwiki/pmwiki.php/ … thAttitude

Joking aside, the ping failing from the command prompt may just mean ICMP is blocked on that machine. ICMP is different from TCP and UDP and can be blocked separately, so it isn't always indicative of a problem (though it may be here). If the 'restart' command is working, then the Launcher is still alive and the networking is fine, so I think we need to focus on the Slave process.
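One thing you could try while you're digging: instead of relying on ICMP, test a TCP connection straight to the Launcher's listen port. Here's a minimal sketch, assuming the default Launcher port of 17000 and some made-up hostnames; adjust both to match your repository settings:

```python
# Minimal sketch: check TCP reachability to a render node instead of relying
# on ICMP ping. Assumes the Deadline Launcher listens on port 17000 (the
# usual default); change it if your repository is configured differently.
import socket

def tcp_reachable(host, port=17000, timeout=3.0):
    """Return True if we can open a TCP connection to host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for node in ["rendernode01", "rendernode02"]:  # hypothetical hostnames
        print(node, "reachable" if tcp_reachable(node) else "unreachable")
```

If the TCP check succeeds while ICMP ping times out, that points at ping being blocked rather than the box being down.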

I’ve seen a few cases where the Slave (or processes internal to the Slave) has crashed and so they’re either not running or in a zombie state. Can you log into one of those machines when it happens and check to see if “deadlineslave” is actually a process that’s currently running?
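If it helps, here's a rough sketch of how you could check that from a script rather than eyeballing Task Manager. It's Windows-only and assumes the process image name is "deadlineslave.exe", so adjust if your install names it differently:

```python
# Rough sketch: list any running Slave processes on this machine so you can
# tell "crashed / not running" apart from "running but hung".
import subprocess

def find_slave_processes():
    # tasklist with an image-name filter; check=False so a missing match
    # doesn't raise, we just get an empty result.
    out = subprocess.run(
        ["tasklist", "/FI", "IMAGENAME eq deadlineslave.exe"],
        capture_output=True, text=True, check=False
    )
    return [line for line in out.stdout.splitlines()
            if "deadlineslave" in line.lower()]

procs = find_slave_processes()
print("Slave process found:" if procs else "No Slave process found.")
for p in procs:
    print(" ", p)
```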

Even when they’re stalled, they’ll show how much free memory there was on the machine the last time they updated themselves, so we can check to see if that’s a factor. Then, go into the Slave logs to see if there’s anything obvious:
docs.thinkboxsoftware.com/produ … html#slave
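And a rough sketch for skimming the most recent Slave log for anything obvious. It assumes the default Deadline 10 log folder on Windows and the usual "deadlineslave*" log naming, so adjust the path and pattern if your install differs:

```python
# Rough sketch: scan the newest Slave log for error-looking lines.
from pathlib import Path

LOG_DIR = Path(r"C:\ProgramData\Thinkbox\Deadline10\logs")  # assumed default
KEYWORDS = ("error", "exception", "stalled", "crash")

logs = sorted(LOG_DIR.glob("deadlineslave*.log"),
              key=lambda p: p.stat().st_mtime)
if logs:
    latest = logs[-1]
    print("Scanning", latest)
    for line in latest.read_text(errors="ignore").splitlines():
        if any(k in line.lower() for k in KEYWORDS):
            print(line)
else:
    print("No Slave logs found in", LOG_DIR)
```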

I think we’ve chased down all of the stalls. We haven’t seen them pop up in a day or so. Turns out, IT decided to push out a policy update to all machines (including render boxes) that put them to sleep after 15 minutes of inactivity.
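For anyone else who runs into this, here's a rough sketch of one way to keep the nodes from sleeping on AC power. It's Windows-only, just shells out to powercfg, and needs admin rights; a pushed group policy may still override it, so fixing it at the policy level is the real answer:

```python
# Rough sketch (Windows): disable sleep/hibernate timeouts on AC power for
# the active power plan by shelling out to powercfg (0 = never).
import subprocess

for setting in ("standby-timeout-ac", "hibernate-timeout-ac"):
    subprocess.run(["powercfg", "/change", setting, "0"], check=True)

print("Sleep/hibernate timeouts disabled for the active power plan.")
```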

The ping command would otherwise work; it only failed when the machine was showing as stalled or timed out in the Monitor.

The restart command would sometimes report a failure because it didn't receive a response from the machine. Then, just as that error popped up, the machine would wake its lazy self up.

We also turned off Low Thread Priority in Vray, as that may have been an issue as well. We had it enabled for a long time because we sometimes needed to render on users' workstations, but now that we have a dedicated farm that option is less relevant.

We also chased down a lot of populate data errors, even though I thought that was no longer part of Max 2018. Does it matter if we're rendering with Max 2018 but still have the 2017 and 2016 versions installed?

I’m glad you have that mystery solved.

As far as Max 2016-2019 goes, in my experience the big thing is making sure Backburner is at its latest version. I'm not sure how relevant that is these days, as I believe we found a way to work around needing BB.
