
consistently stalled slave

As you can see it’s just been spinning its wheels for 5 days. The slave looks to be running properly. The system isn’t frozen.

By its own account, it’s still rendering an old job, ‘####_0010_MASTER_B09.01’.

But the Monitor aptly calls it “Stalled” instead of Rendering. The slave still accepts commands, and I can tell it to restart the slave, but it still gets listed as offline or stalled depending on whether I set it to “Offline”. The only thing that seems to get it rendering again after this state is a reboot.

There’s a thread responsible for updating the Slave’s status. I don’t remember what we call it offhand, but “slave info thread” seems right at the moment.

When you restart the Slave, does it at least update its status once? If not, I’ve got to assume there’s something in that code path that’s causing the thread to exit. I’ve actually heard of this problem before, but it’s been a long time (never did track it down). Is there anything in the Slave log when you stop and start the Slave process?
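For anyone reading along, “Stalled” is essentially the Monitor noticing that the Slave’s status record hasn’t been touched for a while, so a dead slave info thread would produce exactly this symptom. A minimal sketch of that kind of check, with made-up names and a made-up threshold (not Deadline’s actual code):

    from datetime import datetime, timedelta

    STALLED_AFTER = timedelta(minutes=10)  # illustrative threshold only

    def effective_status(slave_info):
        # slave_info["last_update"] is the timestamp the slave info thread
        # writes to the database on each pass. If that thread exits, the
        # timestamp goes stale and the Monitor reports "Stalled" even though
        # the Slave process itself is still alive and accepting commands.
        age = datetime.utcnow() - slave_info["last_update"]
        if slave_info["status"] == "Rendering" and age > STALLED_AFTER:
            return "Stalled"
        return slave_info["status"]

In that picture, if the status never updates even once after a Slave restart, the thread is presumably dying very early in its code path, which is what the log should show.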

Hey Gavin, would you be able to kick that whole log over to me in a rar or 7z? It’s edwin.amsler@thinkboxsoftware.com

I should be able to figure out where the heck the slave info thread died (if it did).

Edwin, did you ever find out what was causing this?

Because I’ve got exactly the same thing happening on one of our slaves.
I just realised that it hasn’t actually rendered anything for days.
It doesn’t stall, it just accepts the task, sits there thinking about it for 5 minutes, and then another slave will render the task.
It also doesn’t throw an error.

However, the log file looks the same:

2015-12-04 09:49:16: 0: Plugin rendering frame(s): 970-974
2015-12-04 09:49:16: Scheduler Thread - Task "194_970-974" could not be found because task has been modified:
2015-12-04 09:49:16: current status = Rendering, new status = Rendering
2015-12-04 09:49:16: current slave = schimpanse20, new slave = schimpanse51
2015-12-04 09:49:16: current frames = 970-974, new frames = 970-974
2015-12-04 09:49:16: Scheduler Thread - Cancelling task...
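As far as I can tell, that “task has been modified” message is the Scheduler Thread re-reading the task from the database and finding it no longer matches its local copy; roughly something like this (invented names, just a sketch of the idea, not Deadline’s actual code):

    def task_still_mine(local_task, db_task, my_slave_name):
        # Before carrying on with a task, the slave re-fetches it from the
        # database. If another slave now owns it, or its status or frame
        # range has changed underneath, the local copy is treated as
        # modified and the task is cancelled (the log output above).
        return (db_task["status"] == local_task["status"]
                and db_task["slave"] == my_slave_name
                and db_task["frames"] == local_task["frames"])

In the log above the status and frames still match, but the slave has changed from schimpanse20 to schimpanse51, so schimpanse20 gives the task up.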

This is happening with After Effects jobs right now, but I guess it’s also been doing the same with Max, Nuke and C4D (no time to check right now).

Dave

Could you confirm what exact version of Deadline you’re running? I thought this was fixed ages ago.

v7.2.0.18

From 80 slaves, this is the only one having the problem.

Dave

Update: now the other slaves are starting to do the same.
This is making all of the render jobs take a lot longer, and there aren’t even any errors being generated, so unless you sit and watch the Monitor for long periods of time, you wouldn’t know it was happening.

Hi Mike,

any news on this?
The render farm has spent about half of the weekend just accepting tasks, thinking about it for 5 to 10 minutes, and then passing the task on to another slave…

Cheers,

Dave

Hi Dave,

I was thinking you might be running an older version than 7.2.0.18, as that would explain it; the original thread was back in July against 7.1, IIRC.

Could you drop a note to support@thinkboxsoftware.com, referencing this thread and I’ll get the support team to schedule a remote debug session with you this afternoon.

Just to confirm, in the slave & pulse panel, is the “Version” column displaying exactly the same version of Deadline for all slaves & pulses? Can you run a House Cleaning and Repository Repair from your Monitor -> super-user -> Tools, and if it’s not already enabled, can you enable verbose logging for Pulse and Slave here in your repo config:
docs.thinkboxsoftware.com/produc … ation-data
If you are running Pulse, no harm in giving the application a restart.
Finally, has anything changed recently in your pipeline that might explain this behaviour? Have IT changed anything or deployed any different settings across your machines?

Thanks for the help Mike.

All machines are up to date, and pretty much identical.
The only thing that has happened recently is that we ramped up from 50 to 80 slaves.

I’ve turned on the verbose logging, but I’m going to have to wait before I submit a support ticket as we’re currently completely stressed out and I just don’t have any time to spare.

In any case it’s back to being one machine that does this all the time, and the other machines have a hiccup every now and then and miss a frame or two, but then they are back to normal.

Cheers,

Dave

ok, understood Dave. I’m sorry for your troubles.
Perhaps “disable” this one slave that is causing you issues and leave it a few days, to confirm it’s definitely just this one machine? Perhaps, if you used Auto-Upgrade, it went screwy on this one machine. Either way, it won’t do any harm to manually uninstall the Deadline client software on this one slave and re-install it afresh, to see if that helps. The verbose slave logs on this one machine should hopefully stand out as being different from all the others for some reason.

I just tried the uninstall and re-install, and it made no difference.

I’m going to get IT to copy the original image back onto the slave, and then we’ll see if that makes a difference.

Dave

So, after digging into some previous tickets, my guess is that there may be two Slaves running on the same machine. Because the Slaves have the same name, they show up in the database as the same Slave, so they get confused: each sees itself working on something when it’s actually idle, and then requeues that task.

Can you take a look at the running processes on that machine and see if there are duplicates? This would be really rare on Windows, but worth a look.
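If it helps, a quick way to count them from a script on Windows, assuming the Slave’s image name is deadlineslave.exe (that name is a guess on my part; check Task Manager for what it’s actually called on your install):

    import subprocess

    # List processes whose image name matches the Slave's. The name
    # "deadlineslave.exe" is an assumption -- substitute whatever Task
    # Manager shows for the Slave process on your machines.
    output = subprocess.check_output(
        ["tasklist", "/FI", "IMAGENAME eq deadlineslave.exe"],
        universal_newlines=True,
    )
    matches = [line for line in output.splitlines()
               if line.lower().startswith("deadlineslave")]
    print("%d Slave process(es) found" % len(matches))

Anything other than 1 while the machine is supposed to be rendering would back up the duplicate-Slave theory.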

There was definitely only one slave running at any time; I spent a while yesterday just watching the Task Manager to see if anything strange was happening.

Dave

Okay, so next test. This should be a problem somewhere in Deadline’s house cleaning. Are you running Pulse? If not, can you start up Pulse on some machine so that it will handle house cleaning on behalf of the Slaves, and we’ll watch it?

docs.thinkboxsoftware.com/produc … ning-pulse

Just to be safe, enable verbose logging so we catch everything:

docs.thinkboxsoftware.com/produc … ation-data

We’ll follow along and see what ends up happening. I just remembered that a client who had this problem ended up not having any issues once house cleaning was handled by Pulse, so if that’s the case for you, I’d like to revert and dig a bit deeper so we can solve this one.

Hi Edwin,

we are running pulse, and a couple of days ago I turned on the verbose logging.
At the moment the problem slave is disabled, but as soon as the worst is out of the way I’ll enable it again so we can see what Pulse has to say about the problem.

Cheers,

Dave

Okay, a bit more information:

The slaves accept a task, and then basically spend much too long in the “Starting job” phase without actually starting the job.
While they are just sitting there doing nothing, Pulse does a check for orphaned tasks and reassigns the task to another slave:

Orphaned Task Scan - Requeuing orphaned task '58' for schimpanse61: the task is in the rendering state, but the slave is no longer rendering this task

After this happens, the original slave puts this in the log file:

2015-12-04 09:49:16: Scheduler Thread - Task "194_970-974" could not be found because task has been modified:
2015-12-04 09:49:16: current status = Rendering, new status = Rendering
2015-12-04 09:49:16: current slave = schimpanse20, new slave = schimpanse51
2015-12-04 09:49:16: current frames = 970-974, new frames = 970-974
2015-12-04 09:49:16: Scheduler Thread - Cancelling task...
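Putting the two log excerpts together, the interplay seems to be roughly this; a sketch with invented names and an invented grace period, not Deadline’s actual code:

    import time

    STARTUP_GRACE = 5 * 60  # seconds a task may sit without progress; illustrative only

    def find_orphaned_tasks(tasks, slaves, now=None):
        """Rough sketch of Pulse's orphaned task scan: a task marked as
        Rendering on a slave that is not actually reporting progress on it
        gets requeued so another slave can pick it up."""
        now = time.time() if now is None else now
        orphaned = []
        for task in tasks:
            if task["status"] != "Rendering":
                continue
            slave = slaves[task["slave"]]
            stuck = now - slave["last_progress"] > STARTUP_GRACE
            if slave["current_task"] != task["id"] or stuck:
                orphaned.append(task["id"])
        return orphaned

Once the task has been handed to another slave, the original slave’s “task has been modified … Cancelling task” message is the same check sketched earlier in the thread: it re-reads the task, sees it now belongs to someone else, and gives it up.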

So the real problem has more to do with the fact that the slaves accept a task and then seem to do nothing.

This keeps on happening irregularly, and after a couple of times the slaves seem to get themselves sorted out and start rendering properly again (all except the one machine that doesn’t render at all any more).

Yesterday evening this happened simultaneously on about 30 machines, and I nearly had a heart attack…
It’s just happened again with about 10 machines, but most of them are back to rendering again.

Dave

Sounds good! You can bundle them all together (with 7-Zip or WinRAR) once you have some time and send them into the support system. I’ll take a look and see what I can find out.
