AWS Thinkbox Discussion Forums

Various Issues with slaves

This problem should be fixed for all cases in RC3.

We were able to reproduce the problem in the case of just shutting down the machine manually. We fixed that, and then tested the other cases where this problem occurred before, and everything seems to be working as it should be.

So hopefully RC3 fixes this once and for all. We’ll try to get RC3 out this week.

Cheers,
Ryan

This sounds awesome! I’m so looking forward to this :wink:

Cheers,
Holger

hey Ryan,

please don’t kill the messenger! :wink:

I just upgraded our machines to RC3 and the issue is still there. After the upgrade I just waited for all the machines to be shut down by Pulse. That’s when one of them was marked as ‘Stalled’, just like in the earlier cases. According to the Slave log it was shut down at 19:33, but I noticed that the Monitor showed the last update from this Slave as 19:42. I don’t know whether that’s a hint for anything. Time on all our machines is in sync as we’re using NTP.
I also attached the “Stalled Slave Report” this time, in case it helps.

Cheers,
Holger
Deadline_logs_2014-11-29.zip (113 KB)

Hey Holger,

It feels like we’re playing whack-a-mole here. :slight_smile:

We tracked down another bug that could very well be the source of this problem. This bug would prevent the slave from ever reporting its state again after the first report it makes at startup. This would result in the behavior you’re seeing, since the slave would never mark itself as offline when it closes. Based on the log you posted, the slave DID shut down properly, so it makes sense that this new bug is the reason it was still marked as stalled.

We’re fixing this in RC4, which we will push out this week.

To confirm if you’re seeing this bug, you can watch the Slaves panel in the Monitor to see if there are some slaves that don’t seem to ever update their state (you can watch the “Last Status Update” column).

Cheers,
Ryan

hi Ryan,

yeah, funny game, eh? :wink:

What exactly should I look out for in the ‘Last Status Update’ column? How does the time shown there relate to the Slave’s state being set incorrectly?
It just happened again with 3 Slaves at once after finishing a job. I attached a Monitor screenshot. What can I read from the time there, or what should I actually check to see whether this is the bug that you’re suspecting?
It also seems to have become worse again. I just saw that I got 7 mails about stalled Slaves within around 1.5 hrs.

Cheers,
Holger

Just see if the time it shows there ever changes while the slave is running. In your screenshot, most of the slaves are showing 15:35 or 15:36, but CELL-RS-12 is still showing 15:31, which means it hasn’t updated its state in 5 minutes compared with the other slaves. I bet if you manually closed this Slave, it would still appear as idle after the Slave application has closed. Or, if you left it running, it would eventually get marked as stalled (like CELL-RS-18, CELL-RS-20, and CELL-RS-22).
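
As a rough illustration of that check, here is a minimal Python sketch that compares each slave's "Last Status Update" time against the newest one seen and flags the slaves that lag behind. The times are typed in by hand to mirror the description above; CELL-RS-01 and CELL-RS-02 are made-up placeholder names, and the `find_lagging_slaves` helper, its 3-minute threshold, and the date are assumptions for illustration only, not part of Deadline:

```python
from datetime import datetime, timedelta


def find_lagging_slaves(last_updates, max_lag=timedelta(minutes=3)):
    """Return names of slaves whose last status update trails the newest one.

    last_updates maps a slave name to the datetime shown in the Monitor's
    "Last Status Update" column. max_lag is an arbitrary threshold chosen
    for this sketch, not a Deadline setting.
    """
    newest = max(last_updates.values())
    return sorted(
        name for name, stamp in last_updates.items()
        if newest - stamp > max_lag
    )


def at(hhmm):
    """Parse an HH:MM string on a fixed, arbitrary date."""
    return datetime.strptime("2014-12-01 " + hhmm, "%Y-%m-%d %H:%M")


# Hypothetical values modelled on the screenshot described above.
updates = {
    "CELL-RS-01": at("15:35"),
    "CELL-RS-02": at("15:36"),
    "CELL-RS-12": at("15:31"),
}

lagging = find_lagging_slaves(updates)
print(lagging)
```

A slave that keeps showing up in this list while it is actually running is a likely candidate for the state-reporting bug.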

Cheers,
Ryan

Indeed. When i just checked cell-rs-12 it was running and rendering fine. At this very moment (15:56) it’s listed as ‘Offline’ although it’s actually ‘Idle’. Some more were just set to ‘stalled’.
This feels really like things are becoming worse and worse right now.
Any chance to get RC4 today? :wink:

Cheers,
Holger

Yeah, this was a pretty nasty regression, and with the testing we did with the slave prior to releasing RC3, I really don’t know how this wasn’t detected.

Wednesday should be the latest we get RC4 released, so it might be best to simply downgrade to RC2 until we can get RC4 out. I think we’ll even remove the download link for RC3 because of this regression.

Cheers,
Ryan

I see. Maybe you can give me a quick update later today on whether it will be Wednesday or maybe earlier, in case you are able to estimate it by the end of the day. Our day here is almost over, so if you think it’s going to be out tomorrow I might just skip the downgrade.

Cheers,
Holger

Just chatted with a few people here, and it will more than likely be Wednesday, so it’s in your best interest to downgrade so that you can get your renders done tomorrow.

We’re really sorry for this inconvenience.

Alright. Don’t worry. All fine. Will downgrade tonight then.
Cheers,
Holger

Just an FYI that Deadline 7 RC4 has been released!
viewtopic.php?f=204&t=12744

Cheers,
Ryan

Alright. Will install it later today and report back.
Thanks!

Holger

Hey Ryan,

just wanted to let you know that we’ve been running RC4 for 24hrs now and so far the issue didn’t show up again. knocking on wood :wink:

Cheers,
Holger

Thanks for the update! Fingers crossed… :slight_smile:

hmmm… so far everything went fine. At 17:17 I got 6 mails about stalled Slaves. But when I checked a few minutes later (I didn’t see the mails immediately) they were either offline or rendering. So basically all was fine. I guess this is maybe due to the machines being so occupied by the rendering that they sometimes don’t update their status properly? Could that be countered by increasing the limit for Slaves being marked as stalled in the Slave settings?
This actually brings me to another question: when we were still rendering with Backburner - i.e. “The Dark Age” :wink: - we set it to start with the lowest priority possible, ‘Low’ instead of ‘Below normal’. This helped dramatically: we could still access the machines through RDP without any issues, and processes running in parallel to the render were much more responsive and smooth. This is mostly down to V-Ray, as it likes to eat CPUs for breakfast. Deadline Slave seems to run on ‘below normal’ by default, even though we start it through a wrapper .bat that launches it with priority ‘low’. We also noticed slightly worse behaviour when e.g. making RDP connections to machines that are currently rendering with Deadline, compared to when they were rendering on ‘low’ priority with the old setup. Is there any way to set the priority for the Deadline Slave process to ‘low’, and thus also for all the processes it launches?

Cheers,
Holger

Hi Holger,

The individual plugins have a self.ProcessPriority property, where you can set the priority you would like. By default it’s BelowNormal, as you mentioned, for most of the plugins.

Hope this helps,
laszlo

hey Laszlo,

I see. Am I right in assuming that this is nothing I can set in the “Repository Options” through the Monitor, but only in the plugin’s .py file? So how can I keep this setting through a Repository update? Won’t it be overwritten again then?

Cheers,
Holger

That’s correct. You need to set it at the plugin level via code, and you’ll have to remember to re-apply this change every time you upgrade the Repository.
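
Since the change has to be re-applied after each upgrade, a small script can take the manual step out of it. The following is only a sketch under assumptions: it assumes the plugin's .py file contains an assignment of the form `self.ProcessPriority = ProcessPriorityClass.BelowNormal` (the exact text in a given plugin may differ), and the `set_process_priority` helper and the sample excerpt are made up for illustration. It relies on "Idle" being the .NET `ProcessPriorityClass` name for what Windows Task Manager shows as "Low":

```python
import re

# Assumption: the plugin's .py file sets the priority with a line like this.
# The exact text in a given plugin may differ, so check before relying on it.
PRIORITY_PATTERN = re.compile(
    r"(self\.ProcessPriority\s*=\s*ProcessPriorityClass\.)(\w+)"
)


def set_process_priority(source, priority="Idle"):
    """Return (patched source, number of assignments rewritten).

    "Idle" is the .NET ProcessPriorityClass name for what Windows Task
    Manager displays as "Low".
    """
    return PRIORITY_PATTERN.subn(lambda m: m.group(1) + priority, source)


# Hypothetical excerpt standing in for a real plugin file.
sample = (
    "def InitializeProcess(self):\n"
    "    self.ProcessPriority = ProcessPriorityClass.BelowNormal\n"
)

patched, count = set_process_priority(sample)
print(patched)
```

To use it on a real plugin, you would read the file's text, pass it through `set_process_priority`, and write the result back after keeping a backup copy. A count of 0 means the expected assignment was not found and the file needs a manual look.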

I’m not sure about the idea of having a global override in the Repository Options (one size doesn’t always fit all), but maybe we could support overrides for some of these settings in the dlinit file. This file doesn’t get overwritten during an upgrade, it’s simply “merged” with the new version. That’s why your render executable paths, for example, don’t get wiped when doing an upgrade. We could have overrides for process priority, popup handling, stdout handling, etc. Although there are some plugins that run more than one process, so this might not be an ideal solution for those types of plugins…

Hmm… not sure what the best solution would be here.

Cheers,
Ryan

Could you use the “custom” plugin directory to run your own ‘custom’ version of one of our plugins? You could upgrade the repository whenever you like and your custom plugin would still be used; you could then carry out a diff/merge if you wish to update it with the latest shipping Thinkbox version. (We don’t overwrite any custom versions during a repo upgrade, although we do back up this directory to be helpful.)
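
For the diff step, Python's standard `difflib` module is one option. This is a minimal sketch, not a Thinkbox tool: the `plugin_diff` helper is made up for illustration, and the two one-line strings are hypothetical stand-ins for the text of the shipped plugin file and the custom copy:

```python
import difflib


def plugin_diff(custom_text, shipped_text):
    """Return a unified diff from the shipped plugin to the custom copy."""
    return "".join(
        difflib.unified_diff(
            shipped_text.splitlines(keepends=True),
            custom_text.splitlines(keepends=True),
            fromfile="shipped",
            tofile="custom",
        )
    )


# Hypothetical one-line stand-ins for the two plugin files.
shipped = "self.ProcessPriority = ProcessPriorityClass.BelowNormal\n"
custom = "self.ProcessPriority = ProcessPriorityClass.Idle\n"

diff = plugin_diff(custom, shipped)
print(diff)
```

An empty result means the custom copy matches the shipped version; anything else lists the local changes that would need to be carried over after an upgrade.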
