
Deadline 8.0.7.3 Slaves Turning Off

Hello,

So the second log you sent is just a simple stalled slave report. It means the slave did not update the database within 10 minutes of its last report to let the database know what was happening. This often means the slave crashed or became disconnected from the network.

As for the first one, we would need to see the full job report in order to better answer that question, as the error is too general on its own.

So about 2/3 of the way through the animation all of the machines “Stalled” and only gave me the same type of report I posted yesterday. Where else should I look for other reasons why they turned themselves “offline”? Is this a permission issue with the repository sitting out on the network? The only real thing different in these animations is that we are playing with Chaos Phoenix, and for each frame it has to go out and grab the calculated data from the network. Can this be affecting it?

PS: This round of the slaves stalling and turning themselves off, we didn’t get any of the other errors shown in the attached image.

Thanks,
Austin Reed

Is there another fix so that, after these machines “stall”, I can still turn the slaves back on via Deadline? Whenever we try to enable them via Deadline we get this error:

Machine Name: KCHW785
Command: LaunchSlave
Time Stamp: 2016/08/23 09:41:01
Status: Failed
Results: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 10.1.70.85:17080

PS: I have the firewall open for 17080 on all of the machines as well… We have some deadlines approaching, and any help would be appreciated in trying to keep our render farm from “stalling”.
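For reference, a quick way to confirm that the port is actually reachable from another machine is a simple socket test. This is only a rough Python sketch (not a Deadline tool), using the IP and port from the error above:

import socket

# IP and port taken from the error message above; adjust to the machine being tested.
HOST = "10.1.70.85"
PORT = 17080

try:
    # Try to open a TCP connection with a short timeout.
    with socket.create_connection((HOST, PORT), timeout=5):
        print(f"Port {PORT} on {HOST} is reachable.")
except OSError as err:
    # A timeout or refusal here points at a firewall, routing, or a Launcher that is not listening.
    print(f"Could not reach {HOST}:{PORT} -> {err}")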

Thanks,
Austin Reed

Hello,

So slaves get marked as stalled when the time since their last update is greater than the limit set in the Repository Options. What I would like to do is turn on verbose slave logging in the Repository Options so that we can see when the slave stops updating the database, and whether there is anything we can point to as the cause of the info thread dying.

Also, next time these machines stall, can you remote into one and see if deadlinelauncher and deadlineslave are still running? I am curious if the apps are dying, which would prevent them from accepting the start slave remote command. The launcher is the application that accepts the command, so we want to make sure that it is running to accept those commands.

Cheers,

Dwight

Dwight,

I turned on the verbose slave logging in the repository and have submitted a render to the farm for the evening. I will remote in from home tonight to make sure it’s still rendering. If it fails, where do I find the verbose log? I will also remote into the slaves if they “stall” and see if the slave and launcher are still running. The only thing I see new from 7 to 8 is this “House Cleaning” procedure, and it seems to be doing its thing shortly after each frame is completed. Is this correct?

Austin Reed

Dwight,

So we submitted two new jobs and re-queued two jobs that threw 40+ stalled errors, and last night everything rendered without throwing a single error. We will continue to hammer the servers and see if we can get it to act up again throughout the day. Thanks again for the suggestions and help, and I will keep you posted.

Thanks,
Austin Reed

For sure, let me know what you find.

Dwight,

Well the slaves turned themselves off again. Here is the log:

https://www.dropbox.com/s/xh1mx5lcpcl393u/deadlineslave-KCHW701-2016-08-25-0000.log?dl=0

Thanks,
Austin Reed

That log shows the Slave happily shutting down. For that one in particular, I believe the application was closed as you would close any other application.

Here’s an example from my Slave log when I closed it using the ‘x’ in the title bar:

2016-08-29 16:28:53:  Triggering house cleaning events...
2016-08-29 16:29:09:  Info Thread - requesting slave info thread quit.
2016-08-29 16:29:10:  Scheduler Thread - exception occurred:
2016-08-29 16:29:10:  Info Thread - shutdown complete
2016-08-29 16:29:11:  Scheduler Thread - shutdown complete
2016-08-29 16:29:29:  Slave - Final cleanup
2016-08-29 16:29:38:  Slave - Shutdown complete

I need to see where the “Slave - slave shutdown: normal” is coming from. I’m tempted to say the Launcher connected to shut it down, but that should show a connection message.

Update: “shutdown: normal” is from the Slave’s own Shutdown() function. Just for the sake of checking, can you send along the Launcher log from about the same time on the same machine? If it was an idle shutdown, it should show Pulse connecting in and asking the Slave to exit.

We have had the slaves’ DeadlineLauncher closing for a while under both Deadline 7 and 8; this is the essential program to keep running for most Deadline functions.

It seems to happen after particularly heavy resource renders.

We found the best way to get around it is to use a program to monitor whether DeadlineLauncher is still running. We use Restart on Crash (http://w-shadow.com/blog/2009/03/04/restart-on-crash/), which has been very reliable and completely fixed the issue, as well as the related power management problem, i.e. slaves not shutting down after the idle period because they’re not listening for the command.

Alternatively, if you run DeadlineLauncher as a service, you can do the same in the service configuration.
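If you’d rather not install a third-party tool, a tiny watchdog along the same lines is easy to put together. This is only an illustrative Python sketch; the process name and Launcher path are assumptions about a default Windows install of Deadline 8, so adjust them for your setup:

import subprocess
import time

# Assumptions about a default Windows install of Deadline 8; adjust both for your setup.
LAUNCHER_EXE = "deadlinelauncher.exe"
LAUNCHER_PATH = r"C:\Program Files\Thinkbox\Deadline8\bin\deadlinelauncher.exe"
CHECK_INTERVAL = 60  # seconds between checks

def launcher_is_running():
    # 'tasklist' ships with Windows; /FI filters the process list by image name.
    result = subprocess.run(
        ["tasklist", "/FI", "IMAGENAME eq " + LAUNCHER_EXE],
        capture_output=True, text=True)
    return LAUNCHER_EXE.lower() in result.stdout.lower()

while True:
    if not launcher_is_running():
        print("DeadlineLauncher not running, starting it again...")
        # Launch it again without blocking the watchdog loop.
        subprocess.Popen([LAUNCHER_PATH])
    time.sleep(CHECK_INTERVAL)

Restart on Crash does essentially the same job with a GUI and a bit more polish; the point is just that something needs to notice when the Launcher dies.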

So, this application might be what’s causing the Slave to close gracefully because the Slave has hung?

Not sure about that?

The slave can run without the Deadline Launcher running; it’s just that commands sent to the machine won’t be received without it (start/stop slave, shutdown machine, etc.).

Hmm. Let’s take a step back here. You said that “the slaves turned themselves off again”. Here’s the info we’ve got so far:

  1. From the logs, it looks like they went through their usual shutdown process as though they were asked to close (no crash here)
  2. Launcher isn’t being run? If that’s the case, that rules out the major components that would normally shut down the Slave.
  3. Perhaps the Slave has been closed by that app you linked that closes hung processes?

I think we may have got our wires crossed here. I’m merely suggesting an app to fix the issue of the Deadline Launcher crashing, which prevents the command to start/restart the slave from being received, as per the problem indicated by rheinspiel above…

The Restart on Crash app is purely designed to monitor whether the Deadline Launcher has crashed and to restart it if required.

Critically, the reason we need to check that the Deadline Launcher is still running is that without it we lose the slave start/stop/restart capability, along with any other control such as the power management processes from Pulse (machine shutdown/restart).

The Deadline Launcher has a tendency to crash after heavy renders, hence we installed the simple app to check and restart it if required.

The app has no effect on or control of the slaves; however, I guess you could also put DeadlineSlave on its watch list as well.

Apologies if I’ve confused the thread :slight_smile:

Edwin,

Sorry for the lack of response; I was trying to get a deadline completed yesterday before I could get back to solving this. I have uploaded the deadlinelauncher logs from the day before, the day of, and the day after the deadlineslave report I sent, and they can be found here: https://www.dropbox.com/sh/y2dw56ij4y098bw/AAD0LKzjmW9GO2z1imFfOXs7a?dl=0. We are not manually shutting down the nodes, so it does seem that after some heavy rendering something is causing them to time out and run the shutdown function. Do I need to look in the repository settings for some timeout configuration that I might have overlooked when moving to Deadline 8?

@TSmithf: I will look at using that software to restart the Deadline Launcher if it shuts down, but I’m hoping we can figure out what is actually causing this.

Thanks for all the help,

Austin Reed

@tsmithf: Heh! Yup, I got myself confused :laughing:

As for the Launcher logs, they aren’t showing much in the way of communication with the Slave. I’ll see if someone from core can take a look and provide some guidance.

Also, could you check your idle shutdown settings in power management?

So, I’ve bounced the idea back and forth with a member of our dev team, and we’re both expecting logs explaining that things are shutting down. It’s certainly not a crash here…

Would there be any way to have a call? I’m here in Winnipeg (Central Canada) and I’m in from about 9-5 every day. Is there some time that might work out so we can take a look first-hand?

We might have finally gotten to the bottom of the machines “turning off” issue. Apparently we had issues with individuals remoting into machines and forgetting to log off, so our IT department redid some group policies that forced machines to log off after several hours of “idle” use. They didn’t advise me of this change as they didn’t think it would affect the render farm, but obviously it did. We took our render node user account out of this new group policy late Friday and no machines shut down over the long holiday weekend! I am pretty confident that this resolved the “turning off” issue. Thanks again for all of the time you guys spent trying to help us out.

Cheers,
Austin Reed

I’m going to be laughing about this for probably the rest of the week. Who would have thought! :laughing:

Well, I’m glad we got to the bottom of it :slight_smile:

Guys, I just posted a new thread as I didn’t realise I had the same issue as this. Pretty much the same scenario:
slaves stall and won’t accept commands from the Monitor as the Launcher isn’t running, so I have to manually log on, stop the DeadlineService, kill the Deadline processes that are running, and restart the service.
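In case it helps anyone else, those manual recovery steps can be scripted. This is only a rough Python sketch; the service and process names are assumptions and vary by install, so confirm yours in services.msc and Task Manager first (it also needs to run as administrator):

import subprocess

# Service and process names are assumptions and vary by install;
# confirm them in services.msc and Task Manager before using.
SERVICE_NAME = "DeadlineService"
PROCESSES = ["deadlineslave.exe", "deadlinelauncher.exe"]

# Stop the service (ignore the error if it is already stopped).
subprocess.run(["net", "stop", SERVICE_NAME], check=False)

# Force-kill any leftover Deadline processes.
for proc in PROCESSES:
    subprocess.run(["taskkill", "/F", "/IM", proc], check=False)

# Start the service again.
subprocess.run(["net", "start", SERVICE_NAME], check=False)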
