Deadline 8.0.7.3 Slaves Turning Off

We just recently switched to Deadline 8 and are having issues with our slaves turning off; we then have to physically log into the machines to reactivate the slave, as it ignores restart-slave commands from Deadline. I have activated "restart slaves on stall". Could they be timing out? We are not getting any errors saying why they shut down.

Thoughts?
Austin Reed

Do the slave logs have any indication why the process is exiting? If not, is there any OS level logging of the process crashing? Any closing of the app, even a time out, should be recorded in the slave log.

The slave is showing two errors trying to run 3ds Max 2016

  1. Error: RenderTask: Unexpected exception (Monitored managed process “3dsmaxProcess” has exited or been terminated).

  2. STALLED SLAVE REPORT

Current House Cleaner Information
Machine Performing Cleanup: KCHD012
Version: v8.0.7.3 Release (f33fcb7d3)

Stalled Slave: KCHW722
Slave Version: v8.0.7.3 Release (f33fcb7d3)
Last Slave Update: 2016-08-18 00:58:56
Current Time: 2016-08-18 01:09:38
Time Difference: 10.713 m
Maximum Time Allowed Between Updates: 10.000 m

Current Job Name: MS_RV_DV
Current Job ID: 57b4cd77742fb1168cb1db9f
Current Job User: dsanchez
Current Task Names: 247
Current Task Ids: 247

Searching for job with id “57b4cd77742fb1168cb1db9f”
Found possible job: MS_RV_DV
Searching for task with id “247”
Found possible task: 247:[247-247]
Task’s current slave: KCHW722
Slave machine names match, stopping search
Associated Job Found: MS_RV_DV
Job User: dsanchez
Submission Machine: KCHD276
Submit Time: 08/17/2016 15:47:50
Associated Task Found: 247:[247-247]
Task’s current slave: KCHW722
Task is still rendering, attempting to fix situation.
Requeuing task
Setting slave’s status to Stalled.
Setting last update time to now.

Slave state updated.

Is it possible that since we didn’t update her submission scripts there is a discrepancy going on?

PS: We have these same errors showing up on the other 40 machines

Thanks,

Austin Reed

Hello,

So the second log you sent is just a simple stalled slave report. It means the slave did not update the database within 10 minutes of its last report to let the database know what was happening. This often means the slave crashed or became disconnected from the network.
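
To illustrate the mechanics, here is a minimal sketch (not Deadline's actual code) of the stall check the house cleaner performs, using the timestamps from the report above; the ten-minute limit corresponds to the "Maximum Time Allowed Between Updates" repository option:

from datetime import datetime, timedelta

# Values taken from the stalled slave report above.
last_slave_update = datetime(2016, 8, 18, 0, 58, 56)   # "Last Slave Update"
current_time = datetime(2016, 8, 18, 1, 9, 38)          # "Current Time"
max_allowed = timedelta(minutes=10)                      # repository stall limit

time_difference = current_time - last_slave_update
if time_difference > max_allowed:
    # The slave has not checked in recently enough: it gets marked stalled,
    # its task is requeued, and any "restart on stall" action can fire.
    print("Stalled by %.3f minutes" % (time_difference.total_seconds() / 60.0))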

As for the first one, we would need to see the full job report in order to better answer on that issue, as this is too general an error.

So about 2/3 of the way through the animation all of the machines "stalled" and only gave me the same type of report I posted yesterday. Where else should I look for other reasons why they turned themselves "offline"? Is this a permission issue with the repository sitting out on the network? The only real thing different in these animations is that we are playing with Chaos Phoenix, and each frame has to go out and grab the calculated data from the network. Could this be affecting it?

PS: This round of the slaves stalling and turning themselves off, we didn’t get any other errors, as shown in the attached image.

Thanks,
Austin Reed

Is there another fix so that after these machines "stall" I can still turn the slaves back on via Deadline? Whenever we try to enable them via Deadline we get this error:

Machine Name: KCHW785
Command: LaunchSlave ~
Time Stamp: 2016/08/23 09:41:01
Status: Failed
Results: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 10.1.70.85:17080

PS: I have the firewall open for 17080 on all of the machines as well… We have some deadlines approaching, so any help in keeping our render farm from "stalling" would be appreciated.
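
A quick way to confirm whether the port is actually reachable from the machine issuing the command, separate from whether the Launcher process is running, is a plain socket test. This is a minimal sketch; the address and port below are taken from the error message above and should be adjusted as needed:

import socket

# Reachability check for the Deadline Launcher's listening port on a render node.
host, port = "10.1.70.85", 17080

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.settimeout(5.0)
try:
    sock.connect((host, port))
    print("Port %d is reachable - something is listening." % port)
except (socket.timeout, socket.error) as exc:
    # Either a firewall is blocking the port or deadlinelauncher has exited.
    print("Could not connect to %s:%d - %s" % (host, port, exc))
finally:
    sock.close()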

Thanks,
Austin Reed

Hello,

So slaves get marked as stalled when their time since last update is greater than the limit set in the repository. What I would like to do is turn on verbose slave logging in the repository options so that we can see when the slave stops updating the database, and whether there is something we can point to as the cause of the info thread dying.

Also, next time these machines stall, can you remote into one and see if deadlinelauncher and deadlineslave are still running? I am curious whether the apps are dying, which would prevent them from accepting the remote "start slave" command. The launcher is the application that accepts that command, so we want to make sure it is running.
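
If it helps, here is a minimal sketch of that check on a Windows node using the built-in tasklist command; the process names below are the usual Windows binary names (deadlinelauncher.exe and deadlineslave.exe), but verify them in Task Manager on your install:

import subprocess

# Report whether the Deadline Launcher and Slave processes are still alive.
for process in ("deadlinelauncher.exe", "deadlineslave.exe"):
    output = subprocess.check_output(
        ["tasklist", "/FI", "IMAGENAME eq %s" % process]
    ).decode("utf-8", "replace")
    state = "running" if process.lower() in output.lower() else "NOT running"
    print("%s is %s" % (process, state))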

Cheers,

Dwight

Dwight,

I turned on verbose slave logging in the repository and have submitted a render to the farm for the evening. I will remote in from home tonight to make sure it's still rendering. If it fails, where do I find the verbose log? I will also remote into the slaves if they "stall" and see if the slave and launcher are still running. The only thing I see new from 7 to 8 is this "House Cleaning" procedure, and it seems to be doing its thing shortly after each frame is completed. Is this correct?

Austin Reed

Dwight,

So we submitted two new jobs and re-queued two jobs that had thrown 40+ stalled errors, and last night everything rendered without throwing a single error. We will continue to hammer the servers throughout the day and see if we can get it to act up again. Thanks again for the suggestions and help; I will keep you posted.

Thanks,
Austin Reed

For sure, let me know what you find.

Dwight,

Well the slaves turned themselves off again. Here is the log:

https://www.dropbox.com/s/xh1mx5lcpcl393u/deadlineslave-KCHW701-2016-08-25-0000.log?dl=0

Thanks,
Austin Reed

That log shows the Slave happily shutting down. That one in particular, I believe, shows the application was closed as you would close any other application.

Here’s an example from my Slave log when I closed it using the ‘x’ in the title bar:

2016-08-29 16:28:53:  Triggering house cleaning events...
2016-08-29 16:29:09:  Info Thread - requesting slave info thread quit.
2016-08-29 16:29:10:  Scheduler Thread - exception occurred:
2016-08-29 16:29:10:  Info Thread - shutdown complete
2016-08-29 16:29:11:  Scheduler Thread - shutdown complete
2016-08-29 16:29:29:  Slave - Final cleanup
2016-08-29 16:29:38:  Slave - Shutdown complete

I need to see where the “Slave - slave shutdown: normal” is coming from. I’m tempted to say the Launcher connected to shut it down, but that should show a connection message.

Update: “shutdown: normal” is from the Slave’s own Shutdown() function. Just for the sake of checking, can you send along the Launcher log from about the same time on the same machine? If it was an idle shutdown, it should show Pulse connecting in and asking the Slave to exit.
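
In the meantime, a quick way to sift the Launcher log for anything shutdown-related is to scan it for a few keywords. This is a minimal sketch; the log path and keyword list are assumptions you would adjust for your install (Launcher logs normally sit alongside the Slave logs in Deadline's local "logs" folder):

import io

log_path = r"C:\ProgramData\Thinkbox\Deadline8\logs\deadlinelauncher-KCHW701-2016-08-25-0000.log"  # assumed path
keywords = ("shutdown", "idle", "pulse", "connection", "exit")

with io.open(log_path, "r", encoding="utf-8", errors="replace") as log:
    for line in log:
        if any(keyword in line.lower() for keyword in keywords):
            print(line.rstrip())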

We have had the slaves' DeadlineLauncher closing for a while under both Deadline 7 and 8; this is the essential program to keep running for most Deadline functions.

It seems to happen after particularly heavy resource renders.

We found the best way to get around it is to use a program to monitor whether DeadlineLauncher is still running. We use Restart on Crash (http://w-shadow.com/blog/2009/03/04/restart-on-crash/), which has been very reliable and completely fixed the issue, as well as the related power management problem, i.e. slaves not shutting down after the idle period because they're not listening for the command.

Alternatively, if you run DeadlineLauncher as a service, you can do the same in the service configuration.
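
For anyone who would rather not install a third-party tool, a home-grown watchdog in the same spirit is only a few lines. This is a minimal sketch, and the executable path is an assumption based on a default Windows install that you would adjust for your farm:

import subprocess
import time

LAUNCHER_EXE = "deadlinelauncher.exe"
LAUNCHER_PATH = r"C:\Program Files\Thinkbox\Deadline8\bin\deadlinelauncher.exe"  # assumed install path

def launcher_running():
    # tasklist echoes the image name when the process exists.
    output = subprocess.check_output(
        ["tasklist", "/FI", "IMAGENAME eq %s" % LAUNCHER_EXE]
    ).decode("utf-8", "replace")
    return LAUNCHER_EXE.lower() in output.lower()

while True:
    if not launcher_running():
        print("Launcher not found - restarting it.")
        subprocess.Popen([LAUNCHER_PATH])
    time.sleep(60)  # poll once a minute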

So, this application might be what’s causing the Slave to close gracefully because the Slave has hung?

Not sure about that?

The slave can run without the Deadline Launcher running; it's just that commands sent to the machine (start/stop slave, shutdown machine, etc.) won't be received without the Deadline Launcher running.

Hmm. Let’s take a step back here. You said that “the slaves turned themselves off again”. Here’s the info we’ve got so far:

  1. From the logs, it looks like they went through their usual shutdown process as though they were asked to close (no crash here)
  2. Launcher isn’t being run? If that’s the case, that rules out the major components that would normally shut down the Slave.
  3. Perhaps the Slaves have been closed by that app you linked that would close hung processes?

I think we may have got our wires crossed here; I'm merely suggesting an app to fix the issue of the Deadline Launcher crashing and hence preventing the start/restart slave command from being received, as per the problem indicated by rheinspiel above…

The restart on crash app is purely designed to monitor if the Deadline Launcher has crashed and restart it if required.

Critically, the reason we need to check if the Deadline Launcher is still running is that we lose the slave start/stop/restart capability without it, as well as any other control such as the power management processes from Pulse (machine shutdown/restart).

The Deadline Launcher has a tendency to crash after heavy renders, hence we installed the simple app to check and restart it if required.

The app has no effect on or control over the slaves; however, I guess you could also put DeadlineSlave on its watch list as well.

Apologies if I’ve confused the thread :slight_smile:

Edwin,

Sorry for the lack of response; I was trying to get a deadline completed yesterday before I could get back to solving this. I have put up the deadlinelauncher logs from the day before, the day of, and the day after the deadlineslave report I sent; they can be found here: https://www.dropbox.com/sh/y2dw56ij4y098bw/AAD0LKzjmW9GO2z1imFfOXs7a?dl=0. We are not manually shutting down the nodes, so it does seem that after some heavy rendering something is causing it to time out and run the shutdown function. Do I need to look in the repository settings for some timeout configuration that I might have overlooked when moving to Deadline 8?

@TSmithf: I will look at using that software to restart the Deadline Launcher if it shuts down, but I'm hoping we can figure out what is actually causing this in the first place.

Thanks for all the help,

Austin Reed

@tsmithf: Heh! Yup, I got myself confused :laughing:

As for the Launcher logs, they aren’t showing much in the way of communication with the Slave. I’ll see if someone from core can take a look and provide some guidance.

Also, could you check your idle shutdown settings in power management?

So, I’ve bounced the idea back and forth with a member of our dev team, and we’re both expecting logs explaining that things are shutting down. It’s certainly not a crash here…

Would there be any way to have a call? I’m here in Winnipeg (Central Canada) and I’m in from about 9-5 every day. Is there some time that might work out so we can take a look first-hand?
