Various Issues with slaves

celluloidvfx · October 7, 2014, 2:41pm

Hi guys,

not sure whether this is actually a Deadline7 beta issue or a mistake on my behalf.
As they are all somehow related to the slave functionality i also put them in this single thread. Hope that’s ok.
1.) There are a few machines that just don’t wake up. I’m not sure what could be the reason here. They belong to the correct pools and groups, just like the other machines/slaves that work without issues. I also checked the MAC addresses registered in Deadline. They’re also correct.
2.) Many times when Deadline shuts down the machines it no longer needs, it marks some Slaves as ‘Stalling’ even though the machine(s) powered off quite a while ago.
3.) Today we had a situation where an artist’s machine had the Deadline Launcher running while he was working. All of a sudden the Slave on that machine launched, even though the option “Launch Slave when Launcher starts” was switched off. Even using “Only start if Launcher is not running as these users” with the currently logged-in user did not help.

As i wrote above i’m not sure whether those could be bugs or our inexperience with Deadline in general. Because of that i’m also not really sure where to look for any wrong settings. I’m sure you probably need some more information from my side - i just don’t know what’s important in those cases.

cheers,
Holger

rrussell · October 7, 2014, 3:02pm

Hey Holger,

I’ve included responses to the individual issues below. First question though: which operating system(s) are your slaves running on?

1.) The Pulse log should show if the “Wake Up” message is being sent to these machines. If it is, the problem could be with the machines themselves (ie: WakeOnLan disabled), or the way their network is set up (ie: they’re on a separate switch and the “Wake Up” message doesn’t make it to them). A quick way to test if the Pulse machine can wake up the machines is to run the Deadline Monitor on the Pulse machine, and manually send the wake up messages by right-clicking on the slaves in the slave list while in super user mode and selecting Remote Control -> Machine Commands -> Start Machine.
2.) Sounds like the slaves aren’t closing gracefully, and therefore aren’t marking themselves offline before they exit. Can you post a log from a slave that this happened to? The log would need to be from the session where the slave didn’t close cleanly.
3.) Was this machine in a Power Management group that was meant to be woken up? Currently, Pulse sends both a WOL message and a “Start Slave” command to slaves that are offline. We’re actually adding a new Machine Wake Up option in beta 5 to disable sending the “Start Slave” command. Also, the launcher log on the machine will probably show what triggered it to start the slave.
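
For the first point, it can also help to rule out the network path independently of Deadline by sending a Wake-on-LAN magic packet by hand. Below is a minimal Python sketch of the standard magic packet format (6 bytes of 0xFF followed by the target MAC address repeated 16 times). It is not how Pulse or the Monitor send their wake up message internally; the function names, the broadcast address, and UDP port 9 are assumptions chosen for illustration.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Return a Wake-on-LAN magic packet: 6 bytes of 0xFF, then the MAC repeated 16 times."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"expected 12 hex digits in MAC address, got {mac!r}")
    # bytes.fromhex raises ValueError if the string contains non-hex characters.
    return b"\xff" * 6 + bytes.fromhex(hex_digits) * 16


def send_wol(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the magic packet over UDP (port 9 is a common choice, not a requirement)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # Broadcasting must be enabled explicitly on the socket.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(build_magic_packet(mac), (broadcast, port))
```

Calling `send_wol("00:11:22:33:44:55")` from the Pulse machine (the MAC here is a placeholder, use the one registered for the slave) should wake a machine that has WakeOnLan enabled and sits on the same broadcast domain.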

To find logs, you can use Help -> Explore Log Folder from any of the Deadline applications, or from the Launcher menu.

Thanks!
Ryan

celluloidvfx · October 7, 2014, 4:25pm

hi Ryan,

we’re on Windows 7 only at the moment.

1.) WakeOnLan is definitely enabled on all those machines and also works properly. I can say for sure as we still have our current farm management software running and we’re able to wake those machines just as we have always been doing.
I also just did what you suggested and tried waking one of those machines through Deadline Monitor. This also was successful.

2.) I will keep an eye on this and at the next occurrence of this problem i’ll grab the latest log file and post it. I’m just not 100% sure anymore which machine(s) were the problematic ones, so i want to make sure the logs i send you really are from the machines that had the issue.

3.) I think your suggestion is right and the combination of a WOL and a “Start Slave” command is the reason here. We just checked the log and the machine did indeed receive the command to start the slave. But shouldn’t the “Start Slave” command still be ignored by the Launcher, as we had “Only start if Launcher is not running as these users” set to the artist’s username? In any case the option coming in beta5 sounds like a very reasonable addition.

cheers,
Holger

rrussell · October 7, 2014, 5:00pm

Thanks for the info!

1.) Definitely check the Pulse logs then. Since it works from the Monitor when it’s running on the Pulse machine, we know that Pulse itself should have no issues waking up the machine (they share the code for this). So it’s probably a case of Pulse not sending the wake up command in the first place.
2.) Sounds good.
3.) The “Only start if Launcher is not running as these users” is strictly for the idle detection feature. It doesn’t prevent the slave from being started up in other situations (ie: the artist could still launch the slave manually if they wanted to).

Cheers,
Ryan

celluloidvfx · October 7, 2014, 10:08pm

Can you tell me what to look for in the Pulse logs? I did a brief search for the name of one of the machines and the words “startup” and “wake” but didn’t find anything related to trying to wake up those machines.
I found this in the Pulse log:

2014-10-07 13:41:44: Slave 'CELL-RS-22' has stalled because it has not updated its state in 10.283 m. Performing house cleaning...
2014-10-07 13:41:44: Could not find associated job for this slave.
2014-10-07 13:41:44: Cannot send notification because the Primary SMTP Server has not been configured in the Repository Options.
2014-10-07 13:41:44: Stalled Slave Scan - Cleaned up 2 stalled slaves in 1.969 s
2014-10-07 13:41:44: Stalled Slave Scan - Done.
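
To collect entries like the first line above from a large Pulse log, a small script can help. This is only a sketch: the regular expression is modelled on the single stalled-slave line quoted here, so other Deadline versions may word the message differently, and the names `STALLED_RE` and `find_stalled_slaves` are made up for this example.

```python
import re

# Pattern modelled on the one Pulse log line quoted in this thread (an assumption,
# not a documented log format).
STALLED_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): "
    r"Slave '(?P<slave>[^']+)' has stalled because it has not updated "
    r"its state in (?P<minutes>[\d.]+) m\."
)


def find_stalled_slaves(log_lines):
    """Return (timestamp, slave name, minutes since last update) for each stalled-slave line."""
    results = []
    for line in log_lines:
        match = STALLED_RE.match(line.strip())
        if match:
            results.append(
                (match["timestamp"], match["slave"], float(match["minutes"]))
            )
    return results


sample = [
    "2014-10-07 13:41:44: Slave 'CELL-RS-22' has stalled because it has not "
    "updated its state in 10.283 m. Performing house cleaning...",
    "2014-10-07 13:41:44: Could not find associated job for this slave.",
]
print(find_stalled_slaves(sample))
```

For the quoted line, 10.283 minutes is roughly 10 minutes and 17 seconds, which puts the last state update at about 13:31:27. That is also the time of the last entry in the first Slave log quoted further down in this post.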

Unfortunately there’s a gap in the Slave log files of that machine. The first one of that day ends 13:31 with

2014-10-07 13:31:24: The license file being used will expire in 52 days.
2014-10-07 13:31:26: Info Thread - requesting slave info thread quit.
2014-10-07 13:31:27: Info Thread - shutdown complete
2014-10-07 13:31:27: Scheduler Thread - shutdown complete

and the second one continues with

2014-10-07 14:07:30: BEGIN - CELL-RS-22\render
2014-10-07 14:07:30: Deadline Slave 7.0 [v7.0.0.33 R (d065b23bf)]
2014-10-07 14:07:34: Auto Configuration: No auto configuration for Repository Path could be detected, using local configuration
2014-10-07 14:07:38: Info Thread - Created.
2014-10-07 14:07:41: Trying to connect using license server '27000@cell-lic-02'...
2014-10-07 14:07:41: License obtained.

This is the content of the Launcher log from right before the shutdown:

2014-10-07 13:31:22: Updating Repository options
2014-10-07 13:31:22: - Remote Administration: enabled
2014-10-07 13:31:22: - Automatic Updates: disabled
2014-10-07 13:31:26: ::ffff:192.168.1.13 has connected
2014-10-07 13:31:26: Launcher Thread - Received command: ShutdownMachine
2014-10-07 13:31:26: Sending command to slave: StopSlave
2014-10-07 13:31:26: Got reply: CELL-RS-22: Sent "StopSlave" command. Result: "Connection Accepted.
2014-10-07 13:31:26: "
2014-10-07 13:31:27: No Monitor to shutdown
2014-10-07 13:31:27: No Pulse to shutdown
2014-10-07 13:31:27: Launcher Thread - Responded with: Success|

EDIT/UPDATE
This morning (basically just now) the whole farm was powered off as was expected as all the jobs were done. Two machines (cell-rs-17 and cell-rs-19) were marked as “Stalled”. I attached a zip with the logs from the session before the shutdown and the new session from this morning of cell-rs-19.

Understood. Do all the other limiting options for the slave not apply to that situation we discovered either?

cheers,
Holger
logs.zip (15 KB)

rrussell · October 8, 2014, 3:47pm

There should be lines in the log prefixed with this:

Power Management - Machine Startup:

Just realized though that you’ll need Pulse Verbose Logging enabled, which you can do from the Application Logging section of the Repository Options. After enabling it, restart Pulse so that it recognizes the changes immediately.

2.) Hmm, based on the logs, the slave appears to have shut down gracefully. I’ve logged this as a bug and we’ll run some tests here to see if we can reproduce.
3.) That’s correct. They only apply to the Idle Detection feature.
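
Since Pulse logs can get large, filtering on the "Power Management - Machine Startup:" prefix mentioned above may be quicker than searching by hand. The following Python sketch is just one way to do that; `startup_lines` is a made-up helper name, and the sample lines are copied from a Pulse log excerpt posted later in this thread.

```python
PREFIX = "Power Management - Machine Startup:"


def startup_lines(log_lines, machine=None):
    """Yield Power Management startup lines, optionally only those naming `machine`."""
    for line in log_lines:
        if PREFIX not in line:
            continue
        # Machine names are compared case-insensitively (cell-rs-10 vs CELL-RS-10).
        if machine is None or machine.lower() in line.lower():
            yield line.rstrip("\n")


sample = [
    "2014-10-10 12:02:03: Power Management - Machine Startup: - checking slave CELL-RS-09",
    "2014-10-10 12:02:03: Power Management - Machine Startup: - checking slave CELL-RS-10",
    "2014-10-10 12:02:04: Power Management: No machine startup notification address "
    "specified in Repository Options - cannot send notification",
]

for line in startup_lines(sample, machine="cell-rs-10"):
    print(line)
```

To run it against a real log, pass an open file object instead of `sample`; the log folder can be found through Help -> Explore Log Folder. If the filter returns nothing even with Pulse Verbose Logging enabled, that would suggest Pulse never considered the machine for wake up.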

Cheers,
Ryan

celluloidvfx · October 8, 2014, 7:51pm

1.) Just enabled it. I will observe this, and the next time we expect any of those machines to be woken up and it doesn’t happen, i will look into the Pulse log.
2.) Ok. Are there any Windows logs i could check or provide you with?
3.) Alright. Really looking forward to this feature to disable the “Start Slave” command when waking up machines. One thing i’m not quite sure about: wouldn’t it make more sense to just also observe the idle detection criteria when sending the “Start Slave” command instead of having a new feature to disable it? This should not affect machines that are woken up, as all the possible limitations that one can set for the idle detection don’t really apply to a machine that has just been woken up from powered off state. To me it feels like this might be the better approach as it doesn’t add yet another option to the many options that are in Deadline already. What do you think?

rrussell · October 9, 2014, 3:41pm

1.) Sounds good.
2.) Probably not. We’re not seeing any indication of a crash here, so I don’t think there will be anything in the event log. Out of curiosity, do you have the Launcher installed as a service on the machines where the slave sometimes shows up as stalled? I’m asking because we recently found a bug where the slave wouldn’t get closed properly when the machine is shut down and the slave is running as a service.
3.) We implemented it this way because we didn’t want the idle detection criteria to prevent the slave from being launched on the machine by other means (ie: power management, remote control, manually by the artist on the machine, etc).

Cheers,
Ryan

celluloidvfx · October 9, 2014, 3:53pm

1.) No news yet, please stay tuned…
2.) They’re not (yet) running as a service. As we’re in a very early stage with our Deadline testing and experiences we’re using the autologon registry setting and are launching them as desktop applications. But we’re definitely thinking of changing those into a service.
3.) Ok, i see.

rrussell · October 9, 2014, 3:56pm

Cool, good to know. We still have it logged as an issue and we’ll run some tests to try and reproduce.

celluloidvfx · October 10, 2014, 12:25pm

We just had the issue again. A 3ds max job was submitted at 12:01:54. But not all of the machines expected to wake up actually did. Two seemed to not have recognized the WOL packet/call. According to the Pulse log (attached) they were among the three candidates that should have woken up:

2014-10-10 12:02:03: Power Management - Machine Startup: Checking job "wmp_wav_001_030_rndr_diamonds-closeup_v006_an " (5437ae9274fc3215d0028240)
2014-10-10 12:02:03: Power Management - Machine Startup: Checking slaves in Machine Group "all"
2014-10-10 12:02:03: Power Management - Machine Startup: - only slaves with assigned pool "rs" should be woken up
2014-10-10 12:02:03: Power Management - Machine Startup: - only slaves with assigned group "max" should be woken up
2014-10-10 12:02:03: Power Management - Machine Startup: - slaves that are candidates to be woken up based on the job's pool, group, and bad slave list: CELL-RS-09,CELL-RS-10,CELL-RS-12,CELL-RS-13,CELL-RS-14,CELL-RS-15,CELL-RS-16
2014-10-10 12:02:03: Power Management - Machine Startup: - checking slave CELL-RS-09
2014-10-10 12:02:03: Power Management - Machine Startup: - waking up offline slave CELL-RS-09 because it is required to render job "wmp_wav_001_030_rndr_diamonds-closeup_v006_an "
2014-10-10 12:02:03: Power Management - Machine Startup: - checking slave CELL-RS-10
2014-10-10 12:02:03: Power Management - Machine Startup: - waking up offline slave CELL-RS-10 because it is required to render job "wmp_wav_001_030_rndr_diamonds-closeup_v006_an "
2014-10-10 12:02:03: Power Management - Machine Startup: - checking slave CELL-RS-12
2014-10-10 12:02:03: Power Management - Machine Startup: - waking up offline slave CELL-RS-12 because it is required to render job "wmp_wav_001_030_rndr_diamonds-closeup_v006_an "
2014-10-10 12:02:03: Power Management - Machine Startup: - Maximum number of required slaves have been told to wake up for this machine group's interval
2014-10-10 12:02:04: Power Management: No machine startup notification address specified in Repository Options - cannot send notification

But even now, after quite a while, not all machines that are in the respective pool “rs” and group “max” are running. The ones still missing are the machines named “cell-rs-17” to “cell-rs-22”. As you can see from the log posted above, they are not in the list of candidates despite being in that pool and group, and there was also no bad slaves list for this job.
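
The comparison described above (which expected machines never show up as candidates) can be sketched in a few lines of Python. The regular expression is an assumption based only on the candidate line in this excerpt, and `parse_candidates` is a made-up helper name.

```python
import re

# Matches the comma-separated slave names after "bad slave list: " in the
# candidate line quoted above (format assumed from this one excerpt).
CANDIDATES_RE = re.compile(r"bad slave list: (?P<names>[\w,-]+)")


def parse_candidates(log_text):
    """Collect slave names from every 'candidates to be woken up' line in the log text."""
    candidates = set()
    for match in CANDIDATES_RE.finditer(log_text):
        candidates.update(name.upper() for name in match["names"].split(",") if name)
    return candidates


log_excerpt = (
    "2014-10-10 12:02:03: Power Management - Machine Startup: - slaves that are "
    "candidates to be woken up based on the job's pool, group, and bad slave list: "
    "CELL-RS-09,CELL-RS-10,CELL-RS-12,CELL-RS-13,CELL-RS-14,CELL-RS-15,CELL-RS-16"
)

# Machines expected to be eligible (cell-rs-17 to cell-rs-22, as described above).
expected = {f"CELL-RS-{n}" for n in range(17, 23)}

missing = sorted(expected - parse_candidates(log_excerpt))
print(missing)
```

For this excerpt all six machines from cell-rs-17 to cell-rs-22 end up in `missing`, which matches what the Monitor shows: Pulse never treated them as wake up candidates in the first place.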

cheers,
Holger
deadlinepulse-cell-temp-01-2014-10-10-0000.log (7.5 MB)

rrussell · October 10, 2014, 1:10pm

Hmm, it might be best to set up a remote session with our support team so that they can poke around and try to figure out what the problem is. Can you send an email to support (at) thinkboxsoftware (dot) com and reference this forum thread?

Thanks!
Ryan

eamsler · October 10, 2014, 2:07pm

My guess at the moment is that the Pulse machine may be on a different subnet than those machines.
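
A quick way to check that guess is to compare the addresses with Python's standard `ipaddress` module. The addresses and the /24 prefix length below are placeholders, not values taken from this setup, and `same_subnet` is a made-up helper name.

```python
import ipaddress


def same_subnet(ip_a: str, ip_b: str, prefix_len: int = 24) -> bool:
    """Return True if both IPv4 addresses fall in the same network for the given prefix length."""
    # strict=False lets us build the network from a host address.
    net_a = ipaddress.ip_network(f"{ip_a}/{prefix_len}", strict=False)
    return ipaddress.ip_address(ip_b) in net_a


# Placeholder addresses: substitute the Pulse machine and one of the slaves.
print(same_subnet("192.168.1.13", "192.168.1.122"))
print(same_subnet("192.168.1.13", "192.168.2.122"))
```

The first call prints True and the second False. Wake-on-LAN broadcasts normally don't cross subnet boundaries unless the router is configured to forward them, which is why this is worth ruling out.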

I’ll take a look for your support e-mail and we’ll carry on there.

celluloidvfx · October 10, 2014, 3:16pm

hi Ryan,

definitely not. We’re not using different networks / subnets. I can 100% exclude that.
Today i won’t have time for a Teamviewer session so we’d have to do that monday in case you still think it’s necessary.
Let me know…

cheers,
Holger

rrussell · October 10, 2014, 4:07pm

I think it’s still worth doing a remote session. It should produce results a lot quicker than going back and forth on the forums.

Please contact our support team to schedule a time. Just note that Monday next week is a holiday and our office will be closed.

Thanks!
Ryan

celluloidvfx · October 14, 2014, 4:37pm

As i just posted in the two other threads related to this issue, i forgot to add those machines to the ‘all’ group in the Power Management Options. This is corrected now and i assume this should remedy the situation.
In case it doesn’t i’ll contact support regarding the remote session.

cheers,
Holger

celluloidvfx · October 21, 2014, 9:48pm

Update regarding issue #2.) which is this one:

The machine called ‘cell-ws-14’ has been in ‘Stalled’ state for 1.4 days right now (23:44, Oct 21st log file time) although it properly shut down. This was probably since some point around 14:24, Oct 20th (see attached screenshot).
I attached the Pulse and Slave logs.

Cheers,
Holger
Deadline_logs_2014-10-21.zip (2.15 MB)

rrussell · October 22, 2014, 12:52pm

Hi Holger,

We were actually able to reproduce this issue last week. The problem was simply that the slave didn’t reliably mark itself as Offline during normal shutdown. The fix for this will be included in the next beta release.

Cheers,
Ryan

celluloidvfx · October 22, 2014, 2:36pm

hi Ryan,

good to hear. Looking forward to it and glad i could help find the issue.

cheers,
Holger

celluloidvfx · November 1, 2014, 11:47am

Unfortunately, right after upgrading to beta6 this issue showed up once more. The affected machine is cell-rs-21; it must have happened around 2014-11-01/00:02:20 when the machine shut down. I noticed at noon today that it had been in ‘Stalled’ state for almost 10hrs in the Monitor.
Attached are the Pulse and Slave logs.

Cheers,
Holger
Deadline_logs_2014-11-01.zip (425 KB)