AWS Thinkbox Discussion Forums

Power Management Suspend results in multiple conflicting slave instances

I recently rolled out suspend/wake power management to our ~20 node Windows farm, and am seeing that at any given time about half the nodes refuse to go to sleep despite having idle times way over the limit.

Pulse recognizes that the machines are over their idle time and says it’s sending suspend commands to those machines on every PM update, but they don’t listen.

When I remote to the insomniac machines I always see multiple instances of the same Deadline Slave running (in the UI when the Launcher runs as a program, or in Task Manager when the Launcher runs as a service). This causes a number of problems:

  • Slaves don’t handle jobs properly on the affected machines: either they sit there doing nothing, or they hang at the end of a render, never exiting and never updating the repository to say their job is finished.
  • The slaves’ reports in Monitor show dozens of STALLED SLAVE entries, but the slave shows as idle in the slave list.
  • Slaves behave unpredictably when given Remote Control commands.
  • Slaves will get into a state where they kick out the RDP user (me) and refuse RDP reconnections for a while once I start killing Deadline Slave tasks.
  • The slaves’ dynamic UBL Licenses flap between Permanent, Dynamic, and the license error condition that shows pale pink in Monitor.
  • Idle times flap between instances in Monitor.

As a result, I find that I’m spending an inordinate amount of time playing whack-a-mole with uncooperative render nodes.
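In the meantime, rather than remoting into each node, I'm considering scripting the check. Something like this minimal sketch (assuming the Slave executable is named deadlineslave.exe and the nodes have Python with psutil available; adjust for your install) would flag a node that's running more than one Slave process:

# Minimal sketch: warn if more than one Deadline Slave process is running on this node.
# Assumes the Slave executable is named "deadlineslave.exe"; adjust for your install.
import psutil

def count_slave_processes(name="deadlineslave.exe"):
    count = 0
    for proc in psutil.process_iter(["name"]):
        if (proc.info["name"] or "").lower() == name:
            count += 1
    return count

if __name__ == "__main__":
    n = count_slave_processes()
    if n > 1:
        print("WARNING: %d Slave processes running on this node" % n)
    else:
        print("OK: %d Slave process(es) running" % n)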

Here’s a series of screenshots that show the problem in more detail:
https://imgur.com/a/hrjAO

I don’t see anything like this in the forum archives; any ideas?

Hello,

Can you grab us some slave log files from the render nodes that are having these issues? Thanks.

Sent a full day’s logs from the slave, pulse, and license forwarder to the Support email.

Thanks for the logs! They’re really helpful here, especially the Launcher log, since the Launcher is responsible for starting the Slave. The good news is that it seems to be doing the right thing for the most part, asking Slaves to start (or starting them itself) when the machine is resumed. The suspend/resume cycle does happen fairly often, though:

2018-01-17 00:04:26:  Launcher Thread - Machine is about to be resumed, initializing launcher thread...
2018-01-17 00:04:27:  Launcher Thread - Machine is about to be resumed, initializing launcher thread...
2018-01-17 00:09:52:  Launcher Thread - Machine is about to be suspended, shutting down launcher thread...
2018-01-17 00:14:31:  Launcher Thread - Machine is about to be resumed, initializing launcher thread...
2018-01-17 00:14:32:  Launcher Thread - Machine is about to be resumed, initializing launcher thread...
2018-01-17 00:19:57:  Launcher Thread - Machine is about to be suspended, shutting down launcher thread...
2018-01-17 00:24:38:  Launcher Thread - Machine is about to be resumed, initializing launcher thread...
2018-01-17 00:24:39:  Launcher Thread - Machine is about to be resumed, initializing launcher thread...
2018-01-17 00:50:19:  Launcher Thread - Machine is about to be suspended, shutting down launcher thread...
2018-01-17 00:55:08:  Launcher Thread - Machine is about to be resumed, initializing launcher thread...
2018-01-17 00:55:09:  Launcher Thread - Machine is about to be resumed, initializing launcher thread...
2018-01-17 01:00:28:  Launcher Thread - Machine is about to be suspended, shutting down launcher thread...
2018-01-17 01:05:17:  Launcher Thread - Machine is about to be resumed, initializing launcher thread...
2018-01-17 01:05:18:  Launcher Thread - Machine is about to be resumed, initializing launcher thread...
2018-01-17 01:15:27:  Launcher Thread - Machine is about to be suspended, shutting down launcher thread...
2018-01-17 01:20:54:  Launcher Thread - Machine is about to be resumed, initializing launcher thread...
2018-01-17 01:20:55:  Launcher Thread - Machine is about to be resumed, initializing launcher thread...
2018-01-17 01:25:34:  Launcher Thread - Machine is about to be suspended, shutting down launcher thread...
2018-01-17 01:31:03:  Launcher Thread - Machine is about to be resumed, initializing launcher thread...

Now, if I remember correctly, the Slaves are the ones responsible for shutting the machine down. In one of the logs, it does show this:

2018-01-17 00:09:41:  Scheduler Thread - Seconds before next job scan: 1
2018-01-17 00:09:42:  Listener Thread - ::1 has connected
2018-01-17 00:09:42:  Listener Thread - Received message: OnLastTaskComplete SuspendMachineIdle  :   
2018-01-17 00:09:42:  Slave - slave shutdown: normal
2018-01-17 00:09:42:  Info Thread - requesting slave info thread quit.
2018-01-17 00:09:42:  Scheduler Thread - shutdown complete
2018-01-17 00:09:42:  Waiting for threads to quit.  29 seconds until forced shutdown.
2018-01-17 00:09:42:  Error occurred when checking sandbox stdout: Cannot check if stdout is available for a process that has not been launched.
2018-01-17 00:09:43:  Info Thread - shutdown complete
2018-01-17 00:09:43:  Slave - Final cleanup
2018-01-17 00:09:44:  Listener Thread - OnConnect: Listener Socket has been closed.
2018-01-17 00:09:44:  Slave - Shutdown complete

And it looks like most of the Slaves are at least closing themselves when the Launcher is trying to shut the machine down.

I think the big problem is the disconnect between the Slaves and the Launcher somehow. At some point I think a Slave was able to re-write the “.ini” file in “C:\ProgramData\Thinkbox\Deadline10\slaves”, and so the Launcher isn’t able to keep track of them… I’m not sure how to handle this yet.
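If you want to see what the Launcher currently thinks it should be managing, you can just dump the contents of that folder and compare it against what's actually running in Task Manager. Here's a rough sketch; it assumes the default Deadline 10 data path on Windows, and the exact file names and keys can vary by version and setup:

# Rough sketch: print the Slave .ini files that the Launcher reads, to compare what it
# believes should be running against what's actually in Task Manager.
# Assumes the default Deadline 10 data path on Windows.
from pathlib import Path

slave_dir = Path(r"C:\ProgramData\Thinkbox\Deadline10\slaves")

for ini_path in sorted(slave_dir.glob("*.ini")):
    print("--- %s ---" % ini_path.name)
    print(ini_path.read_text(errors="replace"))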

For your symptoms, here are my guesses:

Slaves don't handle jobs properly on the affected machines: either they sit there doing nothing, or they hang at the end of a render, never exiting and never updating the repository to say their job is finished.

    I think they may be continually canceling each other's tasks because they see a rendering task while they're idle and perform a stalled Slave cleanup. This likely happens every time a new Slave starts up:
2018-01-17 00:04:39:  Slave '471-desktop' has stalled because of an unexpected exit. Performing house cleaning...

The slaves' reports in Monitor show dozens of STALLED SLAVE entries, but the slave shows as idle in the slave list.

    Same as above.

Slaves behave unpredictably when given Remote Control commands.

    Not sure here. 

Slaves will get into a state where they kick out the RDP user (me) and refuse RDP reconnections for a while once I start killing Deadline Slave tasks.

    The Launcher is detecting that the Slave(s) are now closed and that it can sleep the machine, which is what drops the RDP session. It *seems* from the logs that it's suspending the machine regularly anyway though, so you'd likely get kicked out even if you just sat there with the RDP session open.

The slaves’ dynamic UBL Licenses flap between Permanent, Dynamic, and the license error condition that shows pale pink in Monitor.

    The pinky-peach state shows a licensing error. I saw a few notes in the logs about running out of Deadline licenses, but it's transient and clears quickly. It only occurred from 9:02am to 9:07am on two of the Slaves on the machine you provided logs for, and once at 9:53pm.

Idle times flap between instances in Monitor.

    Because the Slaves are all writing their internal state over the same entry, and each one's idle time is different, the value in Monitor changes depending on which Slave wrote last. It's another sign that multiple Slaves are running at once.
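    If it helps to picture it, here's a toy illustration (not Deadline's actual repository code) of why the number flaps: every running instance overwrites the single shared entry for the machine, so whichever instance reported last is the one Monitor displays.

# Toy illustration only - not Deadline's actual repository code.
# Two Slave instances report different idle times under the same slave name;
# the last writer wins, so the value shown in Monitor flips between updates.
slave_info = {}

def report_idle(slave_name, idle_seconds):
    # every running instance overwrites the same shared entry
    slave_info[slave_name] = idle_seconds

report_idle("471-desktop", 540)   # instance A: idle for 9 minutes
report_idle("471-desktop", 12)    # instance B: just started up again
print(slave_info["471-desktop"])  # Monitor shows 12 now; the next update may show 540 again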

So… I think I need to go talk with the dev team about how the INI system currently works to stop multiple Slaves from starting. I believe I saw this once on a remote session when there was a database outage. I did find a fair number of database connection errors in the logs and I’ll send those your way in the ticket.
