Problems and annoyances regarding power management, scheduling and WOL

hummelhans · March 22, 2018, 6:49pm

A short introduction to our situation
Currently we have more render capacity on our workstations than on the dedicated farm, so they should be used as effectively as possible and, because people forget, automatically. That means the artists come to work, don’t worry about Deadline, the farm or anything other than their actual tasks, and shut down their workstation when they leave. The rest of the time the workstations should wake up and render when needed and, of course, shut down when idle.

Trying to make Deadline do all of this for us, I stumbled across some things regarding power management (PM), WOL and slave scheduling that have been bothering me for quite some time now and that I could not resolve so far, or only with inelegant workarounds.
I have to add that these experiences accumulated over the last 15 months and we just recently updated from version 7 to 10. Since most of these problems still seem to be there, I thought I’d better address them now. Hopefully some of them can be solved.
I know this is more than one topic, but it’s all related to that big picture.

Unreliability of WOL
While most of the workstations fire up outside office hours when they are needed, some won’t (the hardware is almost identical).
And it’s not like those PCs don’t WOL at all. Sometimes it works fine. For a whole week they might do just what they are supposed to, and then suddenly you cannot wake them up. Sometimes it’s the automatic power management WOL signal that does not work.
In those cases a manually triggered WOL signal (via the DL Monitor) will do the trick. But you have to be there and check whether a WS is not starting up when it should.
Sometimes even the “manual” remote command won’t work. You notice the difference when the “Wake On Lan broadcast sent” popup does not appear.
When you start them by actually pressing the power button and shut them down after the OS has booted, they can often be started via WOL again (until the problem reoccurs).

Seemingly unnecessary PM startups
Right now it seems that prior to a PM-triggered WOL, Deadline only checks whether a Deadline slave is running.
What it should do, imho, is check whether that machine is running.
Would it be possible to do so? The current state is annoying because
a) it clutters the power management history with hundreds of entries for attempts to wake up machines that are already running, which ultimately makes it harder to read for debugging.
b) it contradicts the handy checkbox to start the slave after a WOL signal. Because it is impossible to block it like one can for the idle shutdown or the slave scheduling, it keeps firing, making this option unusable for a workstation.
The ideal would be for Deadline to check for any machine that needs to be woken up and then start the machine and the slave exactly once.

The workaround not working sometimes
My current workaround is to not use the checkbox mentioned earlier. Instead I put the workstations in the slave scheduling running 24/7 but block the slave startup with a specified process which I put in the startup folder. So as long as this process is running, the artist can work.
Of course, occasionally someone will accidentally close that process and you can imagine what happens next.
But I have also seen the slave start up sometimes although that specified process was running the whole time, and that’s really frustrating for the artists, and for me trying to calm them down.

What I wish for is a checkbox “only start slave if user is logged in”, so that any user other than our render user would not have to deal with a slave running when it should not.

WOL and VPN
Some artists prefer to work from home, or do something remotely over the weekend, or even live in another country now. But this is a problem when their workstation has been shut down by PM. We provided them with VPN access to our file server, which is also the host for Deadline Pulse, and had them install Deadline on their home machine. The plan was to use the Deadline Monitor to start their WS and then use the remote desktop utility. The Deadline Monitor works and you can see all the jobs and slaves doing their thing. But the remote WOL does not work, although they are practically within the local network. Now I saw that DL 10 has a “redirect remote commands through Pulse” option, which sounds like the solution, but it doesn’t help. Actually, after activating that checkbox I get this error message:
“cannot accept a connection because Remote Administration has been disabled under Client Setup in the Repository Options”
I double-checked. I didn’t deactivate the remote administration.

anthonygelatka · March 23, 2018, 10:03am

+1 for this

I mailed about adding the schedule functions to the power functions.

Deadline 10 has a new Remote Control feature: you need to add machines to a whitelist or disable the whitelist.
Configure Repository Options > Client Setup > Remote Control

eamsler · March 23, 2018, 2:42pm

Thanks for the write-up! Here are my notes:

Unreliability of WOL:

We actually saw this turn out to be a hardware problem 10+ years ago. Depending on how the machine was shut down, it might not wake properly.

Talking through other problems, it could be that the packets aren’t making it to the right places.

Check if the machine is running:

That’s not a bad idea. We have code which can ping the machine. Currently we just check the Slave state and try starting the machine. I can add that as a request and we can use the ping as the first line of testing. If the machine does respond to ping, we can avoid sending the WOL at all. I’ve opened an issue for that one.
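
A rough sketch of that first-line check, not Deadline’s actual code: ping usually needs raw-socket privileges or a call out to the system `ping` binary, so this stand-in uses a plain TCP connect instead. The function names, the probe port (3389, the usual remote desktop port) and the `send_wol` callback are all made up for illustration.

```python
import socket


def is_machine_up(host, port=3389, timeout=1.0):
    """Return True if `host` shows any sign of life on `port`.

    Assumption: a TCP connect stands in for ping. A completed connect or
    an active refusal both prove the machine is powered on; a timeout or
    any other network error is treated as "down".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except ConnectionRefusedError:
        # The OS answered with a reset, so the machine itself is running.
        return True
    except OSError:
        return False


def wake_if_needed(host, send_wol, probe=is_machine_up):
    """Call `send_wol(host)` only when the probe says the machine is down.

    Returns True if a wake-up was sent, False if it was skipped.
    """
    if probe(host):
        return False
    send_wol(host)
    return True
```

With a check like this in front, a machine that is already running never produces a wake-up entry in the history at all.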

For starting the machine exactly once, that’s a little tricky as we don’t store state across runs. The other problem is deciding when it’s been long enough to try again. Is a day okay? Maybe a month? That one’s a little more nuanced.

Failed workaround

Why not use the Slave scheduling feature? Can you explain how you’re blocking their startup? Idle detection is a bit problematic given that the Slave is likely not running as the active user. You can also enable/disable the Slaves while they’re running via the API or DeadlineCommand.

I’d prefer seeing us fix the idle detection feature to work across all users of a workstation and have it report login / mouse activity. I have some ideas here, but we’ll have to see where we get with them. Some users on Linux, though, have built scripts that replace the xidlewrapper and do exactly that: check if users are logged in.
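
As a sketch of what such a script might check (this is not one of the actual xidlewrapper replacements, and `render` as the render account’s name is an assumption), parsing the output of `who` is enough to tell whether anyone other than the render user is logged in:

```python
import subprocess


def logged_in_users(who_output):
    """Return the set of user names in `who`-style output (first column)."""
    users = set()
    for line in who_output.splitlines():
        fields = line.split()
        if fields:
            users.add(fields[0])
    return users


def artist_is_logged_in(who_output, render_user="render"):
    """True if any account other than the render user has a session."""
    return bool(logged_in_users(who_output) - {render_user})


def current_who_output():
    """Run `who` on this machine (Linux/macOS only) and return its output."""
    result = subprocess.run(["who"], capture_output=True, text=True, check=True)
    return result.stdout
```

A wrapper could call `artist_is_logged_in(current_who_output())` and refuse to start the Slave while it returns True.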

WOL over VPN

Because Wake-on-LAN uses broadcasts, the packets are usually not forwarded by routers. It’ll depend on the VPN solution, but if you do enable forwarding through Pulse that should work, as the broadcasts will be sent from inside the right IP subnet.

Lots of good stuff here! I think limiting the number of needless wakeup messages is a great idea. I think what we should do in this thread is make the Slave behave when people are working on their machines.

hummelhans · March 28, 2018, 10:47am

First of all thank you for taking a look at this matter!

Maybe you could give me some directions on how to further investigate?
I am actually not an IT guy, just the only one willing to take care of this.

The only thing I found so far was something called ARP. It seems some routers flush their ARP table after some time, making it impossible to reach the MAC address they are supposed to send the magic packet to.

Thanks a lot! If the ping approach causes trouble, my next best idea would be to check for a running launcher or launcherservice.

I am actually using the slave scheduling for my workaround, just not the way it was meant to be used (I suppose).
You might think: just set it to after work hours, like 8pm to 6am, but it’s not that simple.
First of all, there is no such thing as working hours here. Also, once a company has reached a certain number of employees, there are always some people sick or on vacation or running errands. In that event their workstation should be available for rendering.
Since I cannot possibly know who will be there when, this is not a satisfying solution.

I guess you mean that CTRL+D shortcut.
Yes, but for the same reason as above I can’t just deactivate all WS slaves in the morning. I’d need to walk through the office multiple times a day checking who is there. Plus I want to go on vacation too someday, so it needs to be automated.

I personally don’t care too much about idle detection, because our render jobs usually take longer than someone’s short break. Also, it is a much more controlled situation if someone powers down their machine. That’s the signal to start rendering, and there can’t be an important document still open or anything else interfering with their work.

I’m using the process override under miscellaneous options in the scheduling options.
Also, I’ve placed two items in the respective users’ startup folders:
the override process (Deadline Monitor) and a slave with the -shutdown parameter to ensure a running slave will be killed as soon as the user logs in.
[attachment=2]scheduling.JPG[/attachment]
[attachment=1]stopDL_1.jpg[/attachment]

Unfortunately it doesn’t. I also have no clue why the error message states that I deactivated the remote administration.
For now we have a machine here that is always running, which you can connect to remotely and use to WOL any other machine in the office.
[attachment=0]redirect.JPG[/attachment]

wmad · March 29, 2018, 4:12pm

Hi,

+1 for the above!

I’m glad to see this issue has been raised, I was beginning to think I was the only one!

We have a similar situation, albeit in an architecture firm. We would like to be able to use free workstations for rendering jobs and distributed rendering. However, we are environmentally conscious and are keen to minimise the number of machines on at any time. We would also like this system to be automated as much as possible, making the most of people’s workstations if they are ill/on holiday or just afk.

The 150+ employees here have personal/fixed workstations so our current strategy would be to ask everyone to leave their workstations logged in and on when they leave, only locking the screen (screenlock kicks in after 30min anyway).

After 35min of idle time, the slaves will start on the workstations, thereby joining the farm.

After another set period of time (15min?), Power management would kick in and suspend machines that aren’t required.

Up to this point, no issues, everything works as desired.

However this is no longer the case once Pulse tries to wake up machines.

While WOL seems to work perfectly fine, with machines responding well, Pulse makes no distinction between slaves that have been put into idle shutdown by Power Management and slaves that are shutdown because the workstation is currently in use (we use idle detection and processes for slave scheduling).

This effectively creates a loop where PM will keep trying (and fail) to wake up the same workstations… If PM would recognise that those slaves had failed to start, perhaps it could cycle through different ones rather than trying to wake up the same ones for ever…

This is particularly frustrating as each slave has an offline message which helpfully makes that distinction!

Maybe the ‘machine start-up’ tab in PM could have a checkbox: ‘only start machines that have been put into idle shutdown by power management’. Or at least PM could get feedback on the success of its WOL packets.

Looking forward to exchanging on the above, as well as the management of VPN remote desktop and PM strategies.

Good to see you guys are looking into it anyways.

All the best,

W

eamsler · March 29, 2018, 5:13pm

The loop is an interesting problem. We had a dev issue for dealing with a similar problem where, if you physically removed, say, five machines from the queue, it would loop on them. I’ve added your use case there for extra context. I wanted a setting for skipping machines if they’d been asked to start X times in Y minutes.
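
To make the “X times in Y minutes” idea concrete, here is a minimal sketch of how such a skip rule could work. It is not how Deadline implements anything; the class name and defaults are invented, and the state lives in memory only, so it would have to be persisted somewhere to survive across power management runs.

```python
import time
from collections import defaultdict, deque


class WakeThrottle:
    """Allow at most `max_attempts` wake-ups per machine per time window."""

    def __init__(self, max_attempts=3, window_seconds=30 * 60,
                 clock=time.monotonic):
        self.max_attempts = max_attempts
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the rule can be tested
        self._attempts = defaultdict(deque)  # machine -> attempt timestamps

    def should_attempt(self, machine):
        """Record and allow an attempt, or return False to skip the machine."""
        now = self.clock()
        attempts = self._attempts[machine]
        # Forget attempts that have fallen out of the window.
        while attempts and now - attempts[0] >= self.window_seconds:
            attempts.popleft()
        if len(attempts) >= self.max_attempts:
            return False
        attempts.append(now)
        return True
```

A wake-up loop would call `should_attempt(name)` before each WOL and move on to the next candidate whenever it returns False, which also gives other machines a turn.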

I think there needs to be some design conversation on how we tie workstation idle detection with machine startup / shutdown… They’re fairly well linked. This topic is different than the original post and needs its own thread.

The hardware problem with WOL packets was within the system, so we weren’t able to fix it. These days I would find it using a tool called tcpdump (Wireshark does similar on Windows) and check that packets were arriving at the machine but that it wouldn’t wake when it was offline. The ARP table can be the problem if your network is filtering broadcasts. That table tells the network gear where data should go based on the IP/MAC address combinations it’s aware of.

Does anyone on hummelhans’ side manage the network gear? Pulse is trying about four different sets of addresses to send that magic packet:

Sending directly to the IP address (unicast)
Sending to the global broadcast address (255.255.255.255)
Sending to the specific broadcast address for the subnet (10.0.0.255 for a 10.0.0.0/24 network)
Naively to every octet in the current address (10.255.255.255, 10.0.255.255, 10.0.0.255)
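
For anyone who wants to reproduce those attempts by hand, here is a small sketch. It is not Pulse’s code: the function names are made up, UDP port 9 is just a customary choice for WOL, and the address list only mirrors the four cases above. The magic packet itself is the standard format of six 0xFF bytes followed by the target MAC repeated 16 times.

```python
import ipaddress
import socket


def build_magic_packet(mac):
    """Build a WOL magic packet: 6 x 0xFF, then the MAC repeated 16 times."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError("expected a 6-byte MAC address, got %r" % mac)
    return b"\xff" * 6 + bytes.fromhex(hex_digits) * 16


def candidate_targets(ip, prefix_len):
    """List the destination addresses to try, in order, without duplicates."""
    network = ipaddress.IPv4Network("%s/%d" % (ip, prefix_len), strict=False)
    octets = ip.split(".")
    targets = [ip, "255.255.255.255", str(network.broadcast_address)]
    # Naive variants: keep the first 1, 2 or 3 octets, fill the rest with 255.
    for keep in (1, 2, 3):
        targets.append(".".join(octets[:keep] + ["255"] * (4 - keep)))
    unique = []
    for target in targets:
        if target not in unique:
            unique.append(target)
    return unique


def send_wol(mac, ip, prefix_len, port=9):
    """Send the magic packet to every candidate address (port is assumed)."""
    packet = build_magic_packet(mac)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        for target in candidate_targets(ip, prefix_len):
            try:
                sock.sendto(packet, (target, port))
            except OSError as exc:
                print("could not send to %s: %s" % (target, exc))
```

Running `send_wol("00:11:22:33:44:55", "10.0.0.5", 24)` from a machine inside the office and again from a VPN client, while watching with tcpdump or Wireshark on the target subnet, should show which of these destinations actually get through.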

If Pulse isn’t on the same subnet as the machines it’s trying to wake up, it is likely that something is filtering those packets out. I don’t have a recommendation for a simple WOL packet sniffer. I’ve considered writing my own as a hobby project, but haven’t yet as others exist online.
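
For reference, a bare-bones listener is only a few lines. This sketch assumes the magic packets arrive as UDP datagrams on port 9 (which port Pulse actually uses is not stated in this thread), and it can only run while the machine is powered on, so it shows whether packets reach that machine’s subnet, not whether the NIC reacts to them when the machine is off. The function names are made up.

```python
import socket

SYNC = b"\xff" * 6  # every magic packet starts with six 0xFF bytes


def extract_wol_mac(payload):
    """Return the target MAC if `payload` contains a magic packet, else None."""
    start = payload.find(SYNC)
    while start != -1:
        mac = payload[start + 6:start + 12]
        # After the sync bytes the MAC must appear 16 times in a row.
        if len(mac) == 6 and payload[start + 12:start + 102] == mac * 15:
            return ":".join("%02x" % byte for byte in mac)
        start = payload.find(SYNC, start + 1)
    return None


def listen(port=9):
    """Print every magic packet received on `port` until interrupted.

    Binding to a port below 1024 typically needs elevated rights on Linux.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(("", port))
        while True:
            payload, (sender, _) = sock.recvfrom(1024)
            mac = extract_wol_mac(payload)
            if mac:
                print("magic packet for %s from %s" % (mac, sender))
```

Leaving `listen()` running on a neighbouring workstation in the same subnet while Pulse tries to wake a machine gives a quick yes/no on whether the broadcasts arrive there.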

For the “deactivated the remote administration”, Launcher will refuse to accept commands if it can’t reach the database. I wonder if Pulse is doing the same. Can you send the error message from the Monitor as well as check the Pulse log? I’m not sure if it’ll be logging anything there as I haven’t played with it yet.

hummelhans · April 16, 2018, 11:34am

Unfortunately not. I’ve been trying to get some experts here to fix the WOL situation.
The Wireshark hint was pretty enlightening though.

Strangely, when I tried to replicate the error message by WOLing my workstation from home as I did before, it did not show up.
It didn’t wake up my workstation though.
Sending the remote command from inside the office still worked.
So even if it gets redirected through Pulse, there is a difference between sending the command from a Monitor that’s connected via VPN and a Monitor that’s really in the office.

eamsler · April 16, 2018, 5:54pm

I wonder if we missed proxying a command over… I’ll open an issue for it.

TokeJepsen · July 24, 2019, 9:21am

Is there any news on getting idle detection working across users?