A short introduction to our situation
Currently we have more render capacity on our workstations than on the dedicated farm, so they should be used as effectively as possible and, because people forget, automatically. That means the artists come to work, don’t worry about Deadline, the farm, or anything other than their actual tasks, and shut down their workstation when they leave. The rest of the time the workstations should wake up and render when needed, and of course shut down when idle.
While trying to make Deadline do all of this for us, I stumbled across some things regarding Power Management (PM), WOL, and slave scheduling that have been bothering me for quite some time and that I either could not resolve so far or worked around inelegantly.
I should add that these experiences accumulated over the last 15 months, and we only recently updated from version 7 to 10. Since most of these problems still seem to be there, I thought I had better address them now. Hopefully some of them can be solved.
I know this is more than one topic, but it’s all related to the big picture.
Unreliability of WOL
While most of the workstations fire up outside office hours when they are needed, some won’t (the hardware is almost identical).
And it’s not as if those PCs never respond to WOL. Sometimes it works fine: for a whole week they might do exactly what they are supposed to, and then suddenly you cannot wake them up. Sometimes it is only the automatic Power Management WOL signal that does not work.
In those cases a manually triggered WOL signal (via the Deadline Monitor) will do the trick. But then you have to be there and check whether a workstation is failing to start up when it should.
Sometimes even the “manual” remote command won’t work. You notice the difference when the popup “Wake On Lan broadcast sent” does not appear.
When you start them by actually pressing the power button and shut them down again after the OS has booted, they can often be started via WOL again (until the problem reoccurs).
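For anyone who wants to rule out the Deadline side when a machine refuses to wake, here is a minimal sketch of sending a WOL magic packet directly with Python. It is not what Deadline does internally (I don’t know how it sends the signal); it only builds the standard magic packet (six 0xFF bytes followed by the target MAC repeated 16 times) and broadcasts it over UDP. The function names, the broadcast address, and port 9 are my own assumptions for illustration.

```python
import socket


def build_magic_packet(mac: str) -> bytes:
    """Return the 102-byte Wake-on-LAN magic packet for a MAC address."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"not a valid MAC address: {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)  # raises ValueError on non-hex input
    # Standard layout: 6 x 0xFF, then the 6-byte MAC repeated 16 times.
    return b"\xff" * 6 + mac_bytes * 16


def send_wol(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Broadcast the magic packet on the local network (assumed address/port)."""
    packet = build_magic_packet(mac)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # Broadcasting must be enabled explicitly on the socket.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(packet, (broadcast, port))
```

Calling send_wol("00:11:22:33:44:55") (a placeholder MAC) from a machine on the same subnet would show whether the workstation reacts to a plain magic packet at all. If it does not, the problem is more likely in the NIC, BIOS, or OS power settings than in Deadline.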
Seemingly unnecessary PM startups
Right now it seems that before a PM-triggered WOL, Deadline only checks whether a Deadline slave is running.
What it should do, in my opinion, is check whether the machine itself is running.
Would it be possible to do so? The current behavior is annoying because
a) it clutters the power management history with hundreds of entries for attempts to wake up machines that are already running, which ultimately makes the history harder to read when debugging.
b) it conflicts with the handy checkbox to start the slave after a WOL signal. Because it is impossible to block it, as one can for the idle shutdown or the slave scheduling, it keeps firing, which makes this option unusable for a workstation.
Ideally, Deadline would check for any machine that needs to be woken up and then start the machine and the slave exactly once.
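To make the idea concrete, this is roughly the kind of check I have in mind, as a rough Python sketch and not a claim about how Deadline works or could be scripted. It probes a TCP port to decide whether the machine (not the slave) is up and only keeps the machines that really need a WOL signal. The function names and the choice of port 3389 (Remote Desktop) are assumptions of mine; any port the workstations reliably answer on would do.

```python
import socket


def is_host_up(host: str, port: int, timeout: float = 2.0) -> bool:
    """Guess whether a machine is powered on by probing one TCP port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except ConnectionRefusedError:
        # An actively refused connection still means the machine answered.
        return True
    except OSError:
        # Timeout, unreachable host, failed name lookup: treat as "off".
        return False


def machines_to_wake(candidates, port=3389, probe=is_host_up):
    """Return only those candidates that do not respond to the probe."""
    return [host for host in candidates if not probe(host, port)]
```

With a filter like this in front of the wake-up step, a workstation that is already running would neither get a WOL signal nor produce an entry in the power management history. A firewall that silently drops the probe would make a running machine look switched off, so the port has to be chosen with that in mind.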
The workaround not working sometimes
My current workaround is to not use the checkbox mentioned earlier. Instead I put the workstations into the slave scheduling, running 24/7, but block the slave startup with a specified process that I put in the startup folder. As long as this process is running, the artist can work.
Of course, occasionally someone will accidentally close that process, and you can imagine what happens next.
But I have also confirmed that the slave sometimes starts up even though that specified process has been running the whole time, and that’s really frustrating for the artists and for me, trying to calm them down.
What I wish for is a checkbox “only start slave if user is logged in”, so that any user other than our render user would not have to deal with a slave running when it should not.
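The rule behind that wish is simple enough to write down. The following Python sketch is purely illustrative and does not use any Deadline API: it assumes the list of logged-in user names is obtained some other way (that part is platform-specific and left out), and the account name "render" is a placeholder for our render user.

```python
def slave_may_run(logged_in_users, render_user="render"):
    """Allow the slave only if nobody other than the render user is logged in."""
    # Compare case-insensitively, assuming Windows-style account names.
    others = {name.lower() for name in logged_in_users} - {render_user.lower()}
    return not others
```

So slave_may_run(["render"]) and slave_may_run([]) would allow the slave to start, while slave_may_run(["render", "alice"]) would not, no matter whether a blocking process happens to be running.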
WOL and VPN
Some artists prefer to work from home, do something remotely over the weekend, or even live in another country now. But this is a problem when their workstation has been shut down by PM. We gave them VPN access to our file server, which also hosts Deadline Pulse, and had them install Deadline on their home machine. The plan was to use the Deadline Monitor to start their workstation and then use the remote desktop utility. The Deadline Monitor works, and you can see all the jobs and slaves doing their thing. But the remote WOL does not work, although they are practically within the local network. I saw that Deadline 10 has an option to redirect remote commands through Pulse, which sounds like the solution, but it doesn’t help. In fact, after activating that checkbox I get this error message:
“cannot accept a connection because Remote Administration has been disabled under Client Setup in the Repository Options”
I double-checked: I did not deactivate Remote Administration.