We’re setting up a new render farm with 78 nodes, but we’re having trouble with Deadline’s Remote Administration functions (Start/Stop/Restart Slave, Shutdown, Reboot, etc.). We also get an odd error when using Wake on LAN from the Monitor.
When using Wake on LAN, the Monitor gives this error for every machine we asked it to start:
“A socket operation was attempted to an unreachable host (System.Exception)”
However, all the machines are sent Magic Packets and they do wake up. Annoying error box…
The main failure is that we cannot consistently remote admin the machines, despite having set the Repository options to allow it. We’re trying to shut down the entire farm (or parts of it), but we get this error for each of the machines:
“Render09: Failure: cannot accept a connection because Remote Administration has been disabled under the Launcher Settings in the Repository Options”
If we wait a few minutes, sometimes it will magically work if we try again. Other times, it never works until we VNC to the machine and manually exit the Launcher and reload it from the Start menu. On really good days it may work for one of the machines without having to do anything (working as intended).
We’ve confirmed many times that “Enable Remote Administration” is set to True in the repository.
This is using Deadline 3 SP1. “3.0.33353 R” is the Launcher Version.
I wonder if the first problem could be a DHCP or DNS issue. I’ve never seen this specific WOL issue before, nor can I reproduce it now. We have improved our WOL feature in our internal working version to make it more reliable, so it might work better for you in the next release, but I really can’t say for sure.
The only reason I can think of that the Launcher would think that remote administration is disabled when it’s not is if it loses its connection to the Deadline Repository, and thus all network settings it uses are reverted to their defaults (in which case remote administration would be disabled). Can you check the launcher logs on one of the machines to see if errors are being printed out? Also, the launcher would print out if and when it is toggling the remote administration option. If you see it being set to enabled, then to disabled, then enabled again, it could be that it’s periodically losing its connection to the repository, which could indicate a network problem. If you’re on XP, these logs can be found in C:\Documents and Settings\All Users\Application Data\Frantic Films\Deadline\logs on the client machine.
Ok we found something odd in the launcher logs on each of the render nodes:
2009-01-20 20:26:29: BEGIN - RENDER23\render
2009-01-20 20:26:29: Start-up
2009-01-20 20:26:29: 2009-01-20 20:26:29
2009-01-20 20:26:29: Deadline Launcher 3.0 [v3.0.33353 R]
2009-01-20 20:26:29: Launcher Thread - Launcher thread initializing...
2009-01-20 20:26:29: Launcher Thread - Remote administration is disabled
2009-01-20 20:26:29: Launcher Thread - Launcher thread listing on port 5042
2009-01-20 20:36:30: Remote Administration is now enabled
It takes exactly 10 minutes for all of these render machines to enable their Remote Administration feature. A few of our X58 workstations enable it immediately on boot up. Very strange.
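For anyone who wants to check this delay across many nodes, here is a minimal Python sketch that measures the gap between the "disabled" line at start-up and the first "now enabled" line. It assumes every log line starts with a timestamp in the format shown in the excerpt above; the function names are made up for illustration and are not part of Deadline.

```python
from datetime import datetime

# Lines copied from the launcher log excerpt above.
LOG_LINES = [
    "2009-01-20 20:26:29: Launcher Thread - Remote administration is disabled",
    "2009-01-20 20:26:29: Launcher Thread - Launcher thread listing on port 5042",
    "2009-01-20 20:36:30: Remote Administration is now enabled",
]

STAMP_LEN = len("2009-01-20 20:26:29")


def parse_line(line):
    """Split a log line into (timestamp, message)."""
    stamp = datetime.strptime(line[:STAMP_LEN], "%Y-%m-%d %H:%M:%S")
    message = line[STAMP_LEN + 2:]  # skip the ": " separator
    return stamp, message


def remote_admin_delay(lines):
    """Seconds from the first 'disabled' line to the first 'now enabled' line.

    Returns None if either line is missing.
    """
    disabled_at = None
    for line in lines:
        stamp, message = parse_line(line)
        lowered = message.lower()
        if "remote administration is disabled" in lowered:
            if disabled_at is None:
                disabled_at = stamp
        elif "remote administration is now enabled" in lowered:
            if disabled_at is not None:
                return (stamp - disabled_at).total_seconds()
    return None


print(remote_admin_delay(LOG_LINES))  # 601.0 for the excerpt above
```

Running it over the log from each node (one list of lines per file) would show quickly whether every render node has the same gap while the workstations do not.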
Could it be related to using dual NICs on these machines? Either the launcher is loading before the network comes up fully, or it is looking on the wrong NIC for the repository, failing, and trying again in 10 minutes. The internal network (which the repository lives on) is the second NIC listed in the system.
On a side note: the entire network is statically assigned IP addresses in a completely closed setup. No external internet access, no routers, just one switch for testing. DHCP and DNS are not used on the network for now. These will come later after the Render Farm is finished and ready for prime time.
It’s likely the former - the launcher is booting up before the network comes up fully. The launcher checks if the repository is online every 10 minutes, so this would explain the problem you’re having. Maybe what we could do in a future version is to have the launcher copy over the network settings whenever it checks them (every 10 minutes). Then if the launcher boots up and it can’t access the repository, it uses the local copy of the network settings until it can access the repository’s settings again.
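The local-copy idea could look roughly like the Python sketch below. This is only an illustration of the fallback order (repository, then local copy, then defaults), not how the Launcher is implemented. The JSON file format, the file paths, and the "remote_administration" key are all assumptions made up for the example.

```python
import json

# Assumed default, based on the behaviour described in this thread:
# remote administration is off when the repository cannot be read.
DEFAULTS = {"remote_administration": False}


def load_network_settings(repository_file, cache_file):
    """Return (settings, source), preferring repository, then cache, then defaults."""
    try:
        with open(repository_file, "r", encoding="utf-8") as handle:
            settings = json.load(handle)
    except (OSError, ValueError):
        # Repository unreachable or unreadable: fall back to the last local copy.
        try:
            with open(cache_file, "r", encoding="utf-8") as handle:
                return json.load(handle), "cache"
        except (OSError, ValueError):
            # No usable local copy either, so use the built-in defaults.
            return dict(DEFAULTS), "defaults"

    # Repository was readable: refresh the local copy for next time.
    with open(cache_file, "w", encoding="utf-8") as handle:
        json.dump(settings, handle)
    return settings, "repository"
```

Called on every 10 minute check, this would keep the local copy current, so a launcher that starts before the network is up would still see the last known settings instead of the defaults.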
Know any handy tricks for delaying the Launcher’s startup by a few seconds at boot time?
Or maybe tricks for having Windows wait until the network is ready before logging in automatically?
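One generic workaround, sketched here in Python, is to put a small wrapper script in the startup list instead of the Launcher itself. The wrapper polls until the repository path is reachable and only then starts the Launcher. The script name, the polling interval and timeout, and both command-line arguments are placeholders chosen for this example; nothing here is a Deadline feature.

```python
import os
import subprocess
import sys
import time


def wait_for_path(path, timeout=120.0, interval=2.0,
                  exists=os.path.exists, sleep=time.sleep):
    """Poll until `path` exists or `timeout` seconds of sleeping have passed.

    Returns True if the path became reachable, False on timeout.
    """
    waited = 0.0
    while True:
        if exists(path):
            return True
        if waited >= timeout:
            return False
        sleep(interval)
        waited += interval


if __name__ == "__main__" and len(sys.argv) >= 3:
    repository_path = sys.argv[1]   # e.g. the repository share (placeholder)
    launcher_command = sys.argv[2:]  # the launcher executable and its arguments
    # Start the launcher either way; on timeout it behaves as it does today.
    wait_for_path(repository_path)
    subprocess.Popen(launcher_command)
```

It would be started as something like `python wait_then_launch.py <repository path> <launcher executable>`, with both values filled in for your own setup.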
Also, getting back to the Wake-on-LAN error: our network is in the 192.168.0.X range with a 255.255.255.0 subnet mask. The error Deadline Monitor gave said it couldn’t send to 192.168.255.255:9. Shouldn’t it be broadcasting to 192.168.0.255? The :9 is just the UDP port, but 192.168.255.255 is the broadcast address for a 255.255.0.0 mask, not for ours.
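The broadcast address question is easy to check with Python's standard ipaddress module. The sketch below computes the directed broadcast address for a given IP and mask, and builds a standard magic packet (six 0xFF bytes followed by the MAC address repeated 16 times). Port 9 is the UDP discard port conventionally used for WOL. The host address 192.168.0.23 and the function names are made up for the example, and this is not how the Monitor itself sends its packets.

```python
import ipaddress
import socket


def broadcast_address(ip, netmask):
    """Directed broadcast address for the network containing `ip`."""
    network = ipaddress.ip_network(f"{ip}/{netmask}", strict=False)
    return str(network.broadcast_address)


def magic_packet(mac):
    """Build a WOL magic packet: 6 x 0xFF, then the MAC repeated 16 times."""
    mac_bytes = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(mac_bytes) != 6:
        raise ValueError("expected a 6-byte MAC address")
    return b"\xff" * 6 + mac_bytes * 16


def send_wol(mac, ip, netmask, port=9):
    """Send a magic packet to the subnet's broadcast address over UDP."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(magic_packet(mac), (broadcast_address(ip, netmask), port))


print(broadcast_address("192.168.0.23", "255.255.255.0"))  # 192.168.0.255
print(broadcast_address("192.168.0.23", "255.255.0.0"))    # 192.168.255.255
```

The second result matches the address in the Monitor's error, which suggests the broadcast address was derived from a 255.255.0.0 mask rather than the 255.255.255.0 mask the network actually uses.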
Oh I’m kicking myself right now. I read how to do this on a forum this week and thought to myself “That’s a neat trick. I’ll have to remember that.”
I have subsequently forgotten.
But I can say definitively there is a way to tell an application in the startup list to “wait for network” before launching… hmmm maybe this will help: support.microsoft.com/kb/Q305293