Hi Russell and all,
We’re planning to buy a few more dedicated render farm nodes soon, so we’re trying to pay more attention to some operational issues we’ve glossed over in the past. I’ll post each one in a separate thread.
We have 30 Render Farm hosts running WinXP x64 SP2. They auto-login as a special domain user, ‘deadline’, which runs Launcher and then Slave. Typically, no one logs into these machines; they’re 1U rack-mount servers with no graphics card or monitor. I don’t know if it’s important for this bug report, but we run Pulse on a separate machine under Windows Server 2003 (32-bit).
This thread concerns a case where we can’t contact the Slave on a host. From my local machine (also WinXP x64 SP2), I ran Monitor, selected the 30 hosts, and chose Remote Control > Stop Slave. (FYI, I’ve seen the exact same phenomenon in the past with Remote Control > Restart Slave and Remote Control > Restart Slave After Next Job Completion.) Of course, I chose to “wait for response from remote machines”.
Most of the machines responded normally. However, 3 machines consistently failed the remote command:
(Screenshot: UnableToWrite.png)
Ordinarily, that would make me think something is up with the network, but I can ping the machines normally, and in fact I can Remote Desktop into them with no problem. At that point I see something interesting. After logging in, here’s what the desktop looks like:
In particular, in the task bar you can see that no applications are open, but in the System Tray the icon for Deadline Slave is still visible. That’s a very odd state. In my experience, if Slave is running, its window is normally one of the windows I can switch to in the task bar. Even more interestingly, when a Slave machine is in this state, if I move my mouse over the Slave icon in the System Tray, that icon instantly disappears.
Once we log into the machine, we can usually restart Slave from Deadline Launcher normally (sometimes we have to kill Slave first). However, needless to say, what we really want is to be able to reliably administer our growing pool of machines using the Remote Control features. Right now, in actual use, the Remote Control features only have an effect on most, rather than all, of the machines in the farm. Today’s 10% failure rate (3 of 30 machines) was actually pretty good; usually it’s a little higher than that.
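Since ping and Remote Desktop both work on the failing hosts, one way to narrow this down might be to test whether each host still accepts a plain TCP connection on the port that remote commands are sent to. The Python sketch below is only an illustration of that idea, not anything taken from Deadline itself: the host names and the port number are placeholders, and the correct port would have to come from the farm's own Deadline configuration.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def unreachable_hosts(hosts, port, timeout=3.0):
    """Return the hosts that do not accept a TCP connection on the given port."""
    return [h for h in hosts if not can_connect(h, port, timeout)]


# Example usage (hypothetical host names and port, adjust for the real farm):
#   hosts = ["render%02d" % i for i in range(1, 31)]
#   print(unreachable_hosts(hosts, port=12345))
```

If the three problem machines show up in that list while the others do not, it would suggest that whatever normally listens for remote commands on those hosts has stopped accepting connections, even though the machine itself is still on the network.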
Any ideas about things to check or why we might be seeing this failure rate on Remote Control commands would be most welcome.
Leo