I’m seeing some major (read: show-stopping) problems with launchers and slaves, which I’ll try to describe below.
I came in this morning to find that two slaves I started yesterday (one of which is running on my machine) had stalled. Neither had picked up any work (there isn’t any), and they were the only existing slaves at the time. Looking at the processes on the machines showed that the launcher was still running, so I figured I would try starting the slaves up again remotely. However, every attempt to do so resulted in a timeout error.
I killed the launcher and started it again in exactly the same manner as before (forked, no GUI, as the same user), and, as defined in the machine’s deadline.ini, it started a slave. However, even though the launcher forked properly, the slave didn’t, so when I exited that shell, it took the slave down. So I tried the remote start again (since the launcher was still running), and got another timeout error.
I then killed and restarted the launcher (again, forked), this time with the -noslave flag… and it still started a slave. Since this slave wasn’t forked properly either, it died as soon as I closed my shell, while the launcher kept running. Once more, attempting to remotely launch the slave resulted in a timeout.
Finally, I restarted the launcher one more time, again with the -noslave flag, and this time it didn’t start a slave. I was able to remotely start a slave through the monitor… but after that point, the launcher timed out on every remote operation, and eventually the slave just… went away. Every launcher operation now times out.
This happens on all 3 machines I’ve tried it on so far, all running Fedora 19 and Deadline 6.2.0.32. Our machines have no network restrictions, as they do not have any external access.
Help?
P.S. We haven’t even started using Deadline for anything yet.