On most of our slaves, the Deadline Launcher disappears after a day or two… This makes remote management very painful, as it requires a wrangler to waste a day remoting into each machine manually and starting the launcher / slave up.
We get no crash reports in the event viewer, nor anything out of the ordinary in the launcher logs…
For example:
2013-10-03 15:57:22: BEGIN - LAPRO0431\scanlinevfx
2013-10-03 15:57:22: Deadline Launcher 6.1 [v6.1.0.52622 R]
2013-10-03 15:57:25: Local version file: C:\Program Files\Thinkbox\Deadline6\bin\Version
2013-10-03 15:57:25: Network version file: \inferno2\deadline\repository6\bin\Windows\Version
2013-10-03 15:57:25: Comparing version files…
2013-10-03 15:57:25: Version files match
2013-10-03 15:57:25: Launching Slave:
2013-10-03 15:57:25: Launcher Thread - Launcher thread initializing…
2013-10-03 15:57:25: Remote Administration is now enabled
2013-10-03 15:57:25: Launcher Thread - Remote administration is enabled
2013-10-03 15:57:25: Launcher Thread - Launcher thread listening on port 17060
Then at some point after that the launcher disappeared, because it wasn't there when I just remoted in. I started it again, and got this in the log:
2013-10-04 17:16:13: BEGIN - LAPRO0431\scanlinevfx
2013-10-04 17:16:13: Deadline Launcher 6.1 [v6.1.0.52622 R]
2013-10-04 17:16:14: Local version file: C:\Program Files\Thinkbox\Deadline6\bin\Version
2013-10-04 17:16:14: Network version file: \inferno2\deadline\repository6\bin\Windows\Version
2013-10-04 17:16:14: Comparing version files…
2013-10-04 17:16:14: Version files match
2013-10-04 17:16:14: Launching Slave:
2013-10-04 17:16:14: Launcher Thread - Launcher thread initializing…
2013-10-04 17:16:14: Remote Administration is now enabled
2013-10-04 17:16:14: Launcher Thread - Remote administration is enabled
2013-10-04 17:16:14: Launcher Thread - Launcher thread listening on port 17060
I can't find anything Deadline-related in the event viewer between yesterday and today… :\
Hmm, I’m guessing there is nothing in the Launcher log or event viewer again? This is so strange…
We have the same version of the Launcher that’s been running on a test machine for a week now, with no signs of memory issues (so at least the memory leak has been fixed). We also have an unhandled exception handler in the Launcher, so if an exception was being thrown, it should be caught and written to the log.
Here’s something to try if you’re willing. Maybe pick 5 or 10 machines, close the launcher if it’s still running, and then open a command prompt and run the launcher with the -console flag:
This will attach a Windows console to the launcher so that everything the launcher prints out goes to stdout. If it is throwing an exception that’s not making it to the logs, it should at least show up in the console. Then if one of these launchers dies, send us the console output and we’ll take a look.
Do you think it would make sense to add a keepalive log signal to the launcher that gets fired with some stats etc. every, say, 5 minutes? We could turn on this verbose logging on ~100 slaves or so, then see what happens.
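To make the keepalive idea concrete, here is a rough Python sketch of what such a signal could look like. It is not how the Launcher is implemented; the function names, the message format, and the choice of stats (uptime and thread count) are all made up for illustration.

```python
import threading
import time


def heartbeat_line(started_at, now, thread_count):
    """Format one keepalive message (hypothetical format, not a real Launcher log line)."""
    return "Keepalive: uptime=%ds threads=%d" % (int(now - started_at), thread_count)


def start_heartbeat(log, interval_seconds=300.0):
    """Call log(message) every interval_seconds until the returned event is set."""
    stop = threading.Event()
    started_at = time.time()

    def run():
        # Event.wait returns False on timeout and True once stop is set.
        while not stop.wait(interval_seconds):
            log(heartbeat_line(started_at, time.time(), threading.active_count()))

    threading.Thread(target=run, name="keepalive", daemon=True).start()
    return stop
```

The last keepalive timestamp in a log would then bound the time at which the process went away, to within one interval.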
Well, the console helped to narrow it down a bit. On the machines where the launcher went missing, the console was also missing.
So it seems that the launcher does not survive a reboot. I tried rebooting a couple of slaves, and the launcher would start, trigger the slave startup, then disappear right away. When I go into these boxes later, I see the slave running, but no launcher.
Well that’s odd. The launcher always starts up on our test machines and stays up, so I can’t imagine why it closes itself on your machines after launching the slave.
I’m assuming you guys have your nodes set up to auto-login? Also, what does the DeadlineLauncher key in HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows\CurrentVersion\Run look like? It should just contain the full path to deadlinelauncher.exe.
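If you want to check that Run entry on many machines without opening regedit on each, something like the following Python sketch might help. It assumes the value is named DeadlineLauncher as described above; the helper names are hypothetical, and `read_run_value` only works on Windows.

```python
import ntpath

RUN_KEY = r"SOFTWARE\Microsoft\Windows\CurrentVersion\Run"
VALUE_NAME = "DeadlineLauncher"  # name taken from the post above


def is_plain_launcher_path(value):
    """True if value is just an absolute Windows path to deadlinelauncher.exe."""
    path = value.strip().strip('"')
    return ntpath.isabs(path) and ntpath.basename(path).lower() == "deadlinelauncher.exe"


def read_run_value():
    """Read the DeadlineLauncher value from HKEY_LOCAL_MACHINE (Windows only)."""
    import winreg  # imported here so the check above also runs on other platforms
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, RUN_KEY) as key:
        value, _value_type = winreg.QueryValueEx(key, VALUE_NAME)
    return value
```

On a render node you would call `is_plain_launcher_path(read_run_value())` and flag any machine where it returns False.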
Try changing directories to “C:\Program Files\Thinkbox\Deadline6\bin” first before running those commands. Currently, the Deadline apps require that the current working directory be the bin folder so that they can find some DLLs. I’m checking to see if there is a way to avoid this requirement.
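For anyone scripting this instead of typing it into a command prompt, here is a minimal Python sketch that starts the launcher with the bin folder as its working directory. The path, the executable name, and the -console flag are the ones mentioned in this thread; the function names are hypothetical, and whether the console attaches the same way when started from a script rather than from a command prompt is an assumption.

```python
import ntpath
import subprocess

BIN_DIR = r"C:\Program Files\Thinkbox\Deadline6\bin"  # path as quoted in this thread


def launcher_invocation(bin_dir=BIN_DIR, console=True):
    """Return (argv, cwd) for starting the launcher from inside its bin folder."""
    argv = [ntpath.join(bin_dir, "deadlinelauncher.exe")]
    if console:
        argv.append("-console")
    return argv, bin_dir


def start_launcher(bin_dir=BIN_DIR, console=True):
    """Start the launcher with bin_dir as its working directory (Windows only)."""
    argv, cwd = launcher_invocation(bin_dir, console)
    return subprocess.Popen(argv, cwd=cwd)
```

Passing `cwd` to `subprocess.Popen` has the same effect as changing directories before running the command by hand.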
The only difference I see between the log of a regular startup and an interrupted one is that after the "Launcher thread initializing" entry, the interrupted one quits, while the proper one logs "Remote Administration is now enabled".
Good startup (via manual icon start):
2013-10-22 12:05:30: BEGIN - VCPRO1014\ScanlineVfx_user
2013-10-22 12:05:30: Deadline Launcher 6.1 [v6.1.0.53080 R]
2013-10-22 12:05:32: Local version file: C:\Program Files\Thinkbox\Deadline6\bin\Version
2013-10-22 12:05:32: Network version file: \inferno2.scanlinevfxla.com\deadline\repository6\bin\Windows\Version
2013-10-22 12:05:32: Comparing version files…
2013-10-22 12:05:32: Version files match
2013-10-22 12:05:32: Launching Slave:
2013-10-22 12:05:32: Launcher Thread - Launcher thread initializing…
2013-10-22 12:05:33: Remote Administration is now enabled
2013-10-22 12:05:33: Launcher Thread - Remote administration is enabled
2013-10-22 12:05:33: Launcher Thread - Launcher thread listening on port 17060
Bad startup (starts slave, then disappears):
2013-10-22 12:05:15: BEGIN - VCPRO1014\Administrator
2013-10-22 12:05:15: Deadline Launcher 6.1 [v6.1.0.53080 R]
2013-10-22 12:05:18: Local version file: C:\Program Files\Thinkbox\Deadline6\bin\Version
2013-10-22 12:05:18: Network version file: \inferno2.scanlinevfxla.com\deadline\repository6\bin\Windows\Version
2013-10-22 12:05:18: Comparing version files…
2013-10-22 12:05:18: Version files match
2013-10-22 12:05:18: Launching Slave:
2013-10-22 12:05:18: Launcher Thread - Launcher thread initializing…
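Comparing two logs like the ones above by eye gets tedious across many machines, so here is a small Python sketch that strips the timestamps and lists the messages that appear in the good log but not in the bad one. The function names are hypothetical, and it assumes every line starts with the `YYYY-MM-DD HH:MM:SS: ` prefix seen in these logs.

```python
import re

# Matches the timestamp prefix used in the launcher logs quoted in this thread.
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}: ")


def messages(log_text):
    """Return the log messages with their leading timestamps removed."""
    return [TIMESTAMP.sub("", line.strip()) for line in log_text.splitlines() if line.strip()]


def only_in_good(good_log, bad_log):
    """Return, in order, the messages present in good_log but absent from bad_log."""
    bad = set(messages(bad_log))
    return [message for message in messages(good_log) if message not in bad]
```

Any line that differs between the two runs is reported, including the BEGIN line when the logged-in account differs.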
We found a way to work around this that will be included in beta 9. When the Deadline applications start up, they just set their current working directory to their bin folder, and then Deadline can find those DLLs just fine.
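As an illustration of the general technique only (this is not Deadline's code), an application can set its working directory to its own folder at startup roughly like this in Python. The function name and the use of `sys.argv[0]` as the default are assumptions made for the sketch.

```python
import os
import sys


def use_own_folder_as_cwd(executable_path=None):
    """Change the working directory to the folder containing the executable."""
    path = executable_path or sys.argv[0]
    folder = os.path.dirname(os.path.abspath(path))
    os.chdir(folder)
    return folder
```

Calling this first thing at startup means it no longer matters which directory the process was launched from.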