
Multiple Slaves Starting

We’ve got multiple computers seemingly randomly starting multiple slaves. What INI files etc should I check to disable this behavior? It’s causing a lot of problems with renderings failing.

There currently isn’t a way to disable the ability to launch multiple slaves. In beta 2, there will be an option to hide this feature from normal users, but that doesn’t prevent multiple slaves from starting up. That’s weird that they’re starting up randomly. When you find multiple slaves running on the same machine, do they have the same name, or different names? Also, the next time this happens, can you send us the launcher log from the machine that has multiple slaves running? And finally, you can remove slaves from the Launcher’s right-click menu on the machine.

Cheers,

  • Ryan

I assume they have the same name, since no extra slaves even show up in our Monitor. Our Launcher is running as a service, but I can launch it from the desktop and check whether there are any extra slave names in the list.

2011-08-18 16:48:46: BEGIN - RENDER-I7-05\renderadmin
2011-08-18 16:48:46: Start-up
2011-08-18 16:48:46: 2011-08-18 16:48:46
2011-08-18 16:48:46: Deadline Launcher 5.1 [v5.1.0.45083 R]
2011-08-18 16:48:56: Local python version file: C:\Program Files\Thinkbox\Deadline\python\2.6.7\Version
2011-08-18 16:48:56: Network python version file: \\sfs-file\repository5\python\Windows\2.6.7\Version
2011-08-18 16:48:56: Comparing python version files
2011-08-18 16:48:56: Python upgrade skipped because Version files are the same
2011-08-18 16:48:56: Local version file: C:\Program Files\Thinkbox\Deadline\bin\Version
2011-08-18 16:48:56: Network version file: \\sfs-file\repository5\bin\Windows\Version
2011-08-18 16:48:56: Comparing version files
2011-08-18 16:48:56: Launcher Thread - Launcher thread initializing...
2011-08-18 16:48:56: Perfoming remote admin check
2011-08-18 16:48:57: Remote Administration is now enabled
2011-08-18 16:48:57: Launcher Thread - Remote administration is enabled
2011-08-18 16:48:57: Launcher Thread - Launcher thread listening on port 5042
2011-08-18 16:48:59: ::ffff:192.168.94.100 has connected
2011-08-18 16:49:00: Launcher Thread - Received command: LaunchSlave Render-i7-05
2011-08-18 16:49:00: Local python version file: C:\Program Files\Thinkbox\Deadline\python\2.6.7\Version
2011-08-18 16:49:00: Network python version file: \\sfs-file\repository5\python\Windows\2.6.7\Version
2011-08-18 16:49:00: Comparing python version files
2011-08-18 16:49:00: Python upgrade skipped because Version files are the same
2011-08-18 16:49:00: Local version file: C:\Program Files\Thinkbox\Deadline\bin\Version
2011-08-18 16:49:00: Network version file: \\sfs-file\repository5\bin\Windows\Version
2011-08-18 16:49:00: Comparing version files
2011-08-18 16:49:00: Launcher Thread - Responded with: Success|
2011-08-18 16:49:57: Perfoming remote admin check
2011-08-18 16:51:57: Perfoming remote admin check
2011-08-18 16:54:57: Perfoming remote admin check
2011-08-18 16:58:57: Perfoming remote admin check
2011-08-18 17:03:57: Perfoming remote admin check
2011-08-18 17:09:58: Perfoming remote admin check
2011-08-18 17:16:58: Perfoming remote admin check
2011-08-18 17:24:58: Perfoming remote admin check
2011-08-18 17:33:58: Perfoming remote admin check
2011-08-18 17:43:58: Perfoming remote admin check
2011-08-18 17:53:59: Perfoming remote admin check
2011-08-18 18:03:59: Perfoming remote admin check
2011-08-18 18:13:59: Perfoming remote admin check
2011-08-18 18:23:59: Perfoming remote admin check
2011-08-18 18:33:59: Perfoming remote admin check
2011-08-18 18:43:59: Perfoming remote admin check
2011-08-18 18:53:59: Perfoming remote admin check
2011-08-18 19:03:59: Perfoming remote admin check
2011-08-18 19:13:59: Perfoming remote admin check
2011-08-18 19:23:59: Perfoming remote admin check
2011-08-18 19:33:59: Perfoming remote admin check
2011-08-18 19:43:59: Perfoming remote admin check
2011-08-18 19:53:59: Perfoming remote admin check
2011-08-18 19:57:33: ::ffff:192.168.94.100 has connected
2011-08-18 19:57:33: Launcher Thread - Received command: OnLastTaskComplete ShutdownMachineIdle : : Render-i7-05
2011-08-18 19:57:34: Sending command to slave: OnLastTaskComplete ShutdownMachineIdle :
2011-08-18 19:57:34: Got reply: RENDER-I7-05: Sent "OnLastTaskComplete ShutdownMachineIdle : " command. Result: ""
2011-08-18 19:57:34: Launcher Thread - Responded with: Success|
2011-08-18 19:58:15: ::ffff:192.168.94.100 has connected
2011-08-18 19:58:15: Launcher Thread - Received command: OnLastTaskComplete ShutdownMachineIdle : : Render-i7-05
2011-08-18 19:58:15: Sending command to slave: OnLastTaskComplete ShutdownMachineIdle :
2011-08-18 19:58:16: Got reply: RENDER-I7-05: Sent "OnLastTaskComplete ShutdownMachineIdle : " command. Result: ""
2011-08-18 19:58:16: Launcher Thread - Responded with: Success|
2011-08-18 20:00:14: ::ffff:192.168.94.100 has connected
2011-08-18 20:00:14: Launcher Thread - Received command: OnLastTaskComplete ShutdownMachineIdle : : Render-i7-05
2011-08-18 20:00:14: Sending command to slave: OnLastTaskComplete ShutdownMachineIdle :
2011-08-18 20:00:15: Got reply: RENDER-I7-05: Sent "OnLastTaskComplete ShutdownMachineIdle : " command. Result: ""
2011-08-18 20:00:15: Launcher Thread - Responded with: Success|
2011-08-18 20:00:52: ::ffff:192.168.94.100 has connected
2011-08-18 20:00:52: Launcher Thread - Received command: OnLastTaskComplete ShutdownMachineIdle : : Render-i7-05
2011-08-18 20:00:52: Sending command to slave: OnLastTaskComplete ShutdownMachineIdle :

Only the one slave name in the launcher when viewed through the GUI.

And so far only the one slave as well.

OK, more specifically: if I launch the service or launch the Launcher directly, there are no problems, just one slave. It only seems to happen when Pulse starts a computer up for a job. If I start a computer myself and then add it to a job, no problem. It's only when Pulse has started up a computer, so it must be sending a command to launch a slave. Any reason why Pulse would tell a computer to launch a slave when one is already running? Is there a TTD check that's too short and needs to be longer, so that it doesn't decide there is no slave running on startup?

Hmm, sounds like a power management issue then. That definitely helps narrow it down, so we'll run some tests to try to reproduce it. Normally, it shouldn't matter if Pulse tells a slave to launch when it's already running, because only one instance of a slave with a given name should ever run at a time.

Cheers,

  • Ryan

Just another heads up: out of necessity (since they're running as a service), I changed the deadline.ini to tell the slaves to launch on startup. Could that be a contributing factor?

Hmm, good question. We’ll definitely keep this in mind while trying to reproduce.

Thanks!

  • Ryan

That seems to do it in one case. I got it to happen outside of a service on one machine. I deleted the launch-slave-on-startup entry from the deadline.ini in C:\ProgramData, and now it only launches one slave. I then enabled "launch slave on startup" in the GUI Launcher, and it didn't add the entry back to ProgramData\Thinkbox\deadline.ini, so I'm not sure where you're saving that now.

Thanks for the info! That setting is stored as a per-user setting, so it would be in %LOCALAPPDATA%\Thinkbox\Deadline\deadline.ini.
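
For reference, the entry we're talking about would look something like this in whichever deadline.ini ends up holding it. This is just an illustrative sketch; the exact key name and casing may vary between Deadline versions:

; %PROGRAMDATA%\Thinkbox\Deadline\deadline.ini (system-wide), or
; %LOCALAPPDATA%\Thinkbox\Deadline\deadline.ini (per-user)
LaunchSlaveOnStartup=True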

Just to confirm, after you re-enabled the setting, did the problem come back? Or was it only when launchslaveonstartup was in %PROGRAMDATA%\Thinkbox\Deadline\deadline.ini that the slave was launched twice?

Haven’t had time to further test. I needed all of the systems for rendering last night. :wink:

Hmmm, another machine is now acting funny.
[Attachment: Slaves.jpg]

Good to know. We hope to look into this issue after beta 2 is released (which should be later this week or early next week).

Cheers,

  • Ryan

I removed it from ProgramData. Still launched two.
I removed it from UserData and it properly auto-configured and added it back in.

If I had to place a bet, I'd say the auto-config is doing it somehow.

I turned off auto-configure; there was no improvement.

This ended up being a more general issue than we originally thought. The problem was that there was a large gap between when a slave starts and when a new slave instance can tell that one is already running. Basically, you could start up as many slaves with the same name as you wanted, provided the original was still showing its splash screen. It wasn't until the splash screen went away that new instances would detect the running slave and refuse to start.
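
To make that gap concrete, here is a minimal sketch in Python (not Deadline's actual implementation) of a per-name single-instance lock that is claimed before any slow startup work; a second instance started during the splash-screen window would then back off immediately:

import os
import sys
import tempfile
import time

# Slave name taken from the log above, used only for this illustration.
SLAVE_NAME = "Render-i7-05"
LOCK_PATH = os.path.join(tempfile.gettempdir(), "slave_%s.lock" % SLAVE_NAME)

def acquire_instance_lock(path):
    """Try to create the lock file exclusively; return None if it already exists."""
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return None
    os.write(fd, str(os.getpid()).encode())
    return fd

def main():
    # The fix: claim the per-name lock *before* any slow startup work,
    # so a second instance launched a moment later sees it right away.
    fd = acquire_instance_lock(LOCK_PATH)
    if fd is None:
        print("Slave %s is already running; exiting." % SLAVE_NAME)
        sys.exit(0)
    try:
        time.sleep(5)          # stand-in for the splash screen / slow init
        print("Slave %s started." % SLAVE_NAME)
        # ... the real render loop would go here ...
    finally:
        os.close(fd)
        os.remove(LOCK_PATH)   # simplification: no stale-lock handling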

This gap has been closed in our internal build, so this issue should be fixed in beta 2!

Thanks again for providing all the information you did. It made it much easier to know where to start looking! :slight_smile:

Cheers,

  • Ryan

Great! Thanks for tracking it down, I’ll stop randomly changing farm settings for each job I submit. :wink:

And I guess my first instinct was the right one haha. :wink:

Hm, I'm still getting that issue here, on Fedora 12. When I start deadlinelauncher -nogui, multiple slaves start at the same time, with slave names from the rest of the machines in our studio that had previously launched slaves. I deleted the whole slaves folder, and still no luck. It's possibly because we're using a network distribution of Fedora here in the studio, so every machine has the exact same system. The slaves run under a single username, and I can see now that the local settings for that user contain all of the slaves. Is there a way to fix this somehow?

The slave configurations for each instance on a machine are stored in the local install folder. For example, if the client is installed to /usr/local/Thinkbox/Deadline, the folder that contains the slave instances will be /usr/local/Thinkbox/Deadline/settings/slaves. Note that the slaves folder in the local user settings just holds configuration settings for those slaves; the slaves folder in the install folder is what controls which slaves actually start up.
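
For example, on a node with that default install location you could check which slave instances are registered there. These are illustrative commands only; adjust the path to match your install, and double-check before removing anything:

# List the slave instances this node will try to start:
ls /usr/local/Thinkbox/Deadline/settings/slaves

# If entries for other machines' slaves have ended up here (e.g. via the
# shared network image), removing the stray entries should stop the extra
# slaves from launching on this node:
# rm /usr/local/Thinkbox/Deadline/settings/slaves/<stray-slave-name>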

Just to confirm, is this local install folder shared between all nodes as well?
