Would it be possible to add some stricter logic to the slaves to prevent duplicates with the same name from existing?
Are duplicate slaves actually showing up in the Monitor? The slave name is used as the unique identifier, so this shouldn’t be possible…
No, but there is apparently nothing stopping a single machine from running multiple slaves that shadow each other under the same name. We had a building-wide power failure the night before last, and getting the Deadline slaves started up again properly was an absolute nightmare. Machines were starting 4 or 5 slave processes, only one of which would actually respond to a “shutdown” request (I’ve previously reported this as a bug). However, they all seem to think they can dequeue tasks, so you end up with a crippled render node that can only be taken offline by manually killing the slave processes and then restarting the launcher.
I know the running theory is that this kind of behavior is caused by issues with stalled slave restarting, but since there is nothing preventing it from actually happening, there’s no reason it can’t occur under different circumstances.
I believe the idea of eliminating the “one process per slave instance” approach has been discussed, and I REALLY like that, as it should eliminate a lot of the current reliability issues with Deadline processes. Is that change on the road map? I can’t remember how that discussion ended…
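Just to illustrate what I mean by “stricter logic”: something like an exclusive lock keyed on the slave name, so a second process trying to start under the same name simply refuses to run. A rough sketch only (the lock path, slave name, and launch command here are placeholders, not how Deadline actually starts its slaves):

# Rough sketch, not Deadline internals: take an exclusive lock keyed on the
# slave name before launching, so a second copy with the same name bails out.
import fcntl
import subprocess
import sys

SLAVE_NAME = "arno"                                   # placeholder slave name
LOCK_PATH = "/tmp/deadlineslave-%s.lock" % SLAVE_NAME  # placeholder lock file

def launch_slave():
    lock_file = open(LOCK_PATH, "w")
    try:
        # Non-blocking exclusive lock; raises if another process holds it.
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except IOError:
        print("Slave '%s' appears to be running already; not starting another." % SLAVE_NAME)
        sys.exit(1)

    # Placeholder launch command -- in reality the Launcher starts the slave.
    subprocess.call(["mono", "deadlineslave.exe", "-nogui", "-name", SLAVE_NAME])

if __name__ == "__main__":
    launch_slave()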
We have these duplicate slave issues as well. I thought it was our fault somehow (as we start deadline with our internal management app).
Sometimes we get 4-5 instances…
The problems you are describing have all been addressed in Deadline 6.2.1. RC2 should be up later today, which will likely be the final release candidate unless something significant comes up.
Cheers,
Ryan
That’s good to hear. Any thoughts or news on the slave process consolidation idea?
Nothing new to report here. It’s still something we’re going to consider during the design work we’re doing for version 8.
Cheers,
Ryan
OK, thanks for the info.
I noticed some strange behaviour with some of our render blades, with their status flicking back and forth between one thing and another.
It turned out two slaves with the same name were running on the machine at the same time. Rebooting seemed to fix it.
Hi puretvfx, which version of Deadline were you experiencing this with?
Thanks!
Ryan
[DeadlineRepository]
Version=7.0.0.33
I think it may have been caused by us doing silly things - thought it was relevant to the thread though.
Hi
I’ve just noticed this issue on our farm - multiple, identically named slaves spawned by one launcher:
root 5198 1981 9 Dec22 ? 01:35:53 ./mono --runtime=v4.0 /mount/apps/linux/thinkbox/deadline/Deadline7/bin/deadlineslave.exe -nogui -name arno
root 5201 1981 10 Dec22 ? 01:36:37 ./mono --runtime=v4.0 /mount/apps/linux/thinkbox/deadline/Deadline7/bin/deadlineslave.exe -nogui -name arno
root 11075 1981 5 06:51 ? 00:08:34 ./mono --runtime=v4.0 /mount/apps/linux/thinkbox/deadline/Deadline7/bin/deadlineslave.exe -nogui -name arno
root 18953 1981 6 08:00 ? 00:05:11 ./mono --runtime=v4.0 /mount/apps/linux/thinkbox/deadline/Deadline7/bin/deadlineslave.exe -nogui -name arno
root 18954 1981 6 08:00 ? 00:04:59 ./mono --runtime=v4.0 /mount/apps/linux/thinkbox/deadline/Deadline7/bin/deadlineslave.exe -nogui -name arno
root 19486 1981 6 07:56 ? 00:05:13 ./mono --runtime=v4.0 /mount/apps/linux/thinkbox/deadline/Deadline7/bin/deadlineslave.exe -nogui -name arno
Is there a configuration setting to avoid this?
Deadline version 7.0.0.50
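In the meantime, something like this can at least flag machines that are in this state - it just walks /proc and groups slave processes by their -name argument (Linux only, and only a rough sketch; the deadlineslave.exe match is based on the command lines shown above):

# Rough sketch, Linux only: group running deadlineslave.exe processes by
# their -name argument and report any name that appears more than once.
import os
import collections

def slave_processes():
    # Yield (pid, slave_name) for every running deadlineslave.exe process.
    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue
        try:
            with open("/proc/%s/cmdline" % pid, "rb") as f:
                args = [a.decode("utf-8", "replace") for a in f.read().split(b"\0") if a]
        except IOError:
            continue  # process exited while we were looking
        if not any("deadlineslave.exe" in a for a in args):
            continue
        name = "(unnamed)"
        if "-name" in args and args.index("-name") + 1 < len(args):
            name = args[args.index("-name") + 1]
        yield int(pid), name

by_name = collections.defaultdict(list)
for pid, name in slave_processes():
    by_name[name].append(pid)

for name, pids in sorted(by_name.items()):
    if len(pids) > 1:
        print("Duplicate slave '%s' running as pids: %s" % (name, pids))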
Cheers
Andrew
Hmm, that’s odd. Is this happening across your farm, or is it limited to a few machines?
There are settings to disable the ability to launch multiple slaves, but I’m not sure that would help here, since only one instance of a slave with a given name should be running at any one time (and the command line shows they all have the same name). I wonder if any of these are hung. If you go to the Deadline 7 logs folder on this machine (/var/log/Thinkbox/Deadline7), do you see logs for all these instances (they’ll end with 0000, 0001, 0002, etc)? If so, are all logs still being written to?
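If it helps, here’s a quick way to dump that info - it just lists the slave log files in that folder and shows when each one was last written to (adjust the path or the glob pattern if your log names differ):

# Quick sketch: list slave log files and their last-modified times, using the
# default Linux log location mentioned above.
import glob
import os
import time

log_dir = "/var/log/Thinkbox/Deadline7"
for path in sorted(glob.glob(os.path.join(log_dir, "deadlineslave*"))):
    age = int(time.time() - os.path.getmtime(path))
    print("%s  (last written %d seconds ago)" % (os.path.basename(path), age))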
Cheers,
Ryan
This seems to be limited to the machines that I’ve set up to use idle detection, and I’m starting the launcher using a lightDM script as per my recent post in the “Deadline 7 Bugs” forum.
I’m not sure which log you might expect to see multiple instances of. One thing I did notice is that in this log - deadlinelauncher-arno-2014-12-23-0002.log - it does mention launching a slave more than once because it’s in more than one scheduling group. This isn’t how it would normally be configured, but it is while I’m testing things out -
2014-12-23 14:41:18: Launcher Scheduling - Launching slave arno because this machine has been idle longer than 10 minutes (scheduling group "z6-test")
2014-12-23 14:41:18: Launching Slave: arno
2014-12-23 14:41:18: Launcher Scheduling - Launching slave arno because this machine has been idle longer than 10 minutes (scheduling group "Z620")
2014-12-23 14:41:18: Launching Slave: arno
2014-12-23 15:19:31: Launcher Scheduling - Stopping slave arno because this machine is no longer idle (scheduling group "z6-test")
2014-12-23 15:19:32: Sending command to slave: StopSlave
2014-12-23 15:19:38: Got reply: arno: Sent "StopSlave" command. Result: "Connection Accepted.
I’ll alter the groups so the workstations I’m looking at are in one scheduling group only, but I’d say this shouldn’t happen anyway, as it could be useful to have workstations in a couple of scheduling groups.
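I don’t know how the launcher handles this internally, but conceptually it just needs to remember which slaves it has already started within a scheduling pass, something along these lines (purely illustrative - the group and slave names are made up and none of this is Deadline’s actual code):

# Purely illustrative: de-duplicate launch requests when a machine belongs to
# more than one scheduling group, so each slave name is only launched once.
def launch_idle_slaves(scheduling_groups, start_slave):
    already_launched = set()
    for group in scheduling_groups:
        for slave_name in group["slaves"]:
            if slave_name in already_launched:
                continue  # another group already triggered this slave
            start_slave(slave_name)
            already_launched.add(slave_name)

# Example: both groups contain "arno", but it is only launched once.
groups = [
    {"name": "z6-test", "slaves": ["arno"]},
    {"name": "Z620", "slaves": ["arno"]},
]
launch_idle_slaves(groups, lambda name: print("Launching Slave: %s" % name))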
The way the slave command line works (or should work) is that if the slave is already running, the command line will detect that and exit. Do you know if any of these processes are “zombie” processes? We did resolve an issue for the upcoming 7.1 release where the Launcher didn’t always clean up processes that it launched properly, and I wonder if that’s what’s happening here…
There don’t appear to be any zombie processes on the render nodes. I’ll keep an eye on things…