AWS Thinkbox Discussion Forums

Needed: More rigid slave definition mechanism

One thing Deadline has needed for a long time is a better way of defining the slaves that exist on any given machine, and a reliable association between slave processes and slave entries in the repository.

Currently, the presence of empty .ini files in /var/lib/Thinkbox/Deadline10/slaves (on Linux) is used to define slave instances. However, these files do not reliably map to the slaves that actually appear in the repository, which can easily lead to multiple deadlineslave processes thinking they are the same slave and stepping on each other during renders.

As an example, I’m currently looking at a machine that ostensibly (according to the slave list in the Monitor) hosts 5 slaves. The machine’s hostname is rd-240, and the slaves that appear in the Monitor are:

rd-240-01
rd-240-02
rd-240-03
rd-240-04
rd-240-05

However, when you actually look at the machine, there are more than 5 slave processes running, and the “slaves” directory is a complete mess:

01 ~.ini
01.ini
02 ~.ini
02.ini
03 ~.ini
03.ini
04 ~.ini
04.ini
05 01.ini
05.ini

When the launcher starts, it will actually spawn 10(!) deadlineslave processes, with pairs of slave processes thinking they are the same slave. When rendering, they both try to use the same log files (including the renderthread logs), which ends up causing fairly generic-looking error reports on jobs (e.g. when one of the slaves decides to remove the log because it fails a task, but the other one still wants to use it). I was not the person who created the slaves on this machine, so I don’t know what the user entered in the “Start New Slave Instance” input to create these, but if I hadn’t seen this pattern years ago and eventually tracked down the cause, I would be pretty confused.

The upshot of all this is that Deadline really needs a more robust, rigid method of defining slaves on hosts, tracking slave processes, and associating them with slave entries in the repository, because the suffix-based approach is still demonstrably buggy and unreliable.
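As a stopgap, a directory like the one above can be audited with a short script. This is only a sketch: it assumes (I have not confirmed this against Deadline's actual parsing) that the slave suffix is the leading run of letters, digits, hyphens, or underscores in each .ini file name, so that "01 ~.ini" and "01.ini" both resolve to "01". The names slave_suffix, find_collisions, and audit_directory are made up for illustration.

```python
import re
from collections import defaultdict
from pathlib import Path

# ASSUMPTION: the suffix is the leading run of [A-Za-z0-9_-] in the file stem.
# Deadline's real rule may differ; adjust this pattern if so.
_SUFFIX = re.compile(r"[A-Za-z0-9_-]+")


def slave_suffix(filename):
    """Return the assumed slave suffix for an .ini file name."""
    stem = filename[:-4] if filename.lower().endswith(".ini") else filename
    match = _SUFFIX.match(stem)
    return match.group(0) if match else ""


def find_collisions(filenames):
    """Group .ini file names by assumed suffix; keep groups with more than one file."""
    groups = defaultdict(list)
    for name in filenames:
        if name.lower().endswith(".ini"):
            groups[slave_suffix(name)].append(name)
    return {s: sorted(names) for s, names in groups.items() if len(names) > 1}


def audit_directory(path):
    """Run the collision check against a real slaves directory."""
    return find_collisions(p.name for p in Path(path).iterdir() if p.is_file())


if __name__ == "__main__":
    # The listing from the machine described above.
    listing = ["01 ~.ini", "01.ini", "02 ~.ini", "02.ini", "03 ~.ini",
               "03.ini", "04 ~.ini", "04.ini", "05 01.ini", "05.ini"]
    for suffix, names in sorted(find_collisions(listing).items()):
        print(suffix, names)
```

Under that assumption, every suffix on rd-240 shows up with two files, which matches the pairs of duplicate processes. On a real host you would call audit_directory("/var/lib/Thinkbox/Deadline10/slaves") instead of using the hard-coded listing.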

I totally agree there, Nathan. I think one of the issues here is that we’re using local files for this. One reason for that is that it allows a Launcher to start up Slaves without needing to connect to the database. If we moved that data to the database, I think we would need an event to fire once the Launcher starts up to trigger those new Slaves.

Another piece of this puzzle is that two Slaves with the same name should not both be able to start. There’s a race condition here: each Slave tries to grab a TCP port to use as a token and writes the port number into its INI file. That’s another piece we’d want to fix in all of this.
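To illustrate the kind of atomic check that would close that race, here is a minimal sketch using an advisory file lock keyed on the slave name. To be clear, this is not how Deadline works today, and it swaps the TCP-port token for flock(); SlaveNameLock and lock_dir are hypothetical names, and the code is Linux/Unix only because it relies on the fcntl module.

```python
import fcntl
import os


class SlaveNameLock:
    """Hypothetical per-name lock (not part of Deadline).

    The kernel grants the exclusive flock to exactly one open file
    description at a time, so two processes claiming the same slave
    name cannot both succeed.
    """

    def __init__(self, lock_dir, slave_name):
        self.path = os.path.join(lock_dir, slave_name + ".lock")
        self.fd = None

    def acquire(self):
        """Try to claim the name; return True on success, False if it is taken."""
        fd = os.open(self.path, os.O_CREAT | os.O_RDWR, 0o644)
        try:
            # LOCK_NB makes the call fail immediately instead of waiting.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            os.close(fd)
            return False
        self.fd = fd
        return True

    def release(self):
        """Give the name back; the lock is also dropped if the process exits."""
        if self.fd is not None:
            fcntl.flock(self.fd, fcntl.LOCK_UN)
            os.close(self.fd)
            self.fd = None
```

A slave process would call acquire() before doing anything else and exit if it returns False. Because the kernel releases the lock when the holder dies, a crashed slave does not leave a stale claim behind, which is the main reason to prefer this over writing a token into a file.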

Also, what version of Deadline was this? I have a pretty similar report for 10.0.2 that came in some time ago. Are you able to reproduce it? I’m curious what other details we could pass along to the developers as hints.
