One thing Deadline has needed for a long time is a better way of defining the slaves that exist on any given machine, and a reliable association between slave processes and slave entries in the repository.
Currently, the presence of empty .ini files in /var/lib/Thinkbox/Deadline10/slaves (on Linux) is used to define slave instances. However, these files do not reliably map to the slaves that actually appear in the repository, which can easily lead to multiple deadlineslave processes thinking they are the same slave and stepping on each other during renders.
As an example, I’m currently looking at a machine that ostensibly (according to the slave list in the Monitor) hosts 5 slaves. The machine’s hostname is rd-240, and the slaves that appear in the Monitor are:
rd-240-01
rd-240-02
rd-240-03
rd-240-04
rd-240-05
However, when you actually look at the machine, there are more than 5 slave processes running, and the “slaves” directory is a complete mess:
01 ~.ini
01.ini
02 ~.ini
02.ini
03 ~.ini
03.ini
04 ~.ini
04.ini
05 01.ini
05.ini
When the launcher starts, it will actually spawn 10(!) deadlineslave processes, with pairs of slave processes thinking they are the same slave. When rendering, they both try to use the same log files (including the renderthread logs), which ends up causing fairly generic-looking error reports on jobs (e.g. when one of the slaves decides to remove the log because it fails a task, but the other one still wants to use it). I was not the person who created the slaves on this machine, so I don’t know what the user entered in the “Start New Slave Instance” input to create these, but if I hadn’t seen this pattern years ago and eventually tracked down the cause, I would be pretty confused.
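For anyone who wants to check their own hosts, here is a minimal sketch of how the collision could be spotted. It rests on an assumption: judging purely from the file names above, the slave suffix seems to end up being the first whitespace-separated token of the .ini file name (so "01 ~.ini" and "01.ini" both become slave 01). That is my guess, not documented Deadline behavior, and suffix_key and find_collisions are hypothetical helper names, not part of Deadline.

```python
from collections import defaultdict
from pathlib import Path

# Directory from the post (Linux, Deadline 10); adjust for other installs.
SLAVES_DIR = Path("/var/lib/Thinkbox/Deadline10/slaves")


def suffix_key(ini_name: str) -> str:
    """ASSUMPTION: the effective slave suffix is the first whitespace-separated
    token of the file name without its extension. Not documented behavior."""
    parts = Path(ini_name).stem.split()
    return parts[0] if parts else ""


def find_collisions(ini_names):
    """Group .ini file names by assumed suffix; return only suffixes claimed by 2+ files."""
    groups = defaultdict(list)
    for name in ini_names:
        groups[suffix_key(name)].append(name)
    return {key: sorted(names) for key, names in groups.items() if len(names) > 1}


if __name__ == "__main__":
    # Only reads file names; it does not touch the files or any slave process.
    names = [p.name for p in SLAVES_DIR.glob("*.ini")] if SLAVES_DIR.is_dir() else []
    for key, files in sorted(find_collisions(names).items()):
        print(f"suffix {key!r} is claimed by {len(files)} files: {files}")
```

Run against the directory listing above, this would flag all five suffixes as claimed by two files each, which matches the 10 processes the launcher starts. If the real name-to-suffix mapping is different, only suffix_key needs to change.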
The upshot of all this is that Deadline really needs a more robust, rigid method of defining slaves on hosts, tracking slave processes, and associating them with slave entries in the repository, because the suffix-based approach is still demonstrably buggy and unreliable.