
DBR-Vray Whitelist Behavior

I have been seeing some odd behavior when dealing with V-Ray DBR through SMTD. Essentially, we whitelist the one machine we want to use as the actual render machine for DBR rendering; everything else we just want to run as a spawner. For example, we whitelist Comp122 ("may become master"), while Comp001-121 are set on the blacklist ("may never become masters").

This works exceedingly well until we hit a crash on our whitelisted machine. If that machine crashes, the job immediately goes off to render on another machine that should be blacklisted, since no machine other than Comp122 should be able to become the master.

I’ve noticed that the DBR submission uses an entirely different whitelist/blacklist setup than a regular job submission. Has anyone else seen this, and is there a way to make sure that the whitelist for DBR jobs stays correct after a random render crash?

If the whitelisted machine experiences a crash, the whole job becomes stuck in a loop and never actually renders.

This is a DBR offload job, right? I may be out of the loop here, but I remember that the way this worked was that the first machine to pick up a task becomes the master, so I’m not quite sure how you’re able to use the whitelist/blacklist at all.

Can you go into more detail on how this is set up? Is it that you have one node specifically configured to use specific machines as its own spawner Slaves?

Most folks use this feature here:
docs.thinkboxsoftware.com/produ … ns-rollout

Sorry, yes, the dialog you linked is what we use for whitelisting and controlling the master. It is all done through the DBR offload settings.

What we are seeing, using the example you sent, is that if the machine named Gateway crashes for whatever reason, Deadline doesn’t wait to send the master task back to Gateway. It just randomly sends the job out to another machine on the farm, and the whole DBR job goes goofy. We see really random behavior, such as the job being active and all of the spawners being active, but every CPU sitting at 0% across the job, so nothing is actually rendering. If we remote into the machines running the job, we’ve seen 1 bucket per machine when we should be seeing 8 buckets per machine.

Outside of the whitelist bug… could weird DR slave behavior be related to a repository traffic issue? We hadn’t seen this type of behavior until we recently moved our repository to a central location for the firm. Now all of the firm’s servers, rendering needs, and renders are connecting to one single repository from offices all over the US and the world. We didn’t see any of these issues when the repository was local, or when the central firm repository only had our own office’s users and servers connecting to it.

Using the DBR offload whitelist settings seems to be a one-shot setting: even if we suspend the job, when it is reactivated Deadline randomly assigns a machine to the main task (task 0), even though the job was submitted telling it that only one machine can be the master.

I’m assuming here that Gateway is the name of the primary DR machine. We named our fancy AWS Portal proxy instance ‘Gateway’ as well, so I got a bit confused. :smiley:

The queuing mechanism in Deadline is completely distributed, meaning that any machine at any time can pick up any task. How it picks up those tasks is dictated by the rules under the job settings in the “Configure Repository Options” window, but it’s really a free-for-all. This is what makes Deadline very robust in its scheduling.
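To make that free-for-all idea a bit more concrete, here is a purely conceptual Python sketch of the dequeue model. This is not Deadline’s actual code; the `Job` class and its methods are made up for illustration. Every Worker polls on its own schedule and claims the next queued task, with the job’s machine list as the only gate:

```python
# Purely conceptual sketch of a distributed dequeue -- not Deadline's real code.
from dataclasses import dataclass, field

@dataclass
class Job:
    tasks: list                                    # task IDs still queued
    listed_machines: set = field(default_factory=set)
    whitelist: bool = True                         # True: only listed machines may render

    def allows(self, worker: str) -> bool:
        """Machine list check a Worker performs before claiming a task."""
        listed = worker in self.listed_machines
        return listed if self.whitelist else not listed

    def claim_next_task(self, worker: str):
        """Each Worker calls this independently; whoever asks first wins."""
        if self.allows(worker) and self.tasks:
            return self.tasks.pop(0)
        return None

job = Job(tasks=[0, 1, 2], listed_machines={"Comp122"})
print(job.claim_next_task("Comp001"))   # None -- blocked by the whitelist
print(job.claim_next_task("Comp122"))   # 0    -- the whitelisted master claims task 0
```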

Now, I only just learned this myself, but there is some magic being done that lifts the whitelist/blacklist restriction once a machine has picked up that first task, so after that it’s back to free-for-all mode… The only thing I can think of would be to manually stop the job and reset the whitelist/blacklist to how it was when you submitted it.
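If you want to script that reset rather than click through the Monitor, here is a minimal sketch using the Deadline Standalone Python API against the Web Service. The host, port, job ID, and especially the property key names (“ListedSlaves”, “WhitelistFlag”) are assumptions, not confirmed values; dump the dictionary returned by GetJob() and verify the exact keys for your Deadline version before relying on this.

```python
# Minimal sketch: suspend a job, re-apply its whitelist, then resume it.
# Assumes the Deadline Standalone Python API is on the path and the
# Deadline Web Service is reachable at the host/port below.
from Deadline.DeadlineConnect import DeadlineCon

JOB_ID = "REPLACE_WITH_JOB_ID"   # the DBR offload job's ID
MASTER = "Comp122"               # the only machine allowed to become the DR master

con = DeadlineCon("webservice-host", 8082)

# Suspend the job so nothing can grab the master task while the list is edited.
con.Jobs.SuspendJob(JOB_ID)

# Re-apply the machine list. The key names below are assumptions -- print the
# job dictionary and confirm them against your repository before use.
job = con.Jobs.GetJob(JOB_ID)
job["Props"]["ListedSlaves"] = [MASTER]
job["Props"]["WhitelistFlag"] = True
con.Jobs.SaveJob(job)

# Resume the job; only the whitelisted machine should now be able to pick up task 0.
con.Jobs.ResumeJob(JOB_ID)
```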

I’m also not sure whether V-Ray DR slaves support changing their master or re-connecting to it once it has gone down, so even if this mechanism didn’t work this way in Deadline, you might have to restart all of the machines anyway.
