AWS Thinkbox Discussion Forums

3ds max Vray DBR job dying without a trace

Hello!

We recently started using Deadline and so far it's been going great. Recently, however, the V-Ray DBR jobs we submit run fine for 5-10 frames and then suddenly stop working without a trace. There are no errors reported in the Deadline Monitor and nothing in the job's history log. The only clue is in the V-Ray messages: either each node says it refused the connection and won't be rendering, or, as in the attached image, a node reports it disconnected while it was still rendering, leaving the frame to sit there because the bucket the disconnected machine was rendering still shows as being rendered by that machine.

We are running Deadline 10.0.11.1, submitting from 3ds Max 2017 Update 4 (I believe), using V-Ray 3.60.0 on all Windows 10 machines.

Any advice on this topic??

Thanks for your time!
Best regards!

(Apologies that one of the images was captured a little clumsily; I did not bring the V-Ray messages window to the top when I took the screenshot, but the one green message is the one saying that node 204-14 is not responding.)

It sounds like a TCP connection issue… This is a new one for me. Are these problem machines on a shared network switch or have any obvious similarities on the infrastructure side? I haven’t seen this yet personally.

Heyo!

We are not that big of an operation here, so all the machines run through the same 10Gb switch. The machines themselves are as similar as possible: all the same hardware (with the exception of the HDDs, which Deadline doesn't use), all based on the same deployed Windows 10 image, and all updated together, so they are as similar as we could get them software-wise as well. I'll definitely have a look at the TCP tip though. I assume I should be checking the V-Ray distributed rendering port, 20204? Or are there additional ports that Deadline uses that I should be aware of and look at as well?

Thank you so much for the reply!
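As a quick way to rule out firewall or port issues before digging deeper, a minimal sketch like the following could check whether the spawner port is reachable from the submitting machine. This assumes the default V-Ray DR port 20204; the node names in the loop are placeholders taken from the thread and should be replaced with your actual render nodes.

```python
import socket

# Default V-Ray DR spawner port; adjust if your spawners use a custom port.
VRAY_DR_PORT = 20204

def check_dr_port(host, port=VRAY_DR_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical node names -- replace with your own render nodes.
for node in ["204-14", "204-15"]:
    status = "open" if check_dr_port(node) else "refused/unreachable"
    print(f"{node}:{VRAY_DR_PORT} -> {status}")
```

Running this from the master while a job is mid-render could also help catch the moment a node starts refusing connections.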

I’ve seen this kind of problem before with V-Ray DBR, whether it be interactive DBR in Deadline or Deadline’s “off-load” DBR feature. Essentially, too many machines connecting to the MASTER machine overloads it at the OS TCP/UDP socket level (DBR uses both TCP and UDP). A couple of options I usually recommend:

  • Don’t try to use > 10 machines with DBR. Things become unstable otherwise.
  • Invest in high-end fibre, 10Gb+ interconnections throughout your studio. (too expensive and excessive)
  • Disable any automatic asset-transfer settings in the V-Ray DBR settings dialog, as that adds considerable overhead to network traffic. Fundamentally, you need to do everything possible to reduce network traffic.
  • Exit ALL other applications on the MASTER machine whilst DBR rendering. Don’t look at Facebook or stream video!
  • Enable the setting in V-Ray DBR to NOT render on the local MASTER machine. The MASTER machine can then concentrate on network communications only. Yes, less render power, but use another spawner machine instead. Either way, keep it under 10.
  • If still unstable with max 10, then try max 5. Maybe your network is heavily congested.
  • V-Ray’s new Swarm thing is much better, but not yet available for 3ds Max.

I see, thanks for the wisdom!
I’ll be sure to try these things!

One day, one beautiful day, we might have a fibre system here, but not right now; as you said, a little too expensive.

Is this 10-machine DR limit still the suggestion? I ask because we’re getting a lot of not-responding DR nodes as well, and we’re using 22 at the moment.

I ask because before Deadline, when we just kept the DR spawners running on all of our farm machines, we were consistently able to DR with 20-30 machines with no issues whatsoever. Only since introducing Deadline have we seen consistent DR issues. Nothing has changed on our master node except that we now use Deadline. For the last 4 years we have had a near-flawless DR farm where our users could fire-and-forget renders and all 20-30 DR nodes would consistently pick up. In the last month or so of using Deadline, it has been a bit of a step back in our rendering efficiency. I’m fighting a losing battle trying to convince the team that Deadline was a good idea.

Once the master machine picks up, who handles the majority of the DR communication? The master node alone, or does the master have to constantly check back in with the Deadline Repository to see what is going on?

Yeah, and it seems you’ve been having the issue for a little bit here:

The master node handles the information flow and coordination between the DR slaves, as far as I know.

We should isolate this from Deadline entirely. Can you start up the DR nodes the old-fashioned way and run a render from the master from within the Max UI? Deadline itself doesn’t do much to interfere with V-Ray DR other than modifying the config file to dynamically point to different render nodes. I think it’s worth removing Deadline from the equation and seeing where the slowdowns are.
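For reference when testing the old-fashioned way: the config file in question is V-Ray’s DR settings file (`vray_dr.cfg`, typically under the 3ds Max user plugcfg folder), and Deadline’s off-load rewrites the server entries in it before launching the render. A rough sketch of what the file looks like for V-Ray 3.x is below; the IPs are hypothetical and the exact keys vary by V-Ray version, so treat this as illustrative only.

```
restart_slaves 0
list_in_scene 0
max_servers 0
use_local_machine 1
transfer_missing_assets 0
use_cached_assets 0
cache_limit_type 0
cache_limit 0.000000
192.168.0.101 1
192.168.0.102 1
```

The trailing `1` on each server line marks that node as enabled, so diffing this file before and after a Deadline job can show exactly which nodes Deadline handed to V-Ray.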

In some very limited testing, I’ve seen better behavior from the V-Ray Spawners if we just leave one DR-Spawner job running that fires up all of the V-Ray Spawners on the DR nodes. We enter all of the DR node information into the DR settings in V-Ray, then submit a single job to the Deadline Repository that is limited to just one machine, which acts as the master rendering machine. I have seen faster DR slave pick-up times, and I consistently get all 20 machines connecting.

It makes sense that this on-demand type of rendering is faster and more reliable for our situation. I think in our case, since we render 100% using DR, doing it with the DBR off-load meant we were just taxing the system by constantly starting/stopping the DR spawners with each job that rendered. By just leaving the Spawners up 100% of the time, the only thing that changes is the jobs using them.

In a more mixed rendering environment (where you might render a Jigsaw tile job, an animation, or a DR job), the DBR off-load would be better suited. But in our 100% DR environment, it may have just been too much to handle. I’ve also said before that I don’t think it helps that we have one Repository managing every office’s jobs and servers. When we kept the Repository local to our office, DBR off-load was much more responsive.

This is pretty much the same way we managed our render farm before Deadline, when we just used Backburner. We’d have the DR spawner start when the machine logged in to the rendering account and just sit there waiting for work. In a test with Deadline completely out of the mix, everything worked as expected as well.

One thing: I had a call with someone who works with @VelvetElvis, and one of the big hang-ups was that they must have a specific machine become the DR master, and when there is a failure on a task, the Deadline Slave now walks to a new task until it can find one it can pick up. The implementation of the plugin requires task 0 to be the master.

What makes this more difficult is that there were more tasks than there were machines, so the master node would walk to a free task and sit there indefinitely.

I’d say that anyone who has a small, fixed number of nodes and requires DR shouldn’t use the off-load. If you have a large pool where any machine could be the master on task 0, the off-load saves a lot of configuration and management effort.
