Worker stall...machine crash

Hi all, newbie here. We have built a 6-GPU farm running on an i9 with 64 GB of RAM and installed Deadline.
We are able to get frames rendered using a single worker with a single task and all 6 GPUs on that task. We have not been able to run 1 or 2 workers for longer than about 5 minutes before the workers stall and the system crashes. The error is "Stalled (no plugin)". The project is local and has no plugins except Redshift, which is also installed locally.

Upon restart it does the same thing.
Any ideas?

In my experience Redshift requires a healthy amount of CPU alongside the GPU workload, and quite frequently in real-world rendering scenarios there’s not even remotely enough VRAM available, so there’ll be lots of swapping to system RAM (and/or disk).

What I’m trying to say is that running concurrent GPU tasks is a pretty imprecise art, at least for now. And I personally would probably just do a single beastly 6x GPU task instead of splitting it up.

Side note: we found that the best value for money versus performance seems to be 2 GPUs per Redshift job. On those machines we would also frequently run an extra CPU-only worker and found no particular issues with that.
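If you want to see whether that kind of memory pressure is what’s biting here, it’s worth logging CPU, RAM and per-GPU usage while the workers render. A minimal sketch of such a logger (purely an illustration, not part of Deadline; it assumes Python 3 with psutil installed and nvidia-smi on the PATH, and the 5-second interval is arbitrary):

```python
# poll_usage.py - rough CPU/RAM/GPU logger to run while the workers render.
# Assumes Python 3, `pip install psutil`, and nvidia-smi available on the PATH.
import subprocess
import time

import psutil


def gpu_stats():
    """Return one CSV line per GPU: index, util %, memory used, memory total (MiB)."""
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,utilization.gpu,memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip().splitlines()


if __name__ == "__main__":
    while True:
        cpu = psutil.cpu_percent(interval=1)   # % over the last second
        ram = psutil.virtual_memory()
        print(f"CPU {cpu:5.1f}% | RAM {ram.used / 2**30:5.1f}/{ram.total / 2**30:.1f} GiB")
        for line in gpu_stats():
            idx, util, used, total = (field.strip() for field in line.split(","))
            print(f"  GPU {idx}: {util:>3}% util, {used}/{total} MiB VRAM")
        time.sleep(5)                          # polling interval is arbitrary
```

If system RAM is pegged right before a worker stalls, that points at memory pressure rather than at Deadline itself.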

Have you set GPU affinity so both workers don’t use all 6 GPUs at the same time?
I might be wrong, but to me 64 GB of RAM for 6 GPUs sounds a little low. We have slaves with 4 GPUs and they all have 128 GB (roughly 32 GB per card, versus about 11 GB per card on your box).

Regards
Bonsak

You’ve not mentioned which OS you’re running, which version of Deadline, or which host application (or is it just Redshift Standalone).

As mentioned above, I’d check the GPU affinity on the Workers. If you’re submitting to two workers with a 3-card limit, it may be switching between the cards, so it’s best to set the affinity explicitly. There are quite a few posts on here about issues with card allocation, and I find setting the affinity on the worker works best.

Do you have the TDR (Timeout Detection and Recovery) delay set if you’re running Windows? Are you using any kind of monitoring to check CPU/GPU/RAM resource usage?
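On the TDR point: Windows resets the display driver when a GPU kernel runs longer than the TdrDelay timeout (2 seconds by default), which under a heavy Redshift load can look a lot like a stalled worker or a crash. A minimal sketch of raising it via the standard GraphicsDrivers registry values; the 60-second figure is just a common choice for GPU rendering, not a value from this thread, and the same keys can be set with regedit instead:

```python
# set_tdr.py - raise the Windows GPU Timeout Detection & Recovery (TDR) delay.
# Uses the standard Microsoft TDR registry values; must be run from an
# elevated (administrator) Python, and a reboot is needed afterwards.
import winreg

KEY_PATH = r"SYSTEM\CurrentControlSet\Control\GraphicsDrivers"

with winreg.CreateKeyEx(winreg.HKEY_LOCAL_MACHINE, KEY_PATH, 0,
                        winreg.KEY_SET_VALUE) as key:
    # TdrDelay: seconds a GPU kernel may run before Windows resets the driver.
    winreg.SetValueEx(key, "TdrDelay", 0, winreg.REG_DWORD, 60)
    # TdrDdiDelay: seconds threads get to leave the driver during a reset.
    winreg.SetValueEx(key, "TdrDdiDelay", 0, winreg.REG_DWORD, 60)

print("TDR values written; reboot for them to take effect.")
```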