AWS Thinkbox Discussion Forums

8x GPU server + Deadline + considerations?

Hello all.

We are currently in the process of doing a big update to our puny farm and I wanted to see if anyone has had any experience with using Deadline on a Dual Xeon 8x GPU machine before we go down that road.

The machines we are planning to get are the SUPERSERVER 4028GR-TRT made by Thinkmate, which will house 8x 1080 Ti cards - http://www.thinkmate.com/system/superserver-4028gr-trt/233833

Dual E5-2643 v4 3.40GHz Xeons with 64GB of RAM initially, expandable to 128GB if we need to purchase more.

The software we will be using is Deadline + Redshift via Maya + Redshift Standalone + Vray GPU via Maya on Win10. Has anyone had any experience with this type of setup before? We’d like to purchase with confidence, but I’ve heard of vague stability issues running Deadline + Redshift on servers housing 8x GPUs. Ideally I’d like to have at least 1 or 2 GPUs per task.

If anyone has this type of hardware setup up and running (or is experiencing stability issues), please pipe in. I’d like to get your input.

Thank you.

Not much advice from here. Deadline itself requires very little in the way of resources, so the general rule is check what your renderer needs. So, more of a Redshift + Maya question. If Deadline’s causing stability issues there though, I’m game to test and get to the bottom of that.

For the GPU affinity, there’s some good stuff built into Deadline for you:
docs.thinkboxsoftware.com/produ … u-affinity (per-job in Maya)
docs.thinkboxsoftware.com/produ … u-affinity (per-Slave)
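In case a concrete shape helps: Deadline jobs are submitted as a job info file plus a plugin info file passed to deadlinecommand. Below is a minimal sketch of such a pair for a Maya/Redshift job. The key names `GPUsPerTask` and `Renderer` are assumptions based on the submitter docs linked above, and the scene path is made up — verify both against your Deadline version before relying on this.

```python
# A minimal sketch of a Deadline job submission that pins GPUs per task.
# ASSUMPTIONS: the key names GPUsPerTask and Renderer are taken from the
# MayaBatch/Redshift submitter docs linked above -- verify against your version.
import os
import tempfile

def write_job_files(scene, frames="1-100", gpus_per_task=2):
    """Write a Deadline job info / plugin info pair and return their paths."""
    job_info = "\n".join([
        "Plugin=MayaBatch",
        "Name=Redshift GPU test",
        f"Frames={frames}",
        "ConcurrentTasks=2",  # two tasks at once -> 2 GPUs each on a 4-GPU box
    ])
    plugin_info = "\n".join([
        f"SceneFile={scene}",
        "Renderer=redshift",
        f"GPUsPerTask={gpus_per_task}",  # per-task GPU count (hypothetical key)
    ])
    d = tempfile.mkdtemp()
    ji = os.path.join(d, "job_info.job")
    pi = os.path.join(d, "plugin_info.job")
    with open(ji, "w") as f:
        f.write(job_info)
    with open(pi, "w") as f:
        f.write(plugin_info)
    return ji, pi

# Hypothetical scene path for illustration only.
ji, pi = write_job_files("//server/projects/shot010.ma")
# Actual submission would then be: deadlinecommand <job_info> <plugin_info>
print(f"deadlinecommand {ji} {pi}")
```

The alternative is the per-Slave route: run multiple Slave instances on the box and give each one a GPU affinity override in its settings, so each instance only ever sees its own pair of cards.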

I would have a chat with Tomasz, as he seems pretty happy with what he has achieved with GPUs & Deadline:
dabarti.com/vfx/short-guide-to-g … -v-ray-rt/
deadline.thinkboxsoftware.com/dabarti

I think Tomasz is also a guest speaker at an Nvidia event in Germany this week, talking about this topic?

Thank you for the replies. I just wanted to check whether I should be aware of any hardware considerations. Glad to know that there aren’t any known quirks with Deadline and such hardware.

My rep at Thinkmate is setting up a box that I can remote into to test everything before we purchase. I’ll keep an eye on Redshift, as that’s my main concern right now. :slight_smile:

Thank you again for the replies.

I believe Redshift will only span 2 GPUs… we’re using Octane with C4D and Maya on our multi-GPU boxes, but not through Deadline (yet).

I have a dual Xeon with 8 GPUs. It’s wonderful for mining, but any 3D work makes it really unstable and prone to blue screens.

I had to set a few things to get all the GPUs detected: enable Above 4G Decoding in the BIOS (PCIe section) and install Windows 10 (previously on Win 7).

And now my main problem is that when Deadline tries to start 3ds Max, I get a blue screen every time.

I don’t know if it could be related to that 4G setting in the BIOS, but I think it’s better to put a maximum of 4 GPUs in one box. That way you also get more distributed points of failure.

@Strob - maybe it’s worth temporarily removing some of the GPU cards from your machine to see if it re-stabilises?

Yes, for that I’d have to remove them all, set Above 4G back to disabled, and reinstall the NVIDIA driver, maybe even Windows. So I need a free day to try that, which is pretty difficult to find these days ;(

It’s worth noting too that since I reinstalled Windows 10 entirely it’s a lot more stable, but once in a while the computer freezes. The worst part is that I can’t render with Deadline at all on this computer: it blue-screens every time the Deadline Slave tries to load 3ds Max.

Just in case it could be of any interest: I finally removed a few GPUs (I kept only 3, plugged directly into the PCIe slots instead of using risers like before) and reinstalled Windows with the “above 4G” option disabled. And now I can render from Deadline without problems!

Another app that was causing blue screens with Above 4G enabled was the GPU renderer Redshift, btw… Just starting Max after installing the Redshift demo caused a blue screen.

One thing I could have tried, if I really wanted 8 GPUs in a single machine, would have been to set the BIOS mode to UEFI before installing Windows and make sure Windows installs with the BIOS in UEFI mode (all of this with Above 4G disabled), then install the NVIDIA drivers, and only after all of that enable Above 4G. I found that info on a mining forum.

But I didn’t have time to pursue my experiments in that area for now, and anyway I have a brand new system coming in this week that I want to fit 3 GPUs into as well. So I will have 2 extra GPUs to fit somewhere in my render farm.

Anyway, trying to fit 8 GPUs in a box for mining is one thing, but trying to work with such a machine is like opening a can of worms. And if you miss some steps in your initial installation, you may only discover it later, for example once in production! And then you may have to reinstall everything :cry: .

Oh, and if you order that system from Thinkmate and it works well for 3D rendering, I would be very curious to know whether they use Above 4G and UEFI mode too. And beware if you ever have to reinstall the system yourself…

Thank you for the advice! I didn’t have “notify” on for this thread, so sorry I’m a little late posting back here.

Thinkmate has agreed to set up a test machine that they will allow me to remote into! The only real difference is that it will have Tesla cards, not the GTX 1080 Ti cards we’re planning to go with (though that might change now that the 1070 Ti is out). As long as the Tesla cards are of Pascal architecture, I figured they’d be fine for testing stability.

I really don’t “want” 8x GPUs per box, but it seems to be the biggest bang for our buck right now as opposed to spreading the new farm over more machines. We are running on a fixed budget now so I’m trying to make the most of it. I’d build the servers, but it needs to be off the shelf for liability reasons.

What I’m planning to do for tests is install Maya/Vray/Redshift/Deadline on the box and submit both a Vray and a Redshift animation scene (not at once, of course). Have the scenes be heavy enough to render overnight, then see what happens in the morning (so ideally 3 days and two nights of testing). If there are any other variables I should be looking at, please let me know! I’m not going to futz with the BIOS (can’t, as it’s a remote machine), but if it causes issues, I’ll recommend the BIOS change be made by the rep setting this up for me.

All I know is that if we get 2 super servers, spend all that money and shit goes south, I’m sort of skunked. :slight_smile: Trying to mitigate risk here as much as I can.

Best,
Clayton

Notify is on now, so I’ll be more prompt replying back.

On my part, I now have 4 GPUs (2 inside and 2 on risers in an external rig). I installed Windows with Above 4G disabled, and then when I went into the BIOS after plugging in the cards, it complained that PCIe resources were short and asked me to enable Above 4G, which I did. And now everything is running smoothly.

One thing I noticed too is that at first I used one brand of riser and could not boot. I changed the risers for another brand and it went smoothly. I suspect my blue-screen problems could have been due to those risers, so you may have fewer problems with that Thinkmate box if you don’t use risers.

Another thing to take into consideration: if you use GTX instead of Tesla cards, you will need that camel-back panel to be able to close your server (because the power connectors are on the side instead of the front as on Teslas), but I think they have it in the options at Thinkmate. You also have to choose a model that will not overheat, since there will be no space between the cards. My model (Zotac Mini 1080 Ti) cannot be mounted next to each other because the fans protrude a tiny bit and can’t even spin. I’ve heard the Founders Edition cards are better for air cooling while almost touching each other.

First, on the build: 64GB is way too low.
The usual way to do things with Deadline and Redshift is to submit tasks with 2 GPUs per frame. That maximizes scaling with GPUs and Redshift.
For a 4-GPU system that would be 2 frames at once with 2 GPUs per frame: either 2 concurrent tasks with 2 GPUs per task, or 2 Slave instances, each with its GPU affinity overridden to use 2 GPUs, rendering with all the GPUs that Slave has (i.e. 2).
And now there is your problem :slight_smile:

An 8-GPU system would require 4 instances running, 2 GPUs each. That also means running 4 instances of Maya, each with the scene loaded into system RAM.
I have 4-GPU computers and from time to time I hit 64GB of RAM, so for 8 GPUs the minimum would be at least 128GB of RAM. Unless your Maya scenes are really light ones.
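A back-of-envelope sketch of that RAM math. The 28GB-per-instance figure is an illustrative assumption, picked so a 4-GPU box lands near the ~64GB ceiling mentioned above; it is not a measured value:

```python
# Back-of-envelope check of the slave-instance / RAM sizing above.
# ASSUMPTION: ~28 GB per Maya instance is illustrative, calibrated so a
# 4-GPU box lands near the ~64 GB ceiling mentioned -- not a measurement.
def min_ram_gb(total_gpus, gpus_per_task=2, ram_per_instance_gb=28, os_overhead_gb=8):
    """One slave instance per pair of GPUs, each holding its own scene copy."""
    instances = total_gpus // gpus_per_task
    return instances, instances * ram_per_instance_gb + os_overhead_gb

print(min_ram_gb(4))  # -> (2, 64): 2 instances, ~64 GB total
print(min_ram_gb(8))  # -> (4, 120): 4 instances, so 128 GB is a sane minimum
```

The point is simply that RAM need scales with the number of Maya instances, not with GPU count directly, so halving GPUs-per-task doubles the memory bill.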

Another consideration is network speed. 4 instances loading the same scene from the same network location over a single 1 Gbit connection can be a bit of a bottleneck as well.
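Rough numbers on that bottleneck, assuming all instances pull the scene at job start and an illustrative 2GB scene size:

```python
# Rough math on the shared 1 GbE link: every slave instance pulls its own
# copy of the scene at job start. The 2 GB scene size is illustrative.
def load_time_seconds(instances, scene_gb, link_gbits=1.0):
    total_bits = instances * scene_gb * 8e9  # decimal GB -> bits
    return total_bits / (link_gbits * 1e9)

print(load_time_seconds(4, 2.0))                   # 4 x 2 GB over 1 GbE -> 64.0 s
print(load_time_seconds(4, 2.0, link_gbits=10.0))  # same load over 10 GbE -> 6.4 s
```

A minute of stalled GPUs per job start may or may not matter for overnight animation renders, but it compounds if tasks are short.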

Honestly, from what I’ve seen so far, 4-GPU systems are kind of the sweet spot for small render farms.
For larger farms there’s a much bigger difference and effect on the number of licences needed and the space the machines take up, but for smaller farms 4-GPU systems are the sweet spot.

Do you need 40,960 CUDA cores in a single machine?
aws.amazon.com/blogs/aws/new-am … 0-gpus-p3/
#justsaying :wink:

The only reason I was looking at 8x GPUs in a single machine is that I cannot seem to find an off-the-shelf option that will do 4x GPUs with 1 CPU. Isn’t it sort of a waste, as far as GPU rendering is concerned, to have 2x Xeons in a box with only 4 GPUs?

Hands down, that’s the only reason: cost. Now, the 1 Gbit bandwidth is concerning. 128GB+ of RAM isn’t as big a deal, but the bandwidth is a good call. Our Maya scenes rarely go over 2GB in file size, so I don’t think that would be too much of an issue.
