AWS Thinkbox Discussion Forums

Hardware part of setup

I was trying to find more details on this but had no luck, so I figured: where could be a better place to ask than where all the guys with farms actually meet :slight_smile:

So lately I’m running into a couple of weird issues and figured I’d try to find some help.
First of all, here are a couple of details on my small farm:

  1. Server: an older dual Xeon with 8 GB RAM for the Deadline Repository, running Windows Server 2016.

  2. The Repository is on a Synology 1813+ NAS, where all the projects live as well. It has 8x 3 TB WD Red Pro HDDs and is connected with 4x 1 Gbit link aggregation to a 16-port switch. LACP on the switch is working fine.

  3. Next are 10 render nodes, all on 1 Gbit connections, with 4 GPUs each for Redshift and other GPU rendering. Four of them run CentOS 7.6 and the rest Windows 10. They all have SSDs for the OS and software. Redshift and Maya preferences are loaded from a shared NAS location.

  4. On top of that, back home is another, smaller Synology 415+ NAS with 4 WD drives, and connected to it a single workstation with 2 more GPUs, running Linux Mint.

Now the first thing that I’ve noticed in the Slave logs is that the Windows machines are sometimes, not always, taking 2 or more times longer to save rendered frames to the NAS. A CentOS render node finishes rendering and saves a frame like:
Saved file ‘//media/Storage/…/EP14_040_RENDER.v001_masterLayer.Beauty.1061.exr’ in 2.82s
while a Windows node:
[Redshift] Saved file ‘P:…1060.exr’ in 5.15s
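
As a rough way to compare raw write speed outside of Redshift, a sketch like this could be run on both a CentOS and a Windows node against the same share (the target path is just a placeholder, e.g. P:\bench.tmp on Windows):

```python
#!/usr/bin/env python3
# Quick write-speed check (a sketch): times writing an EXR-sized blob
# to the NAS share, so Redshift/Deadline are out of the equation.
# The default target path is a placeholder -- point it at the same
# share on each OS (e.g. P:\bench.tmp on Windows).
import os
import sys
import time

TARGET = sys.argv[1] if len(sys.argv) > 1 else "/media/Storage/bench.tmp"
SIZE_MB = 100  # roughly a hefty multi-layer EXR

data = os.urandom(1024 * 1024)  # 1 MB of incompressible data
start = time.perf_counter()
with open(TARGET, "wb") as f:
    for _ in range(SIZE_MB):
        f.write(data)
    f.flush()
    os.fsync(f.fileno())  # make sure it actually reached the server
elapsed = time.perf_counter() - start
print(f"wrote {SIZE_MB} MB in {elapsed:.2f}s ({SIZE_MB / elapsed:.1f} MB/s)")
os.remove(TARGET)
```

Running it a few times from a CentOS node and a Windows node against the same share would show whether Windows is consistently ~2x slower at raw writes too, i.e. whether it’s the OS/SMB side rather than Redshift.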

But then I ran into an even worse thing.
At home I’m rendering 2 frames at once, each with a single GPU,
and at the farm I’m rendering 2 frames at once with 2 GPUs per frame (4 GPUs in a system).
Those single GPUs at home render frames faster than the 2-GPU jobs on the farm’s render nodes.
Same GPUs of course, testing with 1080 Tis.

Now the only thing that comes to my mind is that the NAS could be a big bottleneck here,
for loading projects, saving finished frames, and everything else in between.

Now, after all this thinking, what do you guys think? And if anyone would be happy to share a few details on their farm setup, especially the NAS part, that would be great.
I kind of feel that a while ago, when the farm was smaller, this NAS could serve it, but the farm has outgrown it and it’s now a bottleneck. Or is it?
Looking forward to any tips or tricks from anyone who has a bit more experience and time with this.
Thanks!!!

This isn’t an easy thing to figure out; hopefully I can suggest some things that might help you on your path:

So I suspect you might be hitting an IO problem.

This might be local disk, or writing to the shared NAS.
Starting off, Linux has faster IO than Windows, so a few seconds of difference is not something to worry too much about. A few minutes, then I’d be worried.

I’d start with collecting telemetry on your NAS: IO stats, network, and make sure you’re not hitting a single Ethernet connection. Does your switch allow you to collect stats on the throughput of each port and what is connected to it?
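
If your switch speaks SNMP, you can pull those per-port counters yourself. A sketch using the pysnmp library (the switch IP and “public” community string are placeholders for your own):

```python
#!/usr/bin/env python3
# Sketch: walk per-port 64-bit RX byte counters off a managed switch
# via SNMP. Requires `pip install pysnmp`; the switch IP and "public"
# community string below are placeholders.
from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                          ContextData, ObjectType, ObjectIdentity, nextCmd)

for err_ind, err_status, _, var_binds in nextCmd(
        SnmpEngine(),
        CommunityData("public"),
        UdpTransportTarget(("192.168.1.2", 161)),
        ContextData(),
        ObjectType(ObjectIdentity("IF-MIB", "ifHCInOctets")),
        lexicographicMode=False):       # stop at the end of this column
    if err_ind or err_status:
        break
    for oid, value in var_binds:
        print(oid.prettyPrint(), "=", value.prettyPrint())
```

Sample it twice a minute apart and the deltas give you bytes per minute per port, which tells you whether traffic is actually spreading across the LACP links.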

Collecting telemetry:
Since you have a mix of Windows and Linux machines, I would suggest a Telegraf/InfluxDB/Grafana stack. That would allow you to see what is occurring on all the machines in your farm at a hardware level.
I’m not familiar with the Synology machines, but if they have a Linux-based operating system, then even if you can’t install Telegraf, you can create cron jobs of simple shell scripts to collect information and forward it to InfluxDB/Grafana, so you’re at least collecting it.
I have set up my sh***y home WD NAS with similar scripts to forward telemetry to a central graphing system, so I know it’s possible.
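
Something as small as this would do as a starting point (a sketch; it assumes an InfluxDB 1.x endpoint, and the host name, database name and bond0 interface are all placeholders):

```python
#!/usr/bin/env python3
# Minimal cron-driven collector (a sketch): reads the bond's byte
# counters from /proc/net/dev and pushes them to InfluxDB 1.x in line
# protocol. "influx.local", the "telemetry" database and the "bond0"
# interface name are all placeholders.
import socket
import time
import urllib.request

INFLUX_URL = "http://influx.local:8086/write?db=telemetry"
IFACE = "bond0"
HOST = socket.gethostname()

def read_net_bytes(iface):
    """Return (rx_bytes, tx_bytes) for one interface from /proc/net/dev."""
    with open("/proc/net/dev") as f:
        for line in f:
            name, _, rest = line.partition(":")
            if name.strip() == iface:
                fields = rest.split()
                return int(fields[0]), int(fields[8])
    raise ValueError(f"interface {iface} not found")

rx, tx = read_net_bytes(IFACE)
# InfluxDB line protocol: measurement,tag field=value timestamp(ns)
line = f"net,host={HOST} rx_bytes={rx}i,tx_bytes={tx}i {time.time_ns()}"
req = urllib.request.Request(INFLUX_URL, data=line.encode(), method="POST")
urllib.request.urlopen(req, timeout=5)
```

Run it from cron every minute; the counters are cumulative, so graph their non-negative derivative in Grafana to get throughput over time.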

Ultimately you’re trying to see if one or more things are peaking or maxing out more than the others.

One last thing you could also try: on one of your Windows machines, move all the Maya/Redshift software locally onto the Slave, so that you’re not competing for IO when you’re writing the frames out.
We have this setup on our Deadline farm: on Linux the software is streamed over from the NAS, on Windows all the software is local.

hope this helps.

That’s some really good advice from kwatts, as always!
I have a very similar setup to yours, mirkoj, using Synology boxes at the studio. Synology DSM is Debian based.
We don’t do any GPU rendering, as I don’t have a single GPU in my small Supermicro farm, but I can confirm better/faster IO write times with Nuke on CentOS 7.5/7.6 vs Windows.

Synology DS1517+, ext4 RAID 5, 5x 6 TB disks (21.72 TB), all Seagate IronWolf drives, and 4x 1 Gbit NICs bonded like yours.
What does your overall throughput look like while everything is at full throttle?
From everything I’ve read and seen, these 4x 1 Gbit NIC Synology boxes can do about 400 MB/s up/down with all 4 NICs going. I’ve only seen my NAS reach that a couple of times, when things were in full gear production-wise. We briefly had artists complain of slow playback on plates in Nuke etc., with about 16 artist workstations/clients and a dozen Slaves rendering on the farm at the same time. Mostly 4K EXRs too.
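
Rough math on that, for reference (the 20% overhead figure is a ballpark assumption, not a measurement):

```python
# Rough numbers for a 4x 1 Gbit LACP bond (the 20% protocol overhead
# figure is a ballpark assumption, not a measurement).
links = 4
link_gbit = 1.0                    # per-link line rate
overhead = 0.80                    # ~20% lost to Ethernet/TCP/SMB framing
aggregate = links * link_gbit * 1000 / 8 * overhead
single_flow = link_gbit * 1000 / 8 * overhead
print(f"aggregate across many clients: ~{aggregate:.0f} MB/s")
# LACP hashes per flow, so one client connection still rides one link:
print(f"any single client flow:        ~{single_flow:.0f} MB/s")
```

That matches the ~400 MB/s figure, and it’s worth remembering that LACP only helps with many clients at once; any single node still tops out at roughly one link’s worth.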

Not sure if you’re loading apps locally or from the NAS as kwatts asked, but I prefer to install things locally, so as not to clog my 1 Gbit network. Though having all apps hosted on a NAS would be great for workstation rollouts.

Another thing to note is that I do not have the Repository on the Synology box the Slaves are writing frames out to. I’ve got the database and Repository installed separately on a Windows 2012 R2 server.
I would think your home setup doesn’t have the IO overhead of the other one. Honestly, everyone’s biggest bottleneck at this moment in time is SATA. It makes me feel like a 10 Gbit network wouldn’t be efficient unless you’re running all-flash/NVMe storage.
That’s about my 2 cents, hope it’s somewhat insightful.

Good points.
So as I figured out, what could be part of the problem is:

  1. disk/volume read/write speed
  2. network bottleneck

So to try and see if I can get anything better sorted out, I’ve borrowed a server to test with:
HP ProLiant DL380e Gen8 12 LFF
2 x 676947-001, Xeon E5-2420 6 Core 1.9 GHz
6 x 672631-B21, HP 16GB (1x16GB) DDR3 MEM KIT
12 x 693687-B21 HP 4TB 6G SATA 7.2k rpm LFF
2 x 656362-B21, HP 460W Common Slot PSU
I’ve dropped FreeNAS on it (not as smooth as Synology’s DSM, but it does the work and is still more usable than a plain Linux setup :wink: ). For now I’ve fired up 4x 1 Gbit LACP, but I will add one more card for a total of 8 ports in link aggregation. A 2-port 10 Gbit card (20 Gbit total) would be better, but I would have to change the switch as well, so for now this will do I guess.
To start, I think I did see a bit better overall disk read/write speeds with the ZFS file system and raidz3.
But the proper test will be done tomorrow, when I put the bigger aggregation in and try to saturate it with all 10 render nodes, along the lines of the sketch below.
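
Something like this, run on all 10 nodes at the same time while watching the NAS and switch telemetry (a sketch; the mount point is a placeholder):

```python
#!/usr/bin/env python3
# Saturation-test sketch: N parallel write streams against the new
# FreeNAS share. Run a copy of this on every render node at the same
# time while watching the NAS/switch telemetry. The mount point below
# is a placeholder.
import os
import time
from concurrent.futures import ProcessPoolExecutor

MOUNT = "/mnt/tank/bench"   # hypothetical FreeNAS share mount
STREAMS = 4                 # parallel writers per node
SIZE_MB = 500               # data written per stream

def write_stream(i):
    data = os.urandom(1024 * 1024)          # 1 MB of incompressible data
    path = os.path.join(MOUNT, f"stream_{os.getpid()}_{i}.tmp")
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(SIZE_MB):
            f.write(data)
        os.fsync(f.fileno())
    elapsed = time.perf_counter() - start
    os.remove(path)
    return SIZE_MB / elapsed                # MB/s for this stream

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=STREAMS) as pool:
        rates = list(pool.map(write_stream, range(STREAMS)))
    # Streams overlap, so the sum is only an approximation of node throughput.
    print("per-stream MB/s:", ", ".join(f"{r:.0f}" for r in rates))
    print(f"node total: ~{sum(rates):.0f} MB/s")
```
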
So the plan is to move the projects to the new file storage and leave the old Synology for other shared stuff, like the Deadline Repository, Maya preferences, and similar things that are shared over the network.
Will see how it goes :slight_smile:
Thanks for all the points!
