GPU Affinity on Linux CentOS 6.8

Hey,

I’m having some issues rendering on a fresh Linux system. The system has 8 Nvidia GTX 980 Ti cards in it, and I’ve set up the GPU Affinity so that each frame is allocated two GPUs, which means I have 4 slave instances running at a time.

Ex:
GPU 1 - 0,1
GPU 2 - 2,3
GPU 3 - 4,5
GPU 4 - 6,7
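
For reference, the split above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic; the function name `gpu_pairs` is made up for illustration and is not part of Deadline or Redshift.

```python
def gpu_pairs(total_gpus, gpus_per_instance):
    """Split GPU ordinals 0..total_gpus-1 into consecutive groups,
    one group per slave instance (hypothetical helper)."""
    if total_gpus % gpus_per_instance != 0:
        raise ValueError("GPUs do not divide evenly between instances")
    return [
        list(range(start, start + gpus_per_instance))
        for start in range(0, total_gpus, gpus_per_instance)
    ]


# 8 cards, 2 per slave instance, printed in the same layout as the list above.
for index, gpus in enumerate(gpu_pairs(8, 2), start=1):
    print("GPU %d - %s" % (index, ",".join(str(g) for g in gpus)))
```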

I’m used to reading Deadline slave logs on Windows, not so much on Linux, but it appears that every slave is using all 8 GPUs, across the 4 frames rendering concurrently.

This seems to be causing slaves to hang, and it also crashes MayaBatch occasionally. I’m not sure whether this log is related to those issues, but using GPU affinity on Windows did resolve many errors for us in the past.

Screenshot of GPU Affinity / Slave instance Names.
http://i67.tinypic.com/2i1yr8k.png

Here’s the log from one slave; I can attach the full log if needed.

2016-10-06 15:22:57: 0: STDOUT: mel: mel: [Redshift] Cache path: /root/redshift/cache
2016-10-06 15:22:57: 0: STDOUT: [Redshift] Redshift Initialized
2016-10-06 15:22:57: 0: STDOUT: [Redshift] Linux Platform
2016-10-06 15:22:57: 0: STDOUT: [Redshift] Release Build
2016-10-06 15:22:57: 0: STDOUT: [Redshift] Number of CPU HW threads: 24
2016-10-06 15:22:57: 0: STDOUT: [Redshift] Total system memory: 78.59 GB
2016-10-06 15:22:57: 0: STDOUT: [Redshift] Creating CUDA contexts
2016-10-06 15:22:57: 0: STDOUT: [Redshift] CUDA init ok
2016-10-06 15:22:57: 0: STDOUT: [Redshift] Ordinals: { 0 1 2 3 4 5 6 7 }
2016-10-06 15:23:04: 0: STDOUT: [Redshift] Initializing GPUComputing module (CUDA). Ordinal 0
2016-10-06 15:23:04: 0: STDOUT: [Redshift] CUDA Ver: 8000

2016-10-06 15:23:04: 0: STDOUT: [Redshift] Device 1/8 : GeForce GTX 980 Ti
2016-10-06 15:23:04: 0: STDOUT: [Redshift] Compute capability: 5.2
2016-10-06 15:23:04: 0: STDOUT: [Redshift] Num multiprocessors: 22
2016-10-06 15:23:04: 0: STDOUT: [Redshift] PCI busID: 4, deviceID: 0, domainID: 0
2016-10-06 15:23:04: 0: STDOUT: [Redshift] Theoretical memory bandwidth: 336.480011 GB/Sec
2016-10-06 15:23:04: 0: STDOUT: [Redshift] Measured PCIe bandwidth (pinned CPU->GPU): 3.734444 GB/s
2016-10-06 15:23:04: 0: STDOUT: [Redshift] Measured PCIe bandwidth (pinned GPU->CPU): 3.164830 GB/s
2016-10-06 15:23:04: 0: STDOUT: [Redshift] Measured PCIe bandwidth (paged CPU->GPU): 3.587930 GB/s
2016-10-06 15:23:04: 0: STDOUT: [Redshift] Measured PCIe bandwidth (paged GPU->CPU): 5.033083 GB/s
2016-10-06 15:23:04: 0: STDOUT: [Redshift] Estimated GPU->CPU latency (0): 0.013842 ms
2016-10-06 15:23:04: 0: STDOUT: [Redshift] Estimated GPU->CPU latency (1): 0.013315 ms
2016-10-06 15:23:04: 0: STDOUT: [Redshift] Estimated GPU->CPU latency (2): 0.008581 ms
2016-10-06 15:23:04: 0: STDOUT: [Redshift] Estimated GPU->CPU latency (3): 0.008449 ms
2016-10-06 15:23:04: 0: STDOUT: [Redshift] New CUDA context created
2016-10-06 15:23:04: 0: STDOUT: [Redshift] Available memory: 5699.3125 MB out of 6077.3750 MB
2016-10-06 15:23:04: 0: STDOUT: [Redshift] Initializing GPUComputing module (CUDA). Ordinal 1
2016-10-06 15:23:04: 0: STDOUT: [Redshift] CUDA Ver: 8000

2016-10-06 15:23:04: 0: STDOUT: [Redshift] Device 2/8 : GeForce GTX 980 Ti
2016-10-06 15:23:04: 0: STDOUT: [Redshift] Compute capability: 5.2
2016-10-06 15:23:04: 0: STDOUT: [Redshift] Num multiprocessors: 22
2016-10-06 15:23:04: 0: STDOUT: [Redshift] PCI busID: 5, deviceID: 0, domainID: 0
2016-10-06 15:23:04: 0: STDOUT: [Redshift] Theoretical memory bandwidth: 336.480011 GB/Sec
2016-10-06 15:23:04: 0: STDOUT: [Redshift] Measured PCIe bandwidth (pinned CPU->GPU): 5.265391 GB/s
2016-10-06 15:23:04: 0: STDOUT: [Redshift] Measured PCIe bandwidth (pinned GPU->CPU): 3.529710 GB/s
2016-10-06 15:23:04: 0: STDOUT: [Redshift] Measured PCIe bandwidth (paged CPU->GPU): 3.286990 GB/s
2016-10-06 15:23:04: 0: STDOUT: [Redshift] Measured PCIe bandwidth (paged GPU->CPU): 5.071782 GB/s
2016-10-06 15:23:04: 0: STDOUT: [Redshift] Estimated GPU->CPU latency (0): 0.014596 ms
2016-10-06 15:23:04: 0: STDOUT: [Redshift] Estimated GPU->CPU latency (1): 0.014719 ms
2016-10-06 15:23:04: 0: STDOUT: [Redshift] Estimated GPU->CPU latency (2): 0.016304 ms
2016-10-06 15:23:04: 0: STDOUT: [Redshift] Estimated GPU->CPU latency (3): 0.014752 ms
2016-10-06 15:23:04: 0: STDOUT: [Redshift] New CUDA context created
2016-10-06 15:23:04: 0: STDOUT: [Redshift] Available memory: 5767.4375 MB out of 6077.7500 MB
2016-10-06 15:23:05: 0: STDOUT: [Redshift] Initializing GPUComputing module (CUDA). Ordinal 2
2016-10-06 15:23:05: 0: STDOUT: [Redshift] CUDA Ver: 8000

2016-10-06 15:23:05: 0: STDOUT: [Redshift] Device 3/8 : GeForce GTX 980 Ti
2016-10-06 15:23:05: 0: STDOUT: [Redshift] Compute capability: 5.2
2016-10-06 15:23:05: 0: STDOUT: [Redshift] Num multiprocessors: 22
2016-10-06 15:23:05: 0: STDOUT: [Redshift] PCI busID: 8, deviceID: 0, domainID: 0
2016-10-06 15:23:05: 0: STDOUT: [Redshift] Theoretical memory bandwidth: 336.480011 GB/Sec
2016-10-06 15:23:05: 0: STDOUT: [Redshift] Measured PCIe bandwidth (pinned CPU->GPU): 10.262206 GB/s
2016-10-06 15:23:05: 0: STDOUT: [Redshift] Measured PCIe bandwidth (pinned GPU->CPU): 7.233344 GB/s
2016-10-06 15:23:05: 0: STDOUT: [Redshift] Measured PCIe bandwidth (paged CPU->GPU): 2.917630 GB/s
2016-10-06 15:23:05: 0: STDOUT: [Redshift] Measured PCIe bandwidth (paged GPU->CPU): 1.262746 GB/s
2016-10-06 15:23:05: 0: STDOUT: [Redshift] Estimated GPU->CPU latency (0): 0.030813 ms
2016-10-06 15:23:05: 0: STDOUT: [Redshift] Estimated GPU->CPU latency (1): 0.018031 ms
2016-10-06 15:23:05: 0: STDOUT: [Redshift] Estimated GPU->CPU latency (2): 0.017355 ms
2016-10-06 15:23:05: 0: STDOUT: [Redshift] Estimated GPU->CPU latency (3): 0.014718 ms
2016-10-06 15:23:05: 0: STDOUT: [Redshift] New CUDA context created
2016-10-06 15:23:05: 0: STDOUT: [Redshift] Available memory: 5767.4375 MB out of 6077.7500 MB
2016-10-06 15:23:05: 0: STDOUT: [Redshift] Initializing GPUComputing module (CUDA). Ordinal 3
2016-10-06 15:23:05: 0: STDOUT: [Redshift] CUDA Ver: 8000
2016-10-06 15:23:05: 0: STDOUT: [Redshift] Device 4/8 : GeForce GTX 980 Ti

2016-10-06 15:23:05: 0: STDOUT: [Redshift] Compute capability: 5.2
2016-10-06 15:23:05: 0: STDOUT: [Redshift] Num multiprocessors: 22
2016-10-06 15:23:05: 0: STDOUT: [Redshift] PCI busID: 9, deviceID: 0, domainID: 0
2016-10-06 15:23:05: 0: STDOUT: [Redshift] Theoretical memory bandwidth: 336.480011 GB/Sec
2016-10-06 15:23:05: 0: STDOUT: [Redshift] Measured PCIe bandwidth (pinned CPU->GPU): 4.354003 GB/s
2016-10-06 15:23:05: 0: STDOUT: [Redshift] Measured PCIe bandwidth (pinned GPU->CPU): 3.055894 GB/s
2016-10-06 15:23:06: 0: STDOUT: [Redshift] Measured PCIe bandwidth (paged CPU->GPU): 1.430131 GB/s
2016-10-06 15:23:06: 0: STDOUT: [Redshift] Measured PCIe bandwidth (paged GPU->CPU): 1.638408 GB/s
2016-10-06 15:23:06: 0: STDOUT: [Redshift] Estimated GPU->CPU latency (0): 0.023706 ms
2016-10-06 15:23:06: 0: STDOUT: [Redshift] Estimated GPU->CPU latency (1): 0.024728 ms
2016-10-06 15:23:06: 0: STDOUT: [Redshift] Estimated GPU->CPU latency (2): 0.026575 ms
2016-10-06 15:23:06: 0: STDOUT: [Redshift] Estimated GPU->CPU latency (3): 0.025886 ms
2016-10-06 15:23:06: 0: STDOUT: [Redshift] New CUDA context created
2016-10-06 15:23:06: 0: STDOUT: [Redshift] Available memory: 5767.4375 MB out of 6077.7500 MB
2016-10-06 15:23:06: 0: STDOUT: [Redshift] Initializing GPUComputing module (CUDA). Ordinal 4
2016-10-06 15:23:06: 0: STDOUT: [Redshift] CUDA Ver: 8000
2016-10-06 15:23:06: 0: STDOUT: [Redshift] Device 5/8 : GeForce GTX 980 Ti

2016-10-06 15:23:06: 0: STDOUT: [Redshift] Compute capability: 5.2
2016-10-06 15:23:06: 0: STDOUT: [Redshift] Num multiprocessors: 22
2016-10-06 15:23:06: 0: STDOUT: [Redshift] PCI busID: 131, deviceID: 0, domainID: 0
2016-10-06 15:23:06: 0: STDOUT: [Redshift] Theoretical memory bandwidth: 336.480011 GB/Sec
2016-10-06 15:23:06: 0: STDOUT: [Redshift] Measured PCIe bandwidth (pinned CPU->GPU): 10.882689 GB/s
2016-10-06 15:23:06: 0: STDOUT: [Redshift] Measured PCIe bandwidth (pinned GPU->CPU): 6.997245 GB/s
2016-10-06 15:23:06: 0: STDOUT: [Redshift] Measured PCIe bandwidth (paged CPU->GPU): 1.953577 GB/s
2016-10-06 15:23:06: 0: STDOUT: [Redshift] Measured PCIe bandwidth (paged GPU->CPU): 2.272115 GB/s
2016-10-06 15:23:06: 0: STDOUT: [Redshift] Estimated GPU->CPU latency (0): 0.037778 ms
2016-10-06 15:23:06: 0: STDOUT: [Redshift] Estimated GPU->CPU latency (1): 0.030086 ms
2016-10-06 15:23:06: 0: STDOUT: [Redshift] Estimated GPU->CPU latency (2): 0.019099 ms
2016-10-06 15:23:06: 0: STDOUT: [Redshift] Estimated GPU->CPU latency (3): 0.012297 ms
2016-10-06 15:23:06: 0: STDOUT: [Redshift] New CUDA context created
2016-10-06 15:23:06: 0: STDOUT: [Redshift] Available memory: 5767.4375 MB out of 6077.7500 MB
2016-10-06 15:23:06: 0: STDOUT: [Redshift] Initializing GPUComputing module (CUDA). Ordinal 5
2016-10-06 15:23:06: 0: STDOUT: [Redshift] CUDA Ver: 8000
2016-10-06 15:23:06: 0: STDOUT: [Redshift] Device 6/8 : GeForce GTX 980 Ti

2016-10-06 15:23:06: 0: STDOUT: [Redshift] Compute capability: 5.2
2016-10-06 15:23:06: 0: STDOUT: [Redshift] Num multiprocessors: 22
2016-10-06 15:23:06: 0: STDOUT: [Redshift] PCI busID: 132, deviceID: 0, domainID: 0
2016-10-06 15:23:06: 0: STDOUT: [Redshift] Theoretical memory bandwidth: 336.480011 GB/Sec
2016-10-06 15:23:07: 0: STDOUT: [Redshift] Measured PCIe bandwidth (pinned CPU->GPU): 11.277424 GB/s
2016-10-06 15:23:07: 0: STDOUT: [Redshift] Measured PCIe bandwidth (pinned GPU->CPU): 6.180534 GB/s
2016-10-06 15:23:07: 0: STDOUT: [Redshift] Measured PCIe bandwidth (paged CPU->GPU): 1.533541 GB/s
2016-10-06 15:23:07: 0: STDOUT: [Redshift] Measured PCIe bandwidth (paged GPU->CPU): 4.110238 GB/s
2016-10-06 15:23:07: 0: STDOUT: [Redshift] Estimated GPU->CPU latency (0): 0.015370 ms
2016-10-06 15:23:07: 0: STDOUT: [Redshift] Estimated GPU->CPU latency (1): 0.015822 ms
2016-10-06 15:23:07: 0: STDOUT: [Redshift] Estimated GPU->CPU latency (2): 0.017610 ms
2016-10-06 15:23:07: 0: STDOUT: [Redshift] Estimated GPU->CPU latency (3): 0.013428 ms
2016-10-06 15:23:07: 0: STDOUT: [Redshift] New CUDA context created
2016-10-06 15:23:07: 0: STDOUT: [Redshift] Available memory: 5767.4375 MB out of 6077.7500 MB
2016-10-06 15:23:07: 0: STDOUT: [Redshift] Initializing GPUComputing module (CUDA). Ordinal 6
2016-10-06 15:23:07: 0: STDOUT: [Redshift] CUDA Ver: 8000
2016-10-06 15:23:07: 0: STDOUT: [Redshift] Device 7/8 : GeForce GTX 980 Ti

2016-10-06 15:23:07: 0: STDOUT: [Redshift] Compute capability: 5.2
2016-10-06 15:23:07: 0: STDOUT: [Redshift] Num multiprocessors: 22
2016-10-06 15:23:07: 0: STDOUT: [Redshift] PCI busID: 135, deviceID: 0, domainID: 0
2016-10-06 15:23:07: 0: STDOUT: [Redshift] Theoretical memory bandwidth: 336.480011 GB/Sec
2016-10-06 15:23:07: 0: STDOUT: [Redshift] Measured PCIe bandwidth (pinned CPU->GPU): 10.746902 GB/s
2016-10-06 15:23:07: 0: STDOUT: [Redshift] Measured PCIe bandwidth (pinned GPU->CPU): 7.103100 GB/s
2016-10-06 15:23:07: 0: STDOUT: [Redshift] Measured PCIe bandwidth (paged CPU->GPU): 3.847854 GB/s
2016-10-06 15:23:07: 0: STDOUT: [Redshift] Measured PCIe bandwidth (paged GPU->CPU): 1.555237 GB/s
2016-10-06 15:23:07: 0: STDOUT: [Redshift] Estimated GPU->CPU latency (0): 0.134678 ms
2016-10-06 15:23:08: 0: STDOUT: [Redshift] Estimated GPU->CPU latency (1): 0.046141 ms
2016-10-06 15:23:08: 0: STDOUT: [Redshift] Estimated GPU->CPU latency (2): 0.012746 ms
2016-10-06 15:23:08: 0: STDOUT: [Redshift] Estimated GPU->CPU latency (3): 0.013795 ms
2016-10-06 15:23:08: 0: STDOUT: [Redshift] New CUDA context created
2016-10-06 15:23:08: 0: STDOUT: [Redshift] Available memory: 5767.4375 MB out of 6077.7500 MB
2016-10-06 15:23:08: 0: STDOUT: [Redshift] Initializing GPUComputing module (CUDA). Ordinal 7
2016-10-06 15:23:08: 0: STDOUT: [Redshift] CUDA Ver: 8000
2016-10-06 15:23:08: 0: STDOUT: [Redshift] Device 8/8 : GeForce GTX 980 Ti

2016-10-06 15:23:08: 0: STDOUT: [Redshift] Compute capability: 5.2
2016-10-06 15:23:08: 0: STDOUT: [Redshift] Num multiprocessors: 22
2016-10-06 15:23:08: 0: STDOUT: [Redshift] PCI busID: 136, deviceID: 0, domainID: 0
2016-10-06 15:23:08: 0: STDOUT: [Redshift] Theoretical memory bandwidth: 336.480011 GB/Sec
2016-10-06 15:23:08: 0: STDOUT: [Redshift] Measured PCIe bandwidth (pinned CPU->GPU): 9.768453 GB/s
2016-10-06 15:23:08: 0: STDOUT: [Redshift] Measured PCIe bandwidth (pinned GPU->CPU): 4.299793 GB/s
2016-10-06 15:23:08: 0: STDOUT: [Redshift] Measured PCIe bandwidth (paged CPU->GPU): 1.596275 GB/s
2016-10-06 15:23:08: 0: STDOUT: [Redshift] Measured PCIe bandwidth (paged GPU->CPU): 1.582923 GB/s
2016-10-06 15:23:08: 0: STDOUT: [Redshift] Estimated GPU->CPU latency (0): 0.036990 ms
2016-10-06 15:23:08: 0: STDOUT: [Redshift] Estimated GPU->CPU latency (1): 0.010945 ms
2016-10-06 15:23:08: 0: STDOUT: [Redshift] Estimated GPU->CPU latency (2): 0.012484 ms
2016-10-06 15:23:08: 0: STDOUT: [Redshift] Estimated GPU->CPU latency (3): 0.013389 ms
2016-10-06 15:23:08: 0: STDOUT: [Redshift] New CUDA context created
2016-10-06 15:23:08: 0: STDOUT: [Redshift] Available memory: 5871.9375 MB out of 6077.7500 MB
2016-10-06 15:23:13: 0: STDOUT: [Redshift] Loading Redshift procedural extensions…
2016-10-06 15:23:13: 0: STDOUT: [Redshift] Done!
2016-10-06 15:23:13: 0: STDOUT: [Redshift] Redshift for Maya 2016
2016-10-06 15:23:13: 0: STDOUT: [Redshift] Version 2.0.50, Jul 12 2016
2016-10-06 15:23:13: 0: STDOUT: [Redshift] renderable camera = |persp
2016-10-06 15:23:13: 0: STDOUT: [Redshift] Rendering frame 2 (1/1)
2016-10-06 15:23:13: 0: STDOUT: [Redshift] Maya evaluation manager mode: parallel
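
A quick way to check a log like the one above is to pull out the "Ordinals" line and compare it with the GPUs the slave instance was supposed to get. The sketch below assumes that this line lists the devices Redshift will actually render on; the helper names are made up for illustration, and the sample string is copied from the log above.

```python
import re

# Matches e.g. "[Redshift] Ordinals: { 0 1 2 3 4 5 6 7 }"
ORDINALS_RE = re.compile(r"\[Redshift\] Ordinals: \{([\d\s]*)\}")


def redshift_ordinals(log_text):
    """Return the GPU ordinals from the first 'Ordinals' line, or None if absent."""
    match = ORDINALS_RE.search(log_text)
    if match is None:
        return None
    return [int(token) for token in match.group(1).split()]


def affinity_respected(log_text, expected):
    """True only if Redshift reports exactly the expected ordinals."""
    return redshift_ordinals(log_text) == sorted(expected)


sample = "2016-10-06 15:22:57: 0: STDOUT: [Redshift] Ordinals: { 0 1 2 3 4 5 6 7 }"
print(redshift_ordinals(sample))
print(affinity_respected(sample, [0, 1]))
```

For the first slave instance (GPUs 0,1), a log reporting all eight ordinals would fail this check, which matches the symptom described here.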

Edit1:

Deadline Client & Repo Version - 8.1.4.6
http://i66.tinypic.com/2cdzpg9.png

The logs do not contain

Could you save out the full job log report from Deadline Monitor and post it here please?

Hi Mike,

Sorry for the late reply. Here is the output of the full log.

Hi,

Thanks for the log. Can you try this updated MayaBatch.py plugin file for me?

[Back up the existing file first, just in case.]

Unzip the attached file and copy it to this location, overwriting the file of the same name that is already there:

“<your_repo>/plugins/MayaBatch/MayaBatch.py”

Then give it another whirl. (This time the job log should print something about setting GPU affinity.)

MayaBatch.py.zip (24.1 KB)

Hi Mike,

I’ve swapped out the MayaBatch.py file and rebooted, but the GPU Affinity still isn’t working.

Here’s another error log from a slave. It does look like Redshift is outputting more information than in the last log.

pastebin.com/kL6CMmaf

Cheers

Ryan

Something isn’t right here. If the GPU override is enabled and configured as in the first screen-grab you posted, there should have been a message either way, reporting which GPUs are or are not being used.

Did the file copy go OK? That is, did MayaBatch.py end up in the “MayaBatch” directory under the plugins directory of your Deadline Repository?

Please find attached another version, which has many DEBUG print statements in it. If we don’t see at least one of these debug statements when the job next runs, then the Python script is either not getting copied over or not being used for some other reason, which we will need to investigate.

So, same instructions as before:

Unzip the attached file and copy it to this location, overwriting the file of the same name that is already there:

“<your_repo>/plugins/MayaBatch/MayaBatch.py”

MayaBatch.py.zip (24.2 KB)

Hey Mike,

It seems to happen only when I manually submit a job through Deadline. When submitting through Maya using the plugin, the GPU Affinity works without issues. This will work for us, as we don’t normally submit manually through Deadline and were only doing so because we were having issues submitting through Maya.

There were no errors when copying the MayaBatch.py file. I unzipped it and copied it to \LICENSE-SERVER3\Deadline8-Repo\plugins\MayaBatch, which is our repository.

Chances are it is something I’m missing or doing wrong when submitting manually through Deadline.

Here are some screenshots of my manual Submission settings.
i66.tinypic.com/2uzdr35.jpg
i66.tinypic.com/2qbvsk2.png
i67.tinypic.com/29qi1yf.png
i66.tinypic.com/33o5n5i.png

Submission through Maya
pastebin.com/0Xxd3t3E

Manual Submission with Latest MayaBatch.py Script
pastebin.com/kh7dE21r

Ah, I see. I think you just need to select the correct renderer, “Renderer = Redshift”, in the first tab of the Monitor submitter UI; GPU rendering will then be enabled. When a Maya job is submitted to Deadline from outside Maya, we have no way to detect which renderer is being used, so we need the user to tell us which render engine to use. With “Redshift” selected, we can enable the various GPU settings, and we know to respect any GPU affinity override settings that have been configured per Slave.
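
To picture why the renderer choice matters, here is a rough sketch of that decision. It is not the actual MayaBatch.py code: the function name, its parameters, and the `GPU_RENDERERS` set are hypothetical, and only Redshift is listed because it is the only GPU renderer discussed in this thread.

```python
# Assumption: renderers for which a per-Slave GPU override is honoured.
GPU_RENDERERS = {"redshift"}


def gpus_for_job(renderer, slave_gpu_override):
    """Return the GPU ordinals a render should be restricted to,
    or None when no restriction applies (hypothetical helper)."""
    if renderer.strip().lower() not in GPU_RENDERERS:
        # The renderer was not declared as a GPU renderer, so the
        # override is never consulted and every GPU stays visible.
        return None
    if not slave_gpu_override:
        # No override configured for this Slave.
        return None
    return sorted(slave_gpu_override)


print(gpus_for_job("Redshift", [1, 0]))      # renderer selected in the submitter
print(gpus_for_job("mayaSoftware", [0, 1]))  # renderer left at some other value
```

With the renderer left at something other than Redshift, the override is skipped entirely, which would explain a log that reports all eight ordinals.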

Hey Mike,

I had to wait for a render to finish on the system before testing again. After choosing Redshift when submitting, it worked without any issues. I had assumed it would have picked up the information set in Maya, which is why I wasn’t setting it.

Everything seems to be working away happily now!

Cheers for your time and help!

Ryan

Cool! Glad it’s all sorted for you now. Feel free to revert that MayaBatch file, as I’m sure you don’t want my name splattered all over your log reports! :laughing: