Maya Vray GPU Affinity

Tobias_Rosli · September 30, 2019, 10:43am

Hello,

I downloaded Deadline a few days ago. My goal is to build a small farm with two nodes and use it for GPU rendering with Maya and V-Ray. I found that switching on GPU Affinity does not work, because it multiplies the amount of VRAM the scene uses by the number of graphics cards. So if the scene needs 2.5 GB of VRAM, it needs 10 GB with Affinity. Is that true on your side, or is it a bug?

I use:

Deadline Client Version: 10.0.28.2 Release (31a4a2e50)
Vray for Maya Next Update 1
Maya 2018 SP6

Cheers
Tobias

kavi · November 22, 2019, 3:54pm

Hey @Tobias_Rosli

I think this is because Deadline’s Maya plugin doesn’t yet support GPU affinity for V-Ray.

Tobias_Rosli · November 22, 2019, 4:15pm

Hi Kavi,

Thanks for the info. I spoke with someone on the Support Team who told me the same, and also that he opened an internal request to implement this feature. I hope it comes soon, as it would increase my render power. That was already a month ago…

Tobias_Rosli · October 17, 2020, 10:35am

Hello, I thought I might ask after a year: how far along are you with implementing GPU affinity for V-Ray and Maya?

Cheers!

Tobias_Rosli · October 22, 2020, 10:34am

Hi, thanks for sharing the current situation. Hope that you’ll find some good solutions for the problems you’re facing right now.

Cheers!

Bobo · October 23, 2020, 3:22am

Ok, here is my WIP implementation of V-Ray GPU Affinity.
I have done only rudimentary testing, so please give it a try and let me know if anything does not behave as expected.

It uses the same basic code and UI as the Redshift implementation, with just a bit of V-Ray specific code to set the VRAY_GPU_PLATFORMS environment variable to the correct GPU indices.

VRayGPUAffinity_WIP_20201022.zip (117.1 KB)

The ZIP file contains 3 updated script files you need to deploy as follows:

MayaBatch.py goes into the Repository/Plugins/MayaBatch/ folder - this is the integration plugin
MayaSubmission.py goes into the Repository/Scripts/Submission/ folder - this is the Monitor Submitter.
SubmitMayaToDeadline.mel goes into the Repository/Submission/Maya/Main/ folder - this is the integrated Maya submitter

Please BACK UP the original versions of the files before replacing them!

To test, I submitted a V-Ray GPU scene from Maya using the integrated submitter set to frames 1 to 8, 4 Concurrent Tasks, 1 GPU Per Task. I ran this on AWS using a g4dn.12xlarge instance, which has 4 x T4 GPUs. The result was 4 Tasks rendered in parallel, each using only 1 GPU.

I repeated the test with different combinations of Concurrent Tasks and GPUs per Task using both the Monitor Submitter and the Integrated Submitter.

I also submitted a job with Selected GPUs, entering the indices by hand, e.g. “0,2” to render on the first and third GPUs with 1 Concurrent Task. As expected, the Task rendered on only the two specified GPUs.

Unfortunately, V-Ray indexes the devices in the log in consecutive order, so 0,2 reports as Device 0 and Device 1. But the correct physical GPUs end up rendering, so it seems to be working as expected.

Note that if you try to render more concurrent tasks than there are GPUs, some tasks will use the GPUs Per Task value, and the excess ones will render on all GPUs. For example, 6 Concurrent Tasks, 1 GPU Per Task will render Tasks 0,1,2,3 on GPUs 0,1,2,3, while Tasks 4 and 5 will both render on all 4 GPUs. This is As Designed.
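
The assignment rule described above can be sketched in a few lines of Python. This is only an illustration of the behavior as described in this post, not the code from MayaBatch.py; the function name `gpus_for_task` and the `total_gpus` parameter are hypothetical, and treating a partially available slice as "all GPUs" is an assumption.

```python
def gpus_for_task(thread_index, gpus_per_task, total_gpus):
    """Return the GPU indices a concurrent task would render on.

    Hypothetical helper mirroring the behavior described in the post:
    each task gets its own consecutive slice of GPUs, and a task whose
    slice would run past the last GPU falls back to all GPUs
    (the fallback for a partial slice is an assumption).
    """
    first = thread_index * gpus_per_task
    last = first + gpus_per_task  # exclusive upper bound
    if last > total_gpus:
        # Excess task: no dedicated slice is left, so it uses every GPU.
        return list(range(total_gpus))
    return list(range(first, last))


# 6 Concurrent Tasks, 1 GPU Per Task, 4 GPUs in the machine
for task in range(6):
    print(task, gpus_for_task(task, 1, 4))
```

With these inputs, tasks 0 to 3 each get one GPU, while tasks 4 and 5 get the full list, matching the example above.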

Tobias_Rosli · October 23, 2020, 11:44am

Wow, cool, I’ll test it in the next 2-3 days and give you feedback.

Cheers!

Tobias_Rosli · October 26, 2020, 3:38pm

Hi, I tried to use it but ran into some problems and questions.

This is the value of the VRAY_GPU_PLATFORMS env variable:
nvidia cuda geforce rtx 2080 ti gpu index0;nvidia cuda geforce rtx 2080 ti gpu index1;

These are the submitter settings:

The log from Deadline is in the attachment: deadline_log.zip (6.3 KB)

Somehow, both GPUs are still initialized for one job.
Am I missing something? Or am I supposed to create one Worker per GPU and use the GPU affinity override?

I replaced the files like you said. I also restarted the computer.

These are very strange lines in the log:
2020-10-26 16:20:57: 0: STDOUT: [2020/Oct/26|16:20:57] V-Ray: Device[0]: GeForce RTX 2080 Ti (WDDM mode) has compute capability 7.5. PCI Bus ID: 0000:0A:00.0
2020-10-26 16:20:57: 1: STDOUT: [2020/Oct/26|16:20:57] V-Ray: Device[0]: GeForce RTX 2080 Ti (WDDM mode) has compute capability 7.5. PCI Bus ID: 0000:42:00.0
2020-10-26 16:20:57: 0: STDOUT: [2020/Oct/26|16:20:57] V-Ray: Device[1]: GeForce RTX 2080 Ti (WDDM mode) has compute capability 7.5. PCI Bus ID: 0000:42:00.0

It detected GPU 0 twice and GPU 1 once.

Cheers

Bobo · October 26, 2020, 11:38pm

This appears to be a Worker log. I need to see the TASK log.
In this log, the lines are prefixed with the index 0: or 1:, which shows that two threads were running:

This is thread 0 (first task):

2020-10-26 16:20:57:  0: STDOUT: [2020/Oct/26|16:20:57] V-Ray: Device[0]: GeForce RTX 2080 Ti (WDDM mode) has compute capability 7.5. PCI Bus ID: 0000:0A:00.0
2020-10-26 16:20:57:  0: STDOUT: [2020/Oct/26|16:20:57] V-Ray: Device[1]: GeForce RTX 2080 Ti (WDDM mode) has compute capability 7.5. PCI Bus ID: 0000:42:00.0

This is thread 1 (second task):

2020-10-26 16:20:57:  1: STDOUT: [2020/Oct/26|16:20:57] V-Ray: Device[0]: GeForce RTX 2080 Ti (WDDM mode) has compute capability 7.5. PCI Bus ID: 0000:42:00.0

So for some reason one task is rendering on one GPU, the other on both.
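
Sorting the lines by thread index by hand gets tedious with longer logs. Below is a minimal Python sketch that groups the V-Ray device lines by the thread prefix. The regular expression is an assumption based only on the three lines quoted in this thread, and `devices_by_thread` is a hypothetical helper name, not part of Deadline.

```python
import re
from collections import defaultdict

# Matches Worker log lines of the shape quoted in this thread:
# "<date> <time>:  <thread>: STDOUT: [...] V-Ray: Device[<n>]: ... PCI Bus ID: <id>"
DEVICE_LINE = re.compile(
    r"^\S+ \S+:\s+(?P<thread>\d+): STDOUT: .*"
    r"V-Ray: Device\[(?P<device>\d+)\]: .*PCI Bus ID: (?P<bus>\S+)"
)


def devices_by_thread(lines):
    """Group (device index, PCI bus ID) pairs by Deadline thread index."""
    result = defaultdict(list)
    for line in lines:
        match = DEVICE_LINE.search(line)
        if match:
            result[int(match.group("thread"))].append(
                (int(match.group("device")), match.group("bus"))
            )
    return dict(result)


log = [
    "2020-10-26 16:20:57: 0: STDOUT: [2020/Oct/26|16:20:57] V-Ray: Device[0]: GeForce RTX 2080 Ti (WDDM mode) has compute capability 7.5. PCI Bus ID: 0000:0A:00.0",
    "2020-10-26 16:20:57: 1: STDOUT: [2020/Oct/26|16:20:57] V-Ray: Device[0]: GeForce RTX 2080 Ti (WDDM mode) has compute capability 7.5. PCI Bus ID: 0000:42:00.0",
    "2020-10-26 16:20:57: 0: STDOUT: [2020/Oct/26|16:20:57] V-Ray: Device[1]: GeForce RTX 2080 Ti (WDDM mode) has compute capability 7.5. PCI Bus ID: 0000:42:00.0",
]
grouped = devices_by_thread(log)
print(grouped)
```

The PCI Bus ID is the useful part here: V-Ray numbers its devices from 0 within each process, so the bus ID is what identifies the physical card.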

The most important thing missing is the log line where the Environment variables are being set. There is no sign of any environment variables being set.

Please post the individual logs of the two tasks processed at the same time by the same Worker. The environment variables are being set before the MayaBatch process is even started, and your log does not show that, as it happened before the place you copied from.

Here is an example of what my task log looks like:

2020-10-23 02:49:20: 0: INFO: Rendering with Maya Version 2018.0
2020-10-23 02:49:20: 0: INFO: Setting VRAY_GPU_PLATFORMS environment variable to 0 for this session
2020-10-23 02:49:20: 0: INFO: Setting Process Environment Variable VRAY_GPU_PLATFORMS to 0

Note that this overrides the actual ENV variable you can see on your OS level - even if you see all those GPU IDs listed in the environment, we set a temporary value for the process being launched (MayaBatch) to obscure the system-wide settings only while Deadline is rendering.
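
The idea of a per-process override can be shown with plain Python. This sketch is generic and does not use the Deadline API: it copies the current environment, changes VRAY_GPU_PLATFORMS only in the copy, and hands that copy to a child process. The child is a one-line Python script standing in for MayaBatch, and `run_with_gpu_platforms` is a hypothetical helper name.

```python
import os
import subprocess
import sys


def run_with_gpu_platforms(value):
    """Launch a child process that sees VRAY_GPU_PLATFORMS=value.

    Only the copy of the environment handed to the child is modified;
    the environment of the calling process stays as it was.
    """
    child_env = dict(os.environ)
    child_env["VRAY_GPU_PLATFORMS"] = value
    completed = subprocess.run(
        [sys.executable, "-c",
         "import os; print(os.environ['VRAY_GPU_PLATFORMS'])"],
        env=child_env,
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout.strip()


before = os.environ.get("VRAY_GPU_PLATFORMS")
seen_by_child = run_with_gpu_platforms("0")
after = os.environ.get("VRAY_GPU_PLATFORMS")
print(seen_by_child, before == after)
```

The child reports the overridden value, while the value visible to the parent (and at the OS level) is unchanged.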

So all you would have to do is set the GPU per Task to 1, set the Concurrent Tasks to 2, and render. It sounds like you did that, but your results were surprising. So let’s look at the two individual Task logs to see what is being reported there…

(Select a Task, right-click, View Task Reports…, click a render log to open, then save to disk using the first icon in the toolbar).

Repeat for the next task.

Tobias_Rosli · October 27, 2020, 10:28am

Hi, attached you will find the task logs for 3 frames.

I rendered 3 frames. After that I let the others fail. The strange thing is that when I did that, another log entry appeared for frame 2, even though that frame had already finished rendering.
You’ll find it in the zip.

I am running Deadline 10.1.9.2. Is that a problem?

cheers!

tasklog.zip (27.9 KB)

Bobo · October 27, 2020, 7:28pm

Thanks for the logs. Something is indeed not working in your environment.

I can see the env. variable being set to 0 in the log of Frame 1 rendered on thread 0. It still renders on both devices, which is wrong.
The log of Frame 2 being rendered on thread 1 shows a CUDA Error and a crash. The env. variable is being properly set to 1. We don’t know if it would have rendered on one or two GPUs.
The log of Frame 3 rendered on thread 1 shows the env. variable set to 1, and it renders on one GPU as it should. Since the scene did not previously load and render Frame 2 properly due to the CUDA crash, MayaBatch was reloaded and this time it worked right.
The log of Frame 2 being re-rendered at the same time on thread 0 shows no env. variables being set, and it uses 2 GPUs. This is likely because MayaBatch was already launched on thread 0 to render Frame 1, and it rendered incorrectly on 2 GPUs there, so the incorrect behavior from the first log persisted.

Questions:

Does the crash always occur?
If not, can you send me logs from 4 frames rendered with 1 GPU per Task, 2 Concurrent Tasks that do not contain a crash? I want to know what the behavior would be if frames 1,2 and 3,4 were rendered together.

I retested on my machine with 1 GPU Per Task, 4 Concurrent Tasks, and all frames rendered on 1 GPU each as expected. Frames 1,2,3,4 were rendered on the same Worker together and had the Env. variable set to 0,1,2 and 3 respectively. The frames 5,6,7,8 were rendered in another go, and the logs don’t show the Env. variable being set, because Maya stays loaded in memory and just moves to a different frame.

In the Job Properties, there is an option to reload the plugin between tasks. This would reload Maya and start everything from scratch, including the loading of the scene, and setting the env. variables of the MayaBatch process. This of course makes the rendering a bit slower.

I decided to test this option too, and I had the env. variable print in the log of all frames, including the later ones. I would love to know what happens on your system if you check the “Reload Plugin Between Tasks” in the Job Properties > General section.

Tobias_Rosli · October 29, 2020, 11:21am

Hi,

No, normally there’s no crash. I guess it happened because I let the remaining jobs fail.

I have done all the tests you requested.
I also included another node from the farm, which also has 2x 2080 Ti. NV-Link is disabled on both nodes. Node_001 is the one from which I already sent you the other logs. Node_002 is the new one, with almost identical hardware; only the CPU and mainboard are different. At first I had 4x 2080 Ti in node_002, but I removed them because Nvidia disabled support for 2x2 paired NV-Link with the RTX 2080 Ti.

It’s strange that node_002 has 4 GPUs in the env variable, even though I deleted the value and set it again with the “Select devices for V-Ray GPU rendering” tool from V-Ray. I also checked that the env variable lists two GPUs. I don’t know where it gets this value from, but maybe that is the problem?

Cheers!
Tobias

logs_gpu_affinity.zip (109.8 KB)

Bobo · October 30, 2020, 4:38pm

I think I found my error.
Can you give this one a try?
MayaBatch.zip (31.4 KB)

Just replace the one in \\DeadlineRepository10\plugins\MayaBatch

Here is what happened:
V-Ray seems to perform simple pattern matching on the GPU ID string. For example, you can see that the environment variable on your Node0 was originally set to

nvidia cuda geforce rtx 2080 ti gpu index0;nvidia cuda geforce rtx 2080 ti gpu index1;

The correct way to pattern match this string to get the first GPU only would be to set the environment variable of the MayaBatch process to “index0”. This would match the first GPU, and not match the second GPU.

However, in my infinite laziness I was setting the environment variable to just “0”, which of course would match the “0” in both “2080” strings. When setting it to “1” it was working, because there is no other “1” anywhere in the second GPU ID except for the “index1” part. For that reason, your first task would match both GPUs and render on two devices, while the second task would match only the second GPU and render on one device according to the log!

Since my own machine was running T4 GPUs, there was no “0” anywhere else in the string except for the “index0”:

nvidia cuda tesla t4 gpu index0;nvidia cuda tesla t4 gpu index1;nvidia cuda tesla t4 gpu index2;nvidia cuda tesla t4 gpu index3;

So it worked properly for me as it was matching the “0” only in the “index0” GPU, and nowhere else.
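
The difference between the two machines can be reproduced with a short Python sketch. It assumes plain substring matching, which is what the behavior in this thread suggests; V-Ray's real matching logic may differ, and `matched_gpus` is a hypothetical helper name.

```python
def matched_gpus(platforms, pattern):
    """Return the device entries whose ID string contains `pattern`.

    Assumes simple substring matching against each ';'-separated entry.
    """
    devices = [entry for entry in platforms.split(";") if entry]
    return [entry for entry in devices if pattern in entry]


RTX = ("nvidia cuda geforce rtx 2080 ti gpu index0;"
       "nvidia cuda geforce rtx 2080 ti gpu index1;")
T4 = ("nvidia cuda tesla t4 gpu index0;nvidia cuda tesla t4 gpu index1;"
      "nvidia cuda tesla t4 gpu index2;nvidia cuda tesla t4 gpu index3;")

print(len(matched_gpus(RTX, "0")))       # "0" also occurs in "2080"
print(len(matched_gpus(RTX, "1")))       # "1" occurs only in "index1"
print(len(matched_gpus(RTX, "index0")))  # the unambiguous pattern
print(len(matched_gpus(T4, "0")))        # no other "0" in the T4 IDs
```

Under this assumption, “0” selects both RTX 2080 Ti entries but only one T4 entry, while “index0” selects exactly one device on either machine.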

Sorry for wasting your time, I hope the new version will work properly.

Tobias_Rosli · October 30, 2020, 8:08pm

Hi,

I did some quick tests and it seems to work.
I’ll do some more testing soon and send you feedback if I find something.

No problem, it wasn’t a waste of time. I’m glad that we can now use concurrent tasks with GPU rendering.

Thanks a lot!
Tobias

Antonio_Milo · January 24, 2021, 12:09pm

Hi everyone, I’m in the same situation: I use Maya with V-Ray GPU and want to use GPU Affinity. I followed the advice from @Bobo, and while it works on the first two machines, I get errors on the other 3 render nodes: they show ‘waiting to start’, then just skip the task and give me the error below:

Error: TypeError : cannot concatenate 'str' and 'NoneType' objects (Python.Runtime.PythonException)

I have attached the worker log. How do I fix this? Thanks in advance for your help!

Job_2021-01-24_12-01-31_600d61ba1536b69888f7c142.zip (1.7 KB)

Tobias_Rosli · January 24, 2021, 2:43pm

Hi @Antonio_Milo

Have you installed the latest Deadline version (10.1.12.1)?
I saw that it is now officially supported.

Cheers
Tobias

Antonio_Milo · January 24, 2021, 2:50pm

hi @Tobias_Rosli!
Ah great, thanks for the tip! I’m on version 10.1.10.6 and will download the latest version now.

best,
Antonio

Bobo · January 24, 2021, 5:56pm

I had an error that snuck into that code - if the machine in question does not have the env. var. already set, the attempt to print what its value is fails because the return value is not a string.

The simplest fix, which will hopefully be in 10.1.13 but is not out yet, is to locate the file MayaBatch.py in your Repository/plugins/MayaBatch/ folder, go to line 268 (or thereabouts), and comment it out by placing a # in front of it.

The offending line looks like

self.LogInfo("Initial System Environment VRAY_GPU_PLATFORMS:"+Environment.GetEnvironmentVariable( "VRAY_GPU_PLATFORMS" ))

Change it to

# self.LogInfo("Initial System Environment VRAY_GPU_PLATFORMS:"+Environment.GetEnvironmentVariable( "VRAY_GPU_PLATFORMS" ))
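
For anyone who wants to keep the log message instead of removing it, the general pattern is to guard against the missing value before concatenating. The following is a minimal sketch in plain Python, with `os.environ` standing in for the `Environment.GetEnvironmentVariable` call in the plugin; `describe_initial_value` is a hypothetical helper, and this is not necessarily the fix that ships.

```python
import os


def describe_initial_value(name, environ=None):
    """Build the log message without failing when the variable is unset."""
    environ = os.environ if environ is None else environ
    value = environ.get(name)  # None when the variable is not set
    if value is None:
        # Concatenating str + None raises TypeError, so substitute a label.
        value = "<not set>"
    return "Initial System Environment " + name + ": " + value


print(describe_initial_value("VRAY_GPU_PLATFORMS", {}))
print(describe_initial_value("VRAY_GPU_PLATFORMS",
                             {"VRAY_GPU_PLATFORMS": "index0"}))
```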

Sorry for the inconvenience!

Antonio_Milo · January 24, 2021, 7:38pm

Thanks so much @Bobo That fixed the issue I was having!

I’m also having another problem with submitting jobs from my workstation, seems like permission, but I’ve opened everything up - I will create another thread for that.