
GPU affinity affecting the job after

To describe the issue with Redshift and GPU affinity:

There are 4 GPUs in the Slave.

I render a job with 2 concurrent tasks and 2 GPUs per frame.
After that, if I run a job with 1 concurrent task and GPU affinity 0, it runs 1 concurrent task but again uses only 2 GPUs.

Also, if you then run an application like Maya (or whatever), only 2 GPUs are now selected for use in the System tab.
After the job with 2 concurrent tasks and 2 GPUs per frame is done, it should actually “reset” the Slave so it uses all GPUs again. At least, that is what I thought using 0 as the GPU affinity when submitting means?

Does my explanation make any sense? Do you see the issue? :slight_smile:

GPU affinity isn’t set at the OS level, so we have to do interesting things on a per-app plugin basis. The code that controls GPUs for C4D, for example, is not the same code that controls them for Maya. I wonder if we need to control it with a file… I’ll have to ask.

Questions:

  1. Is this a problem in Maya?
  2. Does this happen all the time, or only when the Slave is forcibly closed / job is cancelled?

My guess here is that the cleanup code might be missed in some circumstance. Maybe if the affinity is zero we’re skipping over the cleanup.

This is mostly Maya, but I think it will affect at least Softimage, 3ds Max, and Houdini too. I’m not really familiar with Cinema 4D.

What happens outside of Deadline is this: if you start the program, set it up to use 2 GPUs, and then restart the program, the next session will be set up to use those two GPUs. If you want to change back to all 4 GPUs, you have to check all GPUs in Redshift’s System tab and restart the application again for it to pick up the settings. So this is all outside of Deadline; it’s just how it works with all of these apps.

What seems to happen, then, is that when you submit job 1 with 2 concurrent tasks and 2 GPUs per task, the job finishes but all of those apps stay set up to use only 2 GPUs.
Now if I send another job, job 2, with 1 concurrent task and 0 as the GPU affinity (i.e. use all GPUs, without telling it directly to use 4), Deadline doesn’t seem to force those apps to use all GPUs; they use what was set up by the latest submission, which was 2 GPUs.
At least that is what seems to be happening to me, and I think that if, for job 2, I used 1 concurrent task and a GPU affinity of 4 in the submitter, that would correct things.

I will test that out as soon as I have some extra time. But if that is the case, then I would assume the most straightforward fix, if possible, would be for Deadline to force the use of all GPUs whenever 0 is used as the GPU affinity in the submitter, assuming it can figure out the number of GPUs on each render node.
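
Just to make the idea concrete, here is a rough Python sketch of the mapping I have in mind. This is purely illustrative and not actual Deadline code; the function and the way the GPU count would be discovered per node are made up.

```python
# Illustrative only: what "GPU affinity 0 means use everything" could look like.
# How Deadline would actually count the GPUs on each render node is an open
# question here; gpu_count is just passed in.
def effective_gpus(gpus_per_task, task_index, gpu_count):
    if gpus_per_task == 0:
        # 0 in the submitter: explicitly select every device on the node,
        # instead of inheriting whatever the previous job configured.
        return list(range(gpu_count))
    # Otherwise give each concurrent task its own slice of devices.
    start = task_index * gpus_per_task
    return list(range(start, min(start + gpus_per_task, gpu_count)))

# Job 1 on a 4-GPU node: 2 concurrent tasks, 2 GPUs per task.
print(effective_gpus(2, 0, 4))  # [0, 1]
print(effective_gpus(2, 1, 4))  # [2, 3]
# Job 2: 1 concurrent task, GPU affinity 0 -> force all four devices again.
print(effective_gpus(0, 0, 4))  # [0, 1, 2, 3]
```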

Does it make any sense? It is probably a bit of a messy explanation of what I think is happening, but if it is too messy please let me know and I will try again :slight_smile:

The biggest issue right now: if a user submits job 1 with a GPU affinity of 2 in the submitter, and after that there is job 2 submitted with an affinity of 0, expecting to use all GPUs in the node, it will still stick with and use only 2.

Hey mirkoj,

A few info-gathering questions:

  • What Deadline version are you on?
  • What Maya version are you on?
  • Are you using the MayaCmd or MayaBatch plugin?
  • What Redshift version are you on?
  • Can you pass us some logs showcasing this behaviour? I’m interested in the lines indicating which GPU affinity it’ll try to use. If you do not see a line containing “redshiftSelectCudaDevices”, then you’ll need to enable the MayaBatch plugin option “Log Script Contents To Render Log”.

In MayaBatch, for each task that a Slave runs, we run a command to select the GPUs for Redshift to use, even though MayaBatch stays open between tasks. It won’t, however, stay open between jobs.

In MayaCmd, each task restarts Maya and we pass in a command-line flag to Redshift to select the GPUs to use; this should be easy to see in the logs.
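
To make the difference between the two plugins concrete, here’s a rough Python sketch of the two approaches. It’s illustrative only, not the plugin source: “redshiftSelectCudaDevices” is the proc name to look for in a MayaBatch log, but its exact argument syntax and the MayaCmd flag name/format below are assumptions.

```python
# Illustrative sketch, not the actual plugin code.
def mayabatch_select_gpus(gpu_ordinals):
    # MayaBatch style: Maya stays open between tasks, so before each task a MEL
    # snippet like this would be executed in the running session.
    # (Argument syntax is assumed for illustration.)
    devices = ", ".join(str(g) for g in gpu_ordinals)
    return "redshiftSelectCudaDevices({%s});" % devices

def mayacmd_render_args(gpu_ordinals):
    # MayaCmd style: Maya restarts for every task, so the selection goes on the
    # command line instead. (The "-gpu" flag name/format is assumed here.)
    args = ["-r", "redshift"]
    for g in gpu_ordinals:
        args += ["-gpu", str(g)]
    return args

print(mayabatch_select_gpus([0, 1]))  # redshiftSelectCudaDevices({0, 1});
print(mayacmd_render_args([2, 3]))    # ['-r', 'redshift', '-gpu', '2', '-gpu', '3']
```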

There really shouldn’t be anything “carrying over” from a previous Job. Hopefully this gives us a place to start for debugging the issue you’re experiencing.

Cheers

We’re having a similar issue when switching between Houdini/Redshift and Cinema/Redshift jobs.
Houdini renders fine with 2 tasks and 2 GPUs each, but when a C4D scene starts up afterwards with the same settings, 2 of the 4 GPUs are disabled, it pegs the 2 active GPUs at 100% load, and then the Slave hangs.
I can go into Redshift’s preferences.xml and see that it’s set globally to use only 2 of the 4 GPUs (and the same is true in the Redshift prefs inside Cinema).
We’re on Deadline 10.0.10.4 with Redshift 2.5.62.
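
For reference, this is roughly how I’m checking the prefs: a quick Python sketch, where the default path and the exact preference/attribute names are assumptions about my install rather than anything official.

```python
# Quick sketch: dump any entry in Redshift's global preferences.xml that looks
# GPU/device related. Path and preference names are assumptions; adjust to
# wherever your install keeps the file.
import xml.etree.ElementTree as ET

PREFS_PATH = r"C:\ProgramData\Redshift\preferences.xml"  # assumed default location

tree = ET.parse(PREFS_PATH)
for elem in tree.getroot().iter():
    blob = elem.tag + " " + " ".join("%s=%s" % item for item in elem.attrib.items())
    if any(key in blob for key in ("Cuda", "Device", "GPU")):
        print(blob)
```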

@donberg, are you able to get some job reports when this happens? I suppose the first one that modifies the GPU setting, then the follow-up? You can cut out any file paths, I think, but we’re looking for those settings. The preferences.xml is a pretty good smoking gun though…

I wonder if there’s a way to tell Redshift not to remember the settings we’ve passed it. If not, we’ll probably have to just change the plugins to always pass some kind of affinity, which may take some time to implement.

There are definitely known incompatibilities that arise when the same GPU is being used by different tasks, so the Slave stalling sounds about right to me.

Now, why those GPUs would be set in preferences.xml is what’s strange to me. I would assume that setting command-line arguments or process environment variables would not affect this file. Also, how are you going about choosing the GPUs (i.e. what’s the setup like on the machine: multiple Slaves with GPUs per task? Overriding GPU affinity?)? If you’re using two Slaves, without overriding GPU affinity, and choosing “2 GPUs per task”, then I could see them trying to use the same GPUs.
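
Here’s a quick sketch of the overlap I mean; this is just an illustration of the numbering, not the exact plugin logic:

```python
# Illustration only: without a per-Slave GPU affinity override, each Slave
# numbers its own tasks from zero, so two Slaves on the same 4-GPU machine can
# both hand "task 0" the same two devices.
def gpus_for_task(task_index, gpus_per_task):
    start = task_index * gpus_per_task
    return list(range(start, start + gpus_per_task))

slave_a = gpus_for_task(0, 2)  # [0, 1]
slave_b = gpus_for_task(0, 2)  # [0, 1]  <- same devices, hence the contention
print(slave_a, slave_b)

# With the GPU affinity override set per Slave, the sets can be kept disjoint:
print(set([0, 1]) & set([2, 3]))  # set() -> no overlap
```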

Cheers

Sorry guys, I wandered off for a bit there; I’ll turn on notifications for this thread.

OK, so we’ve got our main Slaves working fine now: 2 tasks with 2 GPUs each, set in the submitter.
The issue was having the GPU affinity override set per Slave; I’ve had to disable that (in the local Slave controls) to get them to work properly.

The Redshift prefs issue does still come up with our workstations, however. Those machines have 3 GPUs, so while I want my render nodes (with 4) to use 2 GPUs per task, I want the workstations to use all 3. If I set the GPU affinity to 2 in the submitter, it changes the Redshift prefs on the workstations and disables the 3rd GPU. I’ve tried setting the Slave GPU affinity override to use all 3, but that doesn’t seem to work.
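
To spell out what I mean, here’s my mental model of how the submit setting and the per-Slave override might combine on a 3-GPU workstation. It’s only a Python sketch of my assumption, not how Deadline necessarily resolves the two settings, but it matches what I’m seeing:

```python
# My assumed model only, not Deadline's actual resolution logic.
def expected_gpus(job_gpus_per_task, task_index, slave_override):
    start = task_index * job_gpus_per_task
    wanted = set(range(start, start + job_gpus_per_task))
    # Job's slice of devices, limited to what the Slave override allows.
    return sorted(wanted & set(slave_override))

workstation_override = [0, 1, 2]  # all three GPUs enabled in the override
# Job submitted with 2 GPUs per task, single task on the workstation:
print(expected_gpus(2, 0, workstation_override))  # [0, 1] -> the 3rd GPU sits idle
```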

Any suggestions on how to set that up? I’ve got the workstations limited to 1 task; maybe that’s the issue?

Hmm. You should definitely try the two tasks situation…

I have to say, the juggling of Redshift GPUs is quite tricky! I’ve heard that performance drops off slightly after two, so maybe having the workstations operate with a 2+1 GPU split would be worthwhile on the whole?
