
Redshift GPUs per task bug?

Hi! I’ve noticed one particular issue with setting the GPUs per task.
I have two workstations with two GPUs in each box, running Linux Mint, Houdini 17.5.399 (tried across several subversions), Deadline 10.0.28.2, and Redshift 3.0.8.

Normally I submit each job with 1 GPU per task and 2 concurrent tasks, especially for quick test renders.
But if I go back to the default of 1 concurrent task and 0 GPUs per task, it will only utilize one GPU. In any monitoring software (I use the nvidia-smi command line tool) only the main GPU is rendering, while the secondary sits idle.
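In case it helps anyone reproduce this, here is roughly how I keep an eye on it from Python instead of re-running nvidia-smi by hand (the query fields are standard nvidia-smi options; everything else is just my quick hack):

    import subprocess
    import time

    # Quick-and-dirty watcher: print per-GPU utilization every second using
    # nvidia-smi's CSV query output.
    def watch_gpus(interval=1.0):
        while True:
            out = subprocess.check_output([
                "nvidia-smi",
                "--query-gpu=index,name,utilization.gpu,memory.used",
                "--format=csv,noheader",
            ])
            print(out.decode().strip())
            print("-" * 40)
            time.sleep(interval)

    watch_gpus()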

Any tips?

Hmm. I notice that we only pass the “-gpu” flag when an override is in place. I bet that flag is saving something that persists across runs.

Finding information on that flag is fairly difficult… I need to find out what the “use the default” option is for that flag, and add it to a new “else” block in “[repo]\plugins\Houdini\Houdini.py”:

        gpuList = self.GetGpuOverrides()
        if len( gpuList ) > 0:
            
            gpus = ",".join( gpuList )
            arguments += " -gpu " + gpus

            #... other stuff is here
        else:
            arguments += " -gpu (default or something?)"
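For context, here is how I picture gpuList being filled when “GPUs Per Task” is set. This is only an illustration of the expected per-task assignment, not the shipped GetGpuOverrides code:

    # Illustration only (not the actual GetGpuOverrides code): with concurrent
    # tasks and gpusPerTask GPUs each, each task (thread) should get its own
    # contiguous slice of device indices.
    def expected_gpu_list(thread_number, gpus_per_task):
        start = thread_number * gpus_per_task
        return [str(i) for i in range(start, start + gpus_per_task)]

    print(expected_gpu_list(0, 1))  # ['0'] -> first concurrent task on GPU 0
    print(expected_gpu_list(1, 1))  # ['1'] -> second concurrent task on GPU 1
    print(expected_gpu_list(0, 0))  # []   -> no override, so no "-gpu" flag at all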

Thanks!
Yeah, that could make sense.

I am going to do some more testing myself, but I have a theory (that I need to re-test to confirm).
Like you’re mentioning, I wonder if GPU affinity 0 is the issue, and it sticks to the previous value somehow. In my last round of testing I could set the GPU affinity to 2 (in the Monitor) and it would kick off both cards. Then back to 0, and it still worked.
But from what I remember, going from 1 back to 0 still just uses one card.

I’ll try to test this more myself, hopefully over the weekend, and report back.

Hm… I think I found the reason for the strange behavior. Perhaps…
It seems that when forcing RS to render with one GPU (I guess this is done via Redshift command line arguments?), it alters the preferences.xml in the Redshift folder to only use one GPU.

> <preference name="AllCudaDevices" type="string" value="0:GeForce RTX 2080 Ti,1:GeForce GTX 1070," />
> <preference name="SelectedCudaDevices" type="string" value="1:GeForce GTX 1070," />

So I guess this stays sticky until you actually re-enable both GPUs, which you can do in the Redshift menu in Houdini. It requires a restart, so not ideal at all. That would also explain why I’m seeing both cards working when I set GPU affinity back to 2.
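If anyone wants to check their own nodes quickly, something like this prints the two entries without opening the file by hand (the preferences.xml path is a placeholder, adjust for your install):

    import xml.etree.ElementTree as ET

    # Placeholder path: preferences.xml lives wherever your Redshift install
    # keeps its data, which differs per machine and OS.
    PREFS = "/path/to/redshift/preferences.xml"

    tree = ET.parse(PREFS)
    for pref in tree.getroot().iter("preference"):
        if pref.get("name") in ("AllCudaDevices", "SelectedCudaDevices"):
            print(pref.get("name"), "=", pref.get("value"))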

Good research! So the question then is how we can re-enable all GPUs… I don’t think changing the XML file directly is the best course of action.

Hehe, I don’t know… I’m guessing RS is changing the XML? They suggested leaving the file read-only, but that can/will cause issues with updates and upgrades. Sounds like a pretty bad “solution”…
Does Thinkbox have a line of communication with them where you could work something out?

I noticed this exact same bug on our farm yesterday.

After deleting the preferences.xml on the affected slave (which was set to 1 selected CUDA device for some reason), it was rewritten after submitting a GPU affinity: 0 job and the machine went back to using all of the GPUs.

Even after submitting a GPU affinity: 1 job with multiple concurrent tasks, the preferences.xml didn’t change this time.

If I find the trigger for the problem I’ll let you know.

Thanks for reporting that! Happy to hear it’s not just me, haha.

To me it does look like, if you ever force a job to use any number of GPUs besides 0 or -1, it will update the XML file. Which GPU it chooses seems to be a different question, though. (I’ve always wondered if I could force rendering on the 2nd GPU while leaving my primary/1st GPU idle on my workstation.)

Deleting the preferences is not very good if you have any custom settings there. Now that I know about this bug, I fix it by submitting (or changing) a job to use 2 GPUs per task. I assume you just need to set it to a number equal to or greater than your GPU count. Once a node kicks off a task, it’ll update the preferences.

Besides the rendering/Deadline issue, this is also a problem on a mixed render/workstation node, since these settings carry over into the 3D applications. There have been so many times I’ve been scratching my head as to why RS IPR is only rendering with one card. I thought it was a stuck driver issue, but this explains it.

How does it behave when you set your GPU count to, let’s say, 4 or 6, but some of your slaves only have 2 GPUs? Will they limit themselves?

I haven’t tried that, but I would assume they just render with what’s available.

I’ll take it all back: after it worked for some time, the same bug reappeared and GPU management is not working again.

It’s really frustrating. If I find some time I need to take it to support.

Well, we’re pretty much stuck here too. I’m tapping industry contacts, but nothing’s come up yet.

We found a workaround solution.

If you want to render 1 task per GPU, set 1 GPU per task in your submission parameters.

If you then later want to submit a job that uses all GPUs on a single task, submit it with GPUs per task set to a larger number like 12.

Assuming your machine has fewer than 12 GPUs, the Redshift prefs will be updated to use all available GPUs on the render machine. A machine with 2 GPUs will get set to 2, a machine with 8 GPUs will get set to 8.
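Just to spell out the arithmetic we are relying on (this is our assumption from watching the farm, not anything documented):

    # Assumption based on farm behaviour: asking for more GPUs per task than a
    # node has seems to clamp to the devices actually present, which is why a
    # large value like 12 re-selects every card in preferences.xml.
    def devices_selected(gpus_per_task, gpus_on_node):
        return list(range(min(gpus_per_task, gpus_on_node)))

    print(devices_selected(12, 2))  # [0, 1] -> a 2-GPU node uses both cards
    print(devices_selected(12, 8))  # 0..7   -> an 8-GPU node uses all eight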

It’s a workaround that works. But what it boils down to is:

If GPUs per task = 0, then Redshift will not update the Redshift prefs.

If GPUs per task != 0, then it will update the Redshift prefs.

!! Assuming you have your maximum concurrent tasks configured per slave, this combination is working for us.

Let me know if this helps you.

Thanks a lot, that’s a decent workaround.

After I read in the Redshift 3D forum that using the -gpu argument in the render command will change the preferences.xml file, and saw a statement from one of the Redshift3D guys saying not to use that argument for assigning certain GPUs to tasks but rather to set the REDSHIFT_GPUDEVICES environment variable, we now have a working solution to this problem.

With support I created a custom Houdini.py plugin where I changed 2 lines of code: commenting out the -gpu argument and inserting the REDSHIFT_GPUDEVICES one.

So now it works as expected.

    if len( gpuList ) > 0:
        gpus = ",".join( gpuList )
        # Don't pass "-gpu" on the command line, since that is what rewrites
        # preferences.xml and makes the device selection sticky.
        #arguments.append( "-gpu %s" % gpus )
        # Tell Redshift which devices to use via the environment instead.
        self.SetEnvironmentVariable( 'REDSHIFT_GPUDEVICES', gpus )
        self.LogInfo( 'mmr: REDSHIFT_GPUDEVICES=' + gpus )
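If you want to sanity-check the environment variable route outside of Deadline, a manual test can be as simple as the sketch below. The hython path and render script are placeholders; the value format just mirrors what the plugin sets, i.e. comma-separated device indices:

    import os
    import subprocess

    # Manual test outside Deadline: pin Redshift to device 0 only, then launch
    # the same render the plugin would run. Paths are placeholders.
    env = dict(os.environ)
    env["REDSHIFT_GPUDEVICES"] = "0"

    subprocess.run(["hython", "/path/to/render_script.py"], env=env)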