Hey there, thanks for the detailed explanation,
I’m currently in contact with support via mail about the GPU affinity handling for Redshift3D. Deadline’s Houdini.py plugin currently uses the -gpu argument to specify the GPU for a task, but according to the Redshift staff this shouldn’t be used: it alters Redshift’s “preferences.xml” file to use a single GPU (which interferes with other instances as well), and on top of that, when a job that should use all GPUs (GPUs per Task = 0) is submitted, the preferences.xml won’t get changed back to use all GPUs but will stay on the single GPU from the last task it rendered.
So to prevent that from happening, the REDSHIFT_GPUDEVICES environment variable should be used instead to handle the GPU affinity of the submitted tasks.
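For reference, my rough understanding of the env-var approach in code (just a sketch, not the actual patched file; GpuAffinity() and SetProcessEnvironmentVariable() are existing DeadlinePlugin methods, the helper name is my own):

```python
# Sketch: export the device list through the render process environment
# instead of passing "-gpu", so Redshift's preferences.xml is never touched.
def SetRedshiftGpuEnvironment( self ):
    gpuList = list( self.GpuAffinity() )  # e.g. [0, 1]
    deviceString = ",".join( str( gpu ) for gpu in gpuList )
    self.LogInfo( "Setting REDSHIFT_GPUDEVICES to: " + deviceString )
    self.SetProcessEnvironmentVariable( "REDSHIFT_GPUDEVICES", deviceString )
```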
So far so good: support sent me a patched Houdini.py file which does exactly this (after I removed a minor formatting error in the device list, which was being joined twice with commas).
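For clarity, the formatting error was of this kind (reconstructed from memory, not the exact patched lines):

```python
deviceString = ",".join( str( gpu ) for gpu in resultGPUs )  # "0,1" -- correct
# ...but the already-joined string was then joined a second time,
# which separates every single character with a comma:
deviceString = ",".join( deviceString )  # "0,,,1" -- broken
```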
But now I’ve stumbled upon a strange behaviour, and I really need help understanding why it happens and how to fix it.
Basically, if “Override GPU affinity” is turned on in the worker configuration, the GPU affinity handling for tasks that should use a single GPU doesn’t work.
For a worker with 2 GPUs and “Override GPU affinity” set to 2 devices (0, 1), a job submitted with Concurrent Tasks = 1 and GPUs per Task = 1 will start 2 tasks, but both of them render on GPU device 0, so the render slows down; in the worst case it crashes because 2 tasks are rendering on the same GPU at the same time.
When “Override GPU affinity” is unchecked, the device allocation works just fine: first task on device 0, second task on device 1.
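For comparison, this is the per-task device selection I would have expected with the override enabled (an assumption on my part, based on how other Deadline plugins slice the affinity list; GetThreadNumber() should give the concurrent task’s thread index):

```python
# Assumed slicing, inside the plugin's render setup: each concurrent task
# takes its own slice of the worker's affinity list, offset by its thread number.
gpus = list( self.GpuAffinity() )                 # e.g. [0, 1]
gpusPerTask = 1                                   # "GPUs per Task" from the job
offset = self.GetThreadNumber() * gpusPerTask     # task 0 -> 0, task 1 -> 1
taskGpus = gpus[ offset : offset + gpusPerTask ]  # [0] for one task, [1] for the other
self.SetProcessEnvironmentVariable( "REDSHIFT_GPUDEVICES", ",".join( str( g ) for g in taskGpus ) )
```

The fact that both tasks end up on device 0 makes me suspect the thread offset isn’t being applied somewhere along this path.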
But leaving “Override GPU affinity” inactive for the workers brings two problems:
First, when the Houdini.py plugin checks for available GPUs, the list variable resultGPUs stays empty, so REDSHIFT_GPUDEVICES gets set to an empty string instead of the full list of GPUs. And I don’t see a way to query the worker’s available GPUs, because the DeadlinePlugin method GpuAffinity() returns nothing if “Override GPU affinity” is not set for the worker.
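The workaround I’ve been considering for this first issue (only a sketch of my idea, not what the patched file currently does) is to skip the variable entirely when the affinity list comes back empty, so Redshift falls back to its own “use all devices” default:

```python
gpus = list( self.GpuAffinity() )
if not gpus:
    # "Override GPU affinity" is off, so GpuAffinity() returns nothing.
    # Instead of exporting an empty REDSHIFT_GPUDEVICES, leave it unset
    # so Redshift uses all devices by default.
    self.LogWarning( "No GPU affinity override; leaving REDSHIFT_GPUDEVICES unset" )
else:
    self.SetProcessEnvironmentVariable( "REDSHIFT_GPUDEVICES", ",".join( str( g ) for g in gpus ) )
```

But that still wouldn’t let me split the worker’s GPUs across single-GPU tasks, which is the part I actually need.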
The second issue is that with “Override GPU affinity” inactive, we have to prevent the workers from being assigned more tasks than their GPU count, so we use “Override Concurrent Task Limit” on the workers to make sure there can’t be more concurrent tasks than available GPUs (this has to be set manually). But should a CPU job run on the machine, this of course limits those tasks as well, which is not intended.
I would really, really appreciate help with this, because not being able to switch smoothly between single-GPU tasks and all-GPU tasks is affecting our rendering throughput, and we have a crucial project that would benefit from having this fixed.
I’m a 3D generalist with an affinity for Python, so it may well be that I’ve misunderstood some concepts, but I think I can handle in-depth answers on this topic.
Attached you’ll find the worker log screenshots for the described cases (PS: I added some warnings for debugging purposes):
thanks a lot,
all the best, Martin