Deadline Workers Not Using All Logical Processors

dplace · March 21, 2023, 9:19pm

Hi there,

I’m running into an issue where some of the Deadline workers in our farm are not using all of their available CPU logical processors, causing major performance degradation on those machines.

When I start a local render on one of these workers, all logical processors activate as expected and the render time matches that of machines where this issue isn’t happening.

When a worker picks up the same task it will only use the last 8 logical processors according to task manager / resource monitor.

I’ve tried setting CPU affinity to all on, all off, etc… from both the monitor, and from the launcher on the worker machine.

I’ve tried reinstalling the client, repository, and even gone as far as doing a clean install of Windows on the worker I’m testing with, but am still experiencing the same results.

Here are the relevant specs:

Worker:

Intel i9-12900K
2x NVIDIA GeForce RTX 3080 Ti
Windows 10 Pro 21H2, build 19044.2728
Deadline Client Version: 10.1.23.6 Release (773a6289d)
FranticX Client Version: 2.4.0.0 Release (14916bbc6)

Repo:

Repository Version: 10.1.23.6 (773a6289d)
Integration Version: 10.1.23.6 (773a6289d)
3PL Settings Version: 22/07/2022

DCC / DL Plugin:

Blender 3.3.1 LTS
Cycles / GPU Compute

Please let me know if you need more information.
Any insight would be immensely helpful and greatly appreciated.

Thanks,
Darren

eamsler · March 22, 2023, 1:58pm

Hey Darren, the CPU affinity in Deadline is using some older Windows APIs. If it’s enabled, Deadline usually clamps to just the first 64 cores of the machine. The big.LITTLE architecture stuff seems to be different though and has newer issues.

The best thing to do at the moment is keep it disabled. If that doesn’t work, we’ll have a challenge here, but I have some weird ideas about how to work around it.

Is there a reason you’re trying to use CPU affinity (assuming it’s not broken when disabled)?

dplace · March 22, 2023, 2:31pm

Hi Edwin,
Thanks for your response!

The reason I had enabled CPU affinity in the first place is because we were experiencing the same behavior with it off. I thought maybe overriding the default settings could get things in a working state. Unfortunately, we saw the same results even with a brand new repo, db, and Windows install on the worker node, before the GPU Affinity dialog was ever opened.

eamsler · March 22, 2023, 2:43pm

That’s strange… We have a code path that shouldn’t touch anything. If you copy and paste the “Full Command” from the log into a command prompt, is it still clamped?

The crazy idea is to make a batch file that calls start to reset the affinity and have Deadline use that instead of Blender directly. It’s also an alternate way to configure environments. Some customers did that for this problem when Autodesk was being helpful if I recall correctly: (not quite relevant, just linking the topics here)

dplace · March 22, 2023, 2:55pm

Thanks for this!

Pasting the full command into CMD works! It’s so great to see all the cores working again

So you’re suggesting we wrap the full command in a batch script that calls Blender with start to overcome the issue?

zainali · March 22, 2023, 7:11pm

Hello @dplace

I think @eamsler was trying to isolate the issue from Deadline. I see it renders fine outside of Deadline via CMD, i.e. it uses all the CPU cores. This is really weird; Deadline should be able to use all the cores when there is no CPU affinity applied.

Have you applied OS-level affinity to the Worker application? In Task Manager, if you right-click the Worker > Set affinity, is any affinity set there?
Are you running more than one Worker on the machine in question?

To work around it as @eamsler has suggested, you will need to make a batch script that calls start to reset the affinity and also runs Blender. Then you need to give this script’s path in the plugin configuration: Monitor > Tools > Configure Plugins > Blender > Blender Executable > put the path to the batch script at the top and test.
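For anyone following along, here is a rough sketch of what such a wrapper could look like, written as a small Python helper that generates the batch file text. It assumes the Windows `start` built-in with its `/affinity <hex>`, `/wait` and `/b` switches; the Blender path and the function name `build_wrapper_bat` are hypothetical placeholders, and whether the exit code reaches Deadline intact should be checked on a real Worker.

```python
def build_wrapper_bat(blender_exe: str, affinity_hex: str) -> str:
    """Return the text of a .bat wrapper that relaunches Blender via `start`.

    blender_exe  -- full path to blender.exe (hypothetical, adjust per machine)
    affinity_hex -- affinity mask as hex digits without a 0x prefix, e.g. "FFFFFF"
    """
    lines = [
        "@echo off",
        "rem %* forwards every argument the caller passed to this wrapper.",
        "rem /wait blocks until Blender exits; /b reuses the current console.",
        f'start "" /wait /b /affinity {affinity_hex} "{blender_exe}" %*',
        "rem Assumption: ERRORLEVEL now holds Blender's exit code.",
        "exit /b %ERRORLEVEL%",
    ]
    # Batch files conventionally use CRLF line endings.
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    # Hypothetical install path, used only for illustration.
    exe = r"C:\Program Files\Blender Foundation\Blender 3.3\blender.exe"
    print(build_wrapper_bat(exe, "FFFFFF"))
```

The generated text would be saved as a `.bat` file and its path placed at the top of the Blender Executable list described above.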

dplace · March 22, 2023, 9:44pm

Hey @zainali

No OS level affinity is applied.
Only one worker is being run on the machine, though once this issue is resolved, the goal is to run two workers to take advantage of the dual GPUs in each worker.

I’m going to do one last clean install of the OS on both the worker and the test repo machine to ensure nothing has been misconfigured on our end.

I’ll follow up when that’s complete. If that doesn’t work, we can certainly explore implementing @eamsler’s solution.

Will let you know the results of the reinstall.

dplace · March 23, 2023, 6:39pm

Hey @zainali and @eamsler,

I was finally able to uncover the root cause of this…

I implemented the batch script as suggested, and set the affinity hexadecimal value to what I needed for all 24 logical processors to be active, only to find the exact same result (only the last 8 logical processors would show activity in task manager)!
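For reference, the hexadecimal value is just a bitmask with one bit per logical processor (bit 0 = LP 0). A minimal sketch of the arithmetic, with hypothetical helper names:

```python
def affinity_mask(logical_processors) -> int:
    """Build an affinity bitmask with one bit set per logical processor index."""
    mask = 0
    for index in logical_processors:
        if index < 0:
            raise ValueError("logical processor index must be non-negative")
        mask |= 1 << index
    return mask


def to_hex(mask: int) -> str:
    """Format a mask as upper-case hex digits without a 0x prefix."""
    return format(mask, "X")


if __name__ == "__main__":
    print(to_hex(affinity_mask(range(24))))      # all 24 LPs -> FFFFFF
    print(to_hex(affinity_mask(range(16, 24))))  # only LPs 16-23 -> FF0000
```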

I can’t confirm this, but I believe that the work of 24 LPs was being put on those last 8, causing massive scheduling collisions, hence the slowness…

I believe that the issue comes down to the way Windows 10 handles the architecture of the Intel 12900K. The math in the CPU affinity functionality doesn’t play nice with the “P cores” and “E cores” paradigm…

The solution was to disable the E cores in the BIOS; after that, CPU affinity worked as expected.
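To illustrate what I mean, here is a small sketch that decodes a mask into P-core and E-core logical processors. The 12900K has 8 P cores with two threads each plus 8 E cores with one thread each, for 24 LPs. The ordering used here (P-core threads as LPs 0-15, E cores as LPs 16-23) is an assumption about how Windows enumerates them, not something I verified, and the names are made up for the example.

```python
P_THREADS = 16  # 8 P cores x 2 threads, ASSUMED to be enumerated first (LPs 0-15)
E_THREADS = 8   # 8 E cores x 1 thread, ASSUMED to follow (LPs 16-23)


def describe_mask(mask: int) -> dict:
    """Split the logical processors selected by `mask` into P and E groups."""
    total = P_THREADS + E_THREADS
    selected = [i for i in range(total) if (mask >> i) & 1]
    return {
        "p_core_lps": [i for i in selected if i < P_THREADS],
        "e_core_lps": [i for i in selected if i >= P_THREADS],
    }


if __name__ == "__main__":
    # Under the assumed layout, a mask covering only the last 8 LPs
    # selects no P-core threads at all.
    print(describe_mask(0xFF0000))
```

If that layout holds, the "last 8" LPs I saw active would be exactly the E cores.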

Losing those extra LPs isn’t a huge issue for us as we’re rendering on the GPU anyway, but I thought I’d share my findings as I’m not sure if this is a Windows API issue or a Deadline issue.

Let me know if you’d like any other info to investigate.

Thanks again for your assistance!
Darren

zainali · March 23, 2023, 7:03pm

Thanks for testing this. I believe it is a Windows 10 issue. It has been reported before and was resolved by updating the OS to Windows 11; please take a look at this forum post: