This is easily reproducible behavior in our setup: when you assign a group to a job, it can take up to 10 minutes for the change to actually be reflected in which Workers pick it up.
Example: a job is crashing 64GB nodes, so I limit it to 128GB machines by assigning it to our 128-ram-group. I check "Find Render Candidates" and everything looks good: the 64GB machines are shown as unavailable for the job because of the group. But for the next 5-10 minutes I still have to watch the job and requeue tasks that 64GB nodes keep picking up.
Has anyone else had this experience, or is there a setting I need to check? I could do some reading up on how jobs are balanced/dequeued, but this seems like a bug.
Hello! The "Find Render Candidates" button runs the dequeue logic immediately, but it doesn't reflect the cache the Workers are working from. If you suspend the job → change the group → resume the job, the 5-10 minute wait should be circumvented.
The Workers assume that the group is unlikely to change during the life of a job, so it's checked very rarely. Hence the long wait for updates!
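If you find yourself doing this often, the suspend → change group → resume sequence can be scripted. Below is a minimal sketch using deadlinecommand; the subcommand names (SuspendJob, SetJobSetting, ResumeJob) and the placeholder job ID are assumptions you should verify against the documentation for your Deadline version.

```python
import subprocess

# Assumes deadlinecommand is on PATH; adjust to its full install path if not.
DEADLINE_CMD = "deadlinecommand"


def move_job_to_group(job_id: str, group: str) -> None:
    """Suspend a job, change its group, then resume it so Workers pick up
    the new group right away instead of waiting on their cached copy.

    Subcommand names are assumptions -- check your Deadline docs.
    """
    subprocess.run([DEADLINE_CMD, "SuspendJob", job_id], check=True)
    subprocess.run([DEADLINE_CMD, "SetJobSetting", job_id, "Group", group], check=True)
    subprocess.run([DEADLINE_CMD, "ResumeJob", job_id], check=True)


if __name__ == "__main__":
    # Hypothetical example: push a crashing job onto the 128GB group.
    move_job_to_group("your-job-id-here", "128-ram-group")
```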
Ohhhhhh, I see. For context, we're never sure whether a job will need 64/128/256GB until it starts failing, which is when we change the group. Another workaround we've used in emergencies is manually machine-limiting the job to the Workers in the correct RAM group; I'm assuming that, unlike the group, the machine limit isn't cached by the Workers.
(Thanks as always Justin_B!!!)