
Worker Stalling - <Worker Name> is not in allow list for limit <Job ID>

Hello! This happens often and, conveniently, seems to happen overnight for us. We have two repositories running at the moment, and it happens with both. I haven’t seen any consistency in which workers it happens with; usually it’s just a nuisance, and once we requeue the task it works. From browsing related forum posts I understand the underlying issue, but not how to fix it: a limit is incorrectly held for a worker, which leaves both the worker and the job’s task in limbo until the task gets requeued, thus releasing the (hidden) limit.

The only two things to go off of are this warning, which pops up occasionally in the log, and the log itself, which just repeats endlessly:

2023-05-16 07:01:09:  WARNING: Stub for Limit '646386a1bef2af0decbabbf5' with holder name 'c05b01n02' could not be found.
2023-05-16 07:30:24:  Scheduler - Previously-acquired limits: [Deadline.LimitGroups.LimitGroupStub]
2023-05-16 07:30:24:  Scheduler - Preliminary check: 'C05B01N02' is not in allow list for limit '644ab61ebef2af0decba5675'
2023-05-16 07:30:24:  Scheduler - Preliminary check: 'C05B01N02' is not in allow list for limit '64593a54bef2af0decba8e10'
2023-05-16 07:30:24:  Scheduler - Preliminary check: 'C05B01N02' is not in allow list for limit '645ab7d9bef2af0decba9b52'
2023-05-16 07:30:24:  Scheduler - Preliminary check: 'C05B01N02' is not in allow list for limit '645ae02dbef2af0decba9ee3'
2023-05-16 07:30:24:  Scheduler - Job '6463025fbef2af0decbabbc3' is not eligible because it has no queued Tasks.
2023-05-16 07:30:24:  Scheduler - Job scan acquired 0 tasks in 0s after evaluating 5 different jobs.
2023-05-16 07:30:24:  Scheduler - Limits held after Job scan: [6463025fbef2af0decbabbc3]
2023-05-16 07:30:24:  Scheduler - DequeueTasks found no jobs - the tuple queue has no more jobs to check.
2023-05-16 07:30:24:  Scheduler Thread - Seconds before next job scan: 54
2023-05-16 07:30:46:  1: SandboxedPlugin still waiting for SandboxThread to exit
2023-05-16 07:31:16:  1: SandboxedPlugin still waiting for SandboxThread to exit
2023-05-16 07:31:19:  Scheduler - Previously-acquired limits: [Deadline.LimitGroups.LimitGroupStub]
2023-05-16 07:31:19:  Scheduler - Preliminary check: 'C05B01N02' is not in allow list for limit '644ab61ebef2af0decba5675'
2023-05-16 07:31:19:  Scheduler - Preliminary check: 'C05B01N02' is not in allow list for limit '64593a54bef2af0decba8e10'
2023-05-16 07:31:19:  Scheduler - Preliminary check: 'C05B01N02' is not in allow list for limit '645ab7d9bef2af0decba9b52'
2023-05-16 07:31:19:  Scheduler - Preliminary check: 'C05B01N02' is not in allow list for limit '645ae02dbef2af0decba9ee3'
2023-05-16 07:31:19:  Scheduler - Job '6463025fbef2af0decbabbc3' is not eligible because it has no queued Tasks.
2023-05-16 07:31:19:  Scheduler - Job scan acquired 0 tasks in 0s after evaluating 5 different jobs.
2023-05-16 07:31:19:  Scheduler - Limits held after Job scan: [6463025fbef2af0decbabbc3]
2023-05-16 07:31:19:  Scheduler - DequeueTasks found no jobs - the tuple queue has no more jobs to check.
2023-05-16 07:31:19:  Scheduler Thread - Seconds before next job scan: 51
2023-05-16 07:31:46:  1: SandboxedPlugin still waiting for SandboxThread to exit
2023-05-16 07:32:10:  Scheduler - Previously-acquired limits: [Deadline.LimitGroups.LimitGroupStub]
2023-05-16 07:32:10:  Scheduler - Preliminary check: 'C05B01N02' is not in allow list for limit '644ab61ebef2af0decba5675'
2023-05-16 07:32:10:  Scheduler - Preliminary check: 'C05B01N02' is not in allow list for limit '64593a54bef2af0decba8e10'
2023-05-16 07:32:10:  Scheduler - Preliminary check: 'C05B01N02' is not in allow list for limit '645ab7d9bef2af0decba9b52'
2023-05-16 07:32:10:  Scheduler - Preliminary check: 'C05B01N02' is not in allow list for limit '645ae02dbef2af0decba9ee3'
2023-05-16 07:32:10:  Scheduler - Job '6463025fbef2af0decbabbc3' is not eligible because it has no queued Tasks.
2023-05-16 07:32:10:  Scheduler - Job scan acquired 0 tasks in 0s after evaluating 5 different jobs.
2023-05-16 07:32:10:  Scheduler - Limits held after Job scan: [6463025fbef2af0decbabbc3]
2023-05-16 07:32:10:  Scheduler - DequeueTasks found no jobs - the tuple queue has no more jobs to check.
2023-05-16 07:32:10:  Scheduler Thread - Seconds before next job scan: 48
2023-05-16 07:32:16:  1: SandboxedPlugin still waiting for SandboxThread to exit
2023-05-16 07:32:46:  1: SandboxedPlugin still waiting for SandboxThread to exit
2023-05-16 07:32:58:  Scheduler - Previously-acquired limits: [Deadline.LimitGroups.LimitGroupStub]
2023-05-16 07:32:58:  Scheduler - Preliminary check: 'C05B01N02' is not in allow list for limit '644ab61ebef2af0decba5675'
2023-05-16 07:32:58:  Scheduler - Preliminary check: 'C05B01N02' is not in allow list for limit '64593a54bef2af0decba8e10'
2023-05-16 07:32:58:  Scheduler - Preliminary check: 'C05B01N02' is not in allow list for limit '645ab7d9bef2af0decba9b52'
2023-05-16 07:32:58:  Scheduler - Preliminary check: 'C05B01N02' is not in allow list for limit '645ae02dbef2af0decba9ee3'
2023-05-16 07:32:58:  Scheduler - Job '6463025fbef2af0decbabbc3' is not eligible because it has no queued Tasks.
2023-05-16 07:32:58:  Scheduler - Job scan acquired 0 tasks in 0s after evaluating 5 different jobs.
2023-05-16 07:32:58:  Scheduler - Limits held after Job scan: [6463025fbef2af0decbabbc3]
2023-05-16 07:32:58:  Scheduler - DequeueTasks found no jobs - the tuple queue has no more jobs to check.
2023-05-16 07:32:58:  Scheduler Thread - Seconds before next job scan: 58
2023-05-16 07:33:16:  1: SandboxedPlugin still waiting for SandboxThread to exit
2023-05-16 07:33:46:  1: SandboxedPlugin still waiting for SandboxThread to exit
2023-05-16 07:33:56:  Scheduler - Previously-acquired limits: [Deadline.LimitGroups.LimitGroupStub]
2023-05-16 07:33:56:  Scheduler - Preliminary check: 'C05B01N02' is not in allow list for limit '644ab61ebef2af0decba5675'
2023-05-16 07:33:56:  Scheduler - Preliminary check: 'C05B01N02' is not in allow list for limit '64593a54bef2af0decba8e10'
2023-05-16 07:33:56:  Scheduler - Preliminary check: 'C05B01N02' is not in allow list for limit '645ab7d9bef2af0decba9b52'
2023-05-16 07:33:56:  Scheduler - Preliminary check: 'C05B01N02' is not in allow list for limit '645ae02dbef2af0decba9ee3'
2023-05-16 07:33:56:  Scheduler - Job '6463025fbef2af0decbabbc3' is not eligible because it has no queued Tasks.
2023-05-16 07:33:56:  Scheduler - Job scan acquired 0 tasks in 0s after evaluating 5 different jobs.
2023-05-16 07:33:56:  Scheduler - Limits held after Job scan: [6463025fbef2af0decbabbc3]
2023-05-16 07:33:56:  Scheduler - DequeueTasks found no jobs - the tuple queue has no more jobs to check.
2023-05-16 07:33:56:  Scheduler Thread - Seconds before next job scan: 47

It looks like that Worker’s scheduler is checking the job-specific machine limit, and the Worker is not in the allow list under Job Options > Modify Job Properties > Machine Limit. This usually adds an Allowlist flag to the job info file listing all of the machines allowed to pick up the job, and any Worker that is not in the allow list will fail the preliminary checks in the scheduler thread.
You can check for the Allowlist flag in the job’s submission params under Job Options > Modify Job Properties > Submission Params. If you find that flag, we would need to make sure the machine limits are not being added to the job, either from the submitter UI or from manual job submission steps. You can also monitor changes to the job’s settings under Monitor > Job Options > View Job History.
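If it’s quicker than clicking through the UI, a rough Script Console snippet like the one below should print the relevant fields. The property names (JobMachineLimit, JobWhitelistFlag, JobListedSlaves, JobLimitGroups) are my recollection of the Scripting API, so please verify them against your Deadline version’s documentation:

```python
# Rough sketch for the Deadline Monitor's Script Console.
# Property names below are from memory of the Scripting API docs -- verify them
# against your Deadline version before relying on this.
from Deadline.Scripting import RepositoryUtils

job_id = "6463025fbef2af0decbabbc3"         # job ID taken from the Worker log above
job = RepositoryUtils.GetJob(job_id, True)  # True = skip the cached copy

print("Machine limit:   ", job.JobMachineLimit)    # 0 should mean no machine limit
print("Allow list flag: ", job.JobWhitelistFlag)   # True = listed machines form an allow list
print("Listed machines: ", list(job.JobListedSlaves))
print("Limit groups:    ", list(job.JobLimitGroups))
```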


Is there a way to prevent the worker from getting stuck in that cycle? Here’s a screenshot to provide more context:


I’m showing how the task went on for 10.5 hours just for that one worker, but note that the worker completed the next job just fine; it simply needed a kick out of the loop. I’m also showing that there’s no machine limit on the job, and you can take my word that we don’t use the “Allow List” submission param.

And here is a look at the job history:

I think it might help if I go ahead and set up a generous task timeout though.
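In case it’s useful to anyone else, this is roughly what I have in mind for the Script Console. JobTaskTimeoutSeconds is an assumption on my part from the Scripting API docs, so verify the property name for your version:

```python
# Sketch: set a generous per-task timeout on the stuck job from the Monitor's
# Script Console. JobTaskTimeoutSeconds is assumed from the Scripting API docs.
from Deadline.Scripting import RepositoryUtils

job = RepositoryUtils.GetJob("6463025fbef2af0decbabbc3", True)
job.JobTaskTimeoutSeconds = 12 * 60 * 60   # 12 hours -- well beyond any normal render
RepositoryUtils.SaveJob(job)               # write the change back to the repository
```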

@karpreet do you have any other suggestions? This happens often and we’re just left requeuing them to knock the Scheduler out of that loop. There’s nothing unique about the jobs that have this issue and we never do anything to modify their Machine Limits directly.
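For reference, the “kick” we apply by hand is equivalent to something like this in the Script Console. RepositoryUtils.RequeueJob is the call I believe the Scripting API exposes for this; in practice we just right-click the single stuck task in the Monitor and requeue it, which is gentler:

```python
# Sketch of the manual workaround: requeue the job the Worker is stuck on,
# which releases the stranded limit stub. RepositoryUtils.RequeueJob is the
# call I believe the Scripting API provides -- verify before use.
from Deadline.Scripting import RepositoryUtils

stuck_job_id = "6463025fbef2af0decbabbc3"   # the job the Worker keeps cycling on
job = RepositoryUtils.GetJob(stuck_job_id, True)
RepositoryUtils.RequeueJob(job)             # heavy-handed; requeuing only the stuck task via the Monitor is gentler
```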

Still having this issue, unfortunately. We had all threads crash on a job and then the worker stalled on the scheduling thread. Looking closer at the log, it seems to get into a loop where it holds and releases limits for the same job (just different tasks). It’s also hellbent on checking the same job for eligibility, even though it has no queued tasks. So perhaps this is a “rare” occurrence where it gets stuck trying to pick up tasks from the same job after they’ve been picked up by another machine?
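Next time it happens I plan to confirm that state from the Script Console while the Worker is still spinning, with something along these lines. The JobQueuedTasks / JobRenderingTasks / JobCompletedTasks property names are assumptions from the Scripting API docs, so treat this as a sketch:

```python
# Sketch: confirm the job the Worker keeps re-checking really has nothing queued.
# Property names (JobQueuedTasks, JobRenderingTasks, JobCompletedTasks) are
# assumptions from the Scripting API docs -- verify against your version.
from Deadline.Scripting import RepositoryUtils

job = RepositoryUtils.GetJob("6463025fbef2af0decbabbc3", True)
print("Queued tasks:    ", job.JobQueuedTasks)     # expected to be 0 while the Worker loops
print("Rendering tasks: ", job.JobRenderingTasks)  # another Worker may already be on them
print("Completed tasks: ", job.JobCompletedTasks)
```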
