We have an odd issue that surfaces from time to time here for us… Our sequential, single-machine jobs sometimes get picked up by two machines at the same time. I initially suspected that one machine was stalling, causing another one to become active, but we now have logs that suggest otherwise.
For example:
lapro1813 dequeuing task 0:
2015-09-04 19:32:04: 0: Got task!
2015-09-04 19:32:04: 0: Plugin will be reloaded because a new job has been loaded, or one of the job files or plugin files has been modified
2015-09-04 19:32:04: Constructor: 3dsmax
2015-09-04 19:32:04: 0: Loaded plugin 3dsmax (\\inferno2\deadline\repository7\plugins\3dsmax)
2015-09-04 19:32:04: 0: Start Job timeout is disabled.
2015-09-04 19:32:04: 0: Task timeout is 12000 seconds (Regular Task Timeout)
2015-09-04 19:32:04: 0: Loaded job: [INDY] FIS_000_0130_v0016_str_Rework_cache_flowline_Cloud_0 (55ea53c802949897e0146966)
... preload script execution, max init, file load activity
2015-09-04 19:35:34: 0: Plugin rendering frame(s): 1000-1004
2015-09-04 19:35:36: 0: INFO: Render Tasks called
2015-09-04 19:35:36: 0: INFO: STARTED
lapro0790 dequeuing task 1 right after:
2015-09-04 19:33:13: 0: Got task!
2015-09-04 19:33:13: 0: Plugin will be reloaded because a new job has been loaded, or one of the job files or plugin files has been modified
2015-09-04 19:33:13: Constructor: 3dsmax
2015-09-04 19:33:13: 0: Loaded plugin 3dsmax (\\inferno2\deadline\repository7\plugins\3dsmax)
2015-09-04 19:33:13: 0: Start Job timeout is disabled.
2015-09-04 19:33:13: 0: Task timeout is 12000 seconds (Regular Task Timeout)
2015-09-04 19:33:13: 0: Loaded job: [INDY] FIS_000_0130_v0016_str_Rework_cache_flowline_Cloud_0 (55ea53c802949897e0146966)
... preload script execution, max init, file load activity
2015-09-04 19:36:36: 0: Plugin rendering frame(s): 1005-1009
2015-09-04 19:36:37: 0: INFO: Render Tasks called
2015-09-04 19:36:37: 0: INFO: STARTED
lapro1813 finishing its task 0:
2015-09-04 19:37:27: 0: Render time for frame(s): 5.393 m
2015-09-04 19:37:27: 0: Total time for task: 5.598 m
2015-09-04 19:37:30: 0: Saving task log...
2015-09-04 19:37:31: Scheduler Thread - Render Thread 0 completed its task
2015-09-04 19:37:31: Scheduler Thread - Seconds before next job scan: 2
2015-09-04 19:37:33: Scheduler - Performing Job scan on Primary Pools with scheduling order Pool, Weighted, Balanced
2015-09-04 19:37:33: Scheduler - Using enhanced scheduler balancing
2015-09-04 19:37:33: Scheduler - Limit Stub for '55ea53c802949897e0146966' (held by 'lapro1813') was removed.
2015-09-04 19:37:33: Scheduler - Preliminary check: Slave is not whitelisted for 55d61247bb649a8bbcaa02d1 limit.
2015-09-04 19:37:33: Scheduler - Preliminary check: Slave is not whitelisted for 55de16bdbb649a2750a438dd limit.
2015-09-04 19:37:33: Scheduler - Returning 55e8e454b25c5ae6a8cc84b7
2015-09-04 19:37:33: Scheduler - Returning 55e8f6a3b25c5a2178faf52f
2015-09-04 19:37:33: Scheduler - Preliminary check: The 55ea2a9310f85222e0820cd7 limit is maxed out.
2015-09-04 19:37:33: Scheduler - Preliminary check: The 55ea2a9710f852c50c58cc58 limit is maxed out.
2015-09-04 19:37:33: Scheduler - Preliminary check: The 55ea2c5f10f852305c31e2be limit is maxed out.
... etc...
lapro1813 dequeues a different job after this.
lapro0790 keeps going as expected:
2015-09-04 19:39:50: 0: Saving task log...
2015-09-04 19:39:51: Scheduler Thread - Render Thread 0 completed its task
2015-09-04 19:39:51: Scheduler Thread - Seconds before next job scan: 2
2015-09-04 19:39:53: Scheduler - Performing Job scan on Primary Pools with scheduling order Pool, Weighted, Balanced
2015-09-04 19:39:53: Scheduler - Using enhanced scheduler balancing
2015-09-04 19:39:53: Scheduler - Successfully dequeued 1 task(s). Returning.
2015-09-04 19:39:53: Scheduler - Returning limit stubs not in use.
2015-09-04 19:39:53: ScanlineEventListener _init__ called, wiping PYTHONPATH from deadlineslave's embedded python
2015-09-04 19:39:53: Enabling render stat mail
2015-09-04 19:39:53: Loading event plugin Scanline (\\inferno2\deadline\repository7\events\Scanline)
2015-09-04 19:39:53: Scheduler Thread - Synchronizing job auxiliary files from \\inferno2\deadline\repository7\jobs\55ea53c802949897e0146966
2015-09-04 19:39:53: Scheduler Thread - All job files are already synchronized
2015-09-04 19:39:53: ScanlineEventListener _init__ called, wiping PYTHONPATH from deadlineslave's embedded python
2015-09-04 19:39:53: Enabling render stat mail
2015-09-04 19:39:53: Loading event plugin Scanline (\\inferno2\deadline\repository7\events\Scanline)
2015-09-04 19:39:56: 0: Got task!
2015-09-04 19:39:56: 0: Plugin already loaded: 3dsmax
2015-09-04 19:39:56: 0: Start Job timeout is disabled.
2015-09-04 19:39:56: 0: Task timeout is 12000 seconds (Regular Task Timeout)
2015-09-04 19:39:56: 0: Job already loaded: [INDY] FIS_000_0130_v0016_str_Rework_cache_flowline_Cloud_0
2015-09-04 19:39:56: 0: Plugin rendering frame(s): 1010-1014
2015-09-04 19:39:57: 0: INFO: Render Tasks called
2015-09-04 19:39:57: 0: INFO: STARTED
Is it possible that the dequeue and the actual limit stub acquisition are not happening at the same time? Could an excessively long preload script delay the stub acquisition?
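To make the question concrete, here is a toy Python sketch of the race I'm imagining. This is not Deadline's actual code and all the names are made up; it just assumes that the task dequeue and the limit-stub acquisition are two separate steps, with our slow preload script / scene load sitting between them:

```python
# Toy model of the suspected race: dequeue first, stub acquisition later,
# with a long "preload" delay in between. Hypothetical names only; this is
# not how Deadline is necessarily implemented.
import threading
import time

lock = threading.Lock()
pending_tasks = ["task0", "task1", "task2"]   # tasks of one machine-limit-1 job
stub_holder = None                            # which slave currently holds the stub
violations = []

def slave(name, preload_seconds):
    global stub_holder
    # Step 1: dequeue a task (no stub check/acquire at this point in the toy model)
    with lock:
        if not pending_tasks:
            return
        task = pending_tasks.pop(0)
    # Step 2: preload script, max init, scene load all take a while
    time.sleep(preload_seconds)
    # Step 3: only now does the slave's stub become visible
    with lock:
        if stub_holder is not None and stub_holder != name:
            violations.append((name, task))   # two machines on a single-machine job
        stub_holder = name

t1 = threading.Thread(target=slave, args=("lapro1813", 2.0))
t2 = threading.Thread(target=slave, args=("lapro0790", 2.0))
t1.start(); t2.start(); t1.join(); t2.join()
print("violations:", violations)  # non-empty -> both slaves dequeued before either stub was visible
```

If the stub were checked and taken atomically as part of the dequeue itself, the second slave's job scan would instead see the limit as maxed out, which is the behaviour we'd expect for these jobs.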