AWS Thinkbox Discussion Forums

[D8] machine limit = 1 not respected

This happens from time to time and causes corruption in our simulations.


Partial log from lapro0962:

2017-10-22 09:34:00:  Scheduler - Performing Job scan on Primary Pools with scheduling order Pool, Weighted, Balanced
2017-10-22 09:34:00:  Scheduler - Using enhanced scheduler balancing
2017-10-22 09:34:00:  Scheduler - Slave has already completed a task for maintenance job 59ea1d9084266eadf829b398, skipping this job.
2017-10-22 09:34:00:  Scheduler - Returning 59e10ec34f9a2d5e9c7547d3
2017-10-22 09:34:00:  Scheduler - Returning 59e91b97a008043a901c49f0
2017-10-22 09:34:01:  Scheduler - Returning 59ea4b2a4f9a2d143823c712
2017-10-22 09:34:01:  Scheduler - Returning 59ea4b2d4f9a2d58dc001113
2017-10-22 09:34:01:  Scheduler - Returning 59ea4b2e4f9a2ddc1c53277f
2017-10-22 09:34:01:  Scheduler - Returning 59ea4b344f9a2d6b70ae556d
2017-10-22 09:34:01:  Scheduler - Returning 59eb16b3ac69e20fbcc874ec
2017-10-22 09:34:01:  Scheduler - Returning 59eb1bdbac69e21c6cf9b12c
2017-10-22 09:34:01:  Scheduler - Returning 59ebb5034f9a2d09d0353f55
2017-10-22 09:34:01:  Scheduler - Returning 59e2573c4f9a2dbd5034e457
2017-10-22 09:34:01:  Scheduler - Returning 59ea4b374f9a2d4a74160d04
2017-10-22 09:34:01:  Scheduler - Returning 59ea4b3a4f9a2db8c0ceead2
2017-10-22 09:34:01:  Scheduler - Returning 59ea4b414f9a2dbb5036589e
2017-10-22 09:34:01:  Scheduler - Returning 59ea6ea86e82719ce08ef60e
2017-10-22 09:34:01:  Scheduler - Preliminary check: The 59ea4b304f9a2d9b0835deeb limit is maxed out.
2017-10-22 09:34:01:  Scheduler - Returning 59ea4b3e4f9a2d5168af648f
2017-10-22 09:34:01:  Scheduler - Returning 59e9e1f21dbf0317e0d9c97d
2017-10-22 09:34:01:  Scheduler - Returning 59ec0b5c71355a5ab4ef52c6
2017-10-22 09:34:01:  Scheduler - Preliminary check: The 59ec9385d1a83a23b4e1980a limit is maxed out.
2017-10-22 09:34:01:  Scheduler - Preliminary check: The 59ec937dd1a83a05d49ba466 limit is maxed out.
2017-10-22 09:34:01:  Scheduler - Preliminary check: The 59eb92a984414b962ca35b54 limit is maxed out.
2017-10-22 09:34:01:  Scheduler - Preliminary check: The 59e8001d72f5ac3c8c08a219 limit is maxed out.
2017-10-22 09:34:01:  Scheduler - Preliminary check: The 59ec53145a034f0cb87b057f limit is maxed out.
2017-10-22 09:34:01:  Scheduler - Preliminary check: The 59ec53295a034f3c4040ec31 limit is maxed out.
2017-10-22 09:34:01:  Scheduler - Preliminary check: The 59ec53195a034f7568701e8f limit is maxed out.
2017-10-22 09:34:01:  Scheduler - Slave has been marked bad for job 59ec53235a034fc9685620f4, skipping this job.
2017-10-22 09:34:01:  Scheduler - Preliminary check: The 59e7fff972f5ac350c269620 limit is maxed out.
2017-10-22 09:34:01:  Scheduler - Preliminary check: The 59ebe8fed1a83a3e84c6c16d limit is maxed out.
2017-10-22 09:34:01:  Scheduler - Preliminary check: The 59eb95a384414bd22c32d5d2 limit is maxed out.
2017-10-22 09:34:01:  Scheduler - Preliminary check: The 59ec53245a034fe5500f4a66 limit is maxed out.
2017-10-22 09:34:01:  Scheduler - Preliminary check: The 59ec530f5a034f1f40b3a674 limit is maxed out.
2017-10-22 09:34:01:  Scheduler - Preliminary check: The 59ec531a5a034f17a02d77bc limit is maxed out.
2017-10-22 09:34:01:  Scheduler - Preliminary check: The 59ec531e5a034fe0c44e5a9c limit is maxed out.
2017-10-22 09:34:01:  Scheduler - Preliminary check: The 59ec53385a034f2688f6ad16 limit is maxed out.
2017-10-22 09:34:01:  Scheduler - Preliminary check: The 59eb95a784414b15f43b7f13 limit is maxed out.
2017-10-22 09:34:01:  Scheduler - Preliminary check: The 59eb92ad84414b5d6099574d limit is maxed out.
2017-10-22 09:34:01:  Scheduler - Preliminary check: The 59ec532e5a034f55288b5b34 limit is maxed out.
2017-10-22 09:34:01:  Scheduler - Preliminary check: The 59eb7e1e84414b6f38e26dd4 limit is maxed out.
2017-10-22 09:34:01:  Scheduler - Preliminary check: The 59eb7e2284414bb4b4d00a58 limit is maxed out.
2017-10-22 09:34:01:  Scheduler - Performing Job scan on Secondary Pools with scheduling order Pool, Weighted, Balanced
2017-10-22 09:34:01:  Scheduler - Using enhanced scheduler balancing
2017-10-22 09:34:01:  Scheduler - Returning 59e10ec34f9a2d5e9c7547d3
2017-10-22 09:34:01:  Scheduler - Returning 59e91b97a008043a901c49f0
2017-10-22 09:34:01:  Scheduler - Returning 59ea4b2a4f9a2d143823c712
2017-10-22 09:34:01:  Scheduler - Returning 59ea4b2d4f9a2d58dc001113
2017-10-22 09:34:01:  Scheduler - Returning 59ea4b2e4f9a2ddc1c53277f
2017-10-22 09:34:01:  Scheduler - Preliminary check: The 59ea4b344f9a2d6b70ae556d limit is maxed out.
2017-10-22 09:34:01:  Scheduler - Returning 59eb16b3ac69e20fbcc874ec
2017-10-22 09:34:01:  Scheduler - Returning 59eb1bdbac69e21c6cf9b12c
2017-10-22 09:34:01:  Scheduler - Returning 59ebb5034f9a2d09d0353f55
2017-10-22 09:34:01:  Scheduler - Successfully dequeued 1 task(s).  Returning.
2017-10-22 09:34:01:  Scheduler - Returning limit stubs not in use.
2017-10-22 09:34:01:  Scheduler -   returning mx_render
2017-10-22 09:34:01:  Scheduler -   returning mx_fastest_machines
2017-10-22 09:34:01:  Scheduler -   returning mx_tpcache
2017-10-22 09:34:01:  Scheduler -   returning mx_28cores
2017-10-22 09:34:01:  Scheduler -   returning mx_flowlinesim
2017-10-22 09:34:01:  Scheduler -   returning mx_fastmachines
2017-10-22 09:34:02:  Scanline_NukeCacheCleanupEventListener.OnCall sSlaveName LAPRO0962
2017-10-22 09:34:02:  Calling cleanupCache
2017-10-22 09:34:02:  cleanupCache called, Local Cache Root: S:/nc/
2017-10-22 09:34:02:  Calling cleanupSaveCache
2017-10-22 09:34:02:  cleanupSaveCache called
2017-10-22 09:34:02:  Info file name: S:/ns/_info
2017-10-22 09:34:02:  Info file found
2017-10-22 09:34:02:  Skip any size checks as long as cygwin is missing on VMs.
2017-10-22 09:34:02:  removing cache with commands: [u'del /f/s/q "S:\\ns" > nul', u'rmdir /s/q "S:\\ns"']
2017-10-22 09:34:02:  Running: del /f/s/q "S:\ns" > nul
2017-10-22 09:34:02:  Running: rmdir /s/q "S:\ns"
2017-10-22 09:34:02:  cmd completed with exit status 0
2017-10-22 09:34:02:  Scanline: OnSlaveStartingJob sSlaveName LAPRO0962
2017-10-22 09:34:02:  Scanline: OnSlaveStartingJob sJobID 59ecc872029499cadcc33c26
2017-10-22 09:34:02:  Scanline: RemoveJobFolderContents event called LAPRO0962
2017-10-22 09:34:02:  Scanline: cleaning sJobsDataFolder: C:\Users\scanlinevfx\AppData\Local\Thinkbox\Deadline8\slave\LAPRO0962\jobsData
2017-10-22 09:34:02:  Scanline: removing : C:\Users\scanlinevfx\AppData\Local\Thinkbox\Deadline8\slave\LAPRO0962\jobsData\59ecc289c588582600c8d805
2017-10-22 09:34:02:  Scanline: RenderTasks sFileNameForDataIni : C:\ProgramData\Thinkbox\Deadline8\logs/LAPRO0962.currentjobinfo
2017-10-22 09:34:02:  Scanline: Previous task didn't error, resetting counter
2017-10-22 09:34:02:  Scanline: OnSlaveStartingJob Resetting task ID
2017-10-22 09:34:03:  Scheduler Thread - Synchronization time for job files: 62.396 ms
2017-10-22 09:34:03:  Syncronizing Plugin 3dsmax from \\inferno2.scanlinevfxla.com\deadline\repository8\plugins\3dsmax
2017-10-22 09:34:03:  Syncronization complete. Took: 0 seconds
2017-10-22 09:34:03:  0: Got task!
2017-10-22 09:34:04:  0: Plugin will be reloaded: Auxiliary files for the current job have been modified.
2017-10-22 09:34:05:  0: Loaded plugin 3dsmax (\\inferno2.scanlinevfxla.com\deadline\repository8\plugins\3dsmax)
2017-10-22 09:34:05:  0: Executing plugin command of type 'Initialize Plugin'
2017-10-22 09:34:05:  0: INFO: Executing plugin script 'C:\Users\scanlinevfx\AppData\Local\Thinkbox\Deadline8\slave\lapro0962\plugins\59ecc872029499cadcc33c26\3dsmax.py'
2017-10-22 09:34:06:  0: INFO: About: 3dsmax Plugin for Deadline
2017-10-22 09:34:06:  0: INFO: The job's environment will be merged with the current environment before rendering
2017-10-22 09:34:06:  0: Done executing plugin command of type 'Initialize Plugin'
2017-10-22 09:34:06:  0: Start Job timeout is disabled.
2017-10-22 09:34:06:  0: Task timeout is 3216 seconds (Regular Task Timeout)
2017-10-22 09:34:06:  0: Loaded job: [MEG] MM_000_0670_v0044_str_test3_cache_flowline_FL-Foam_0 (59ecc872029499cadcc33c26)
...
... Render Process
...
2017-10-22 09:36:28:  0: Done executing plugin command of type 'Render Task'
2017-10-22 09:36:28:  0: Render time for frame(s): 2.368 m
2017-10-22 09:36:28:  0: Total time for task: 2.443 m
2017-10-22 09:36:29:  0: Saving task log...

Log from lapro1715A:

2017-10-22 09:34:18:  Scheduler - Performing Job scan on Primary Pools with scheduling order Pool, Weighted, Balanced
2017-10-22 09:34:18:  Scheduler - Using enhanced scheduler balancing
2017-10-22 09:34:18:  Scheduler - Slave has already completed a task for maintenance job 59ea1d9084266eadf829b398, skipping this job.
2017-10-22 09:34:18:  Scheduler - Job chooser found no jobs.
2017-10-22 09:34:18:  Scheduler - Performing Job scan on Secondary Pools with scheduling order Pool, Weighted, Balanced
2017-10-22 09:34:18:  Scheduler - Using enhanced scheduler balancing
2017-10-22 09:34:18:  Scheduler - Returning 59e10ec34f9a2d5e9c7547d3
2017-10-22 09:34:18:  Scheduler - Preliminary check: The 59e91b97a008043a901c49f0 limit is maxed out.
2017-10-22 09:34:18:  Scheduler - Preliminary check: Slave is not whitelisted for mx_flowlinesim limit.
2017-10-22 09:34:18:  Scheduler - Returning 59ea4b2e4f9a2ddc1c53277f
2017-10-22 09:34:18:  Scheduler - Returning 59eb16b3ac69e20fbcc874ec
2017-10-22 09:34:18:  Scheduler - Returning 59eb1bdbac69e21c6cf9b12c
2017-10-22 09:34:18:  Scheduler - Returning 59ebb5034f9a2d09d0353f55
2017-10-22 09:34:18:  Scheduler - Returning 59e2573c4f9a2dbd5034e457
2017-10-22 09:34:18:  Scheduler - Returning 59ea6ea86e82719ce08ef60e
2017-10-22 09:34:18:  Scheduler - Returning 59e9e1f21dbf0317e0d9c97d
2017-10-22 09:34:18:  Scheduler - Returning 59ec0b5c71355a5ab4ef52c6
2017-10-22 09:34:18:  Scheduler - Successfully dequeued 1 task(s).  Returning.
2017-10-22 09:34:18:  Scheduler - Returning limit stubs not in use.
2017-10-22 09:34:18:  Scheduler -   returning mx_render
2017-10-22 09:34:18:  Scheduler -   returning mx_fastest_machines
2017-10-22 09:34:18:  Scheduler -   returning mx_tpcache
2017-10-22 09:34:18:  Scheduler -   returning mx_fastmachines
2017-10-22 09:34:19:  Scanline_NukeCacheCleanupEventListener.OnCall sSlaveName LAPRO1715A
2017-10-22 09:34:19:  Calling cleanupCache
2017-10-22 09:34:19:  cleanupCache called, Local Cache Root: S:/nc/
2017-10-22 09:34:19:  Calling cleanupSaveCache
2017-10-22 09:34:19:  cleanupSaveCache called
2017-10-22 09:34:19:  Info file name: S:/ns/_info
2017-10-22 09:34:19:  Info file found
2017-10-22 09:34:19:  Skip any size checks as long as cygwin is missing on VMs.
2017-10-22 09:34:19:  removing cache with commands: [u'del /f/s/q "S:\\ns" > nul', u'rmdir /s/q "S:\\ns"']
2017-10-22 09:34:19:  Running: del /f/s/q "S:\ns" > nul
2017-10-22 09:34:19:  Running: rmdir /s/q "S:\ns"
2017-10-22 09:34:19:  cmd completed with exit status 0
2017-10-22 09:34:19:  Scanline: OnSlaveStartingJob sSlaveName LAPRO1715A
2017-10-22 09:34:19:  Scanline: OnSlaveStartingJob sJobID 59ecc872029499cadcc33c26
2017-10-22 09:34:19:  Scanline: RemoveJobFolderContents event called LAPRO1715A
2017-10-22 09:34:19:  Scanline: cleaning sJobsDataFolder: C:\Users\scanlinevfx\AppData\Local\Thinkbox\Deadline8\slave\LAPRO1715A\jobsData
2017-10-22 09:34:19:  Scanline: removing : C:\Users\scanlinevfx\AppData\Local\Thinkbox\Deadline8\slave\LAPRO1715A\jobsData\59ec039b4f9a2d361c5c6d38
2017-10-22 09:34:19:  Scanline: RenderTasks sFileNameForDataIni : C:\ProgramData\Thinkbox\Deadline8\logs/LAPRO1715A.currentjobinfo
2017-10-22 09:34:19:  Scanline: Previous task didn't error, resetting counter
2017-10-22 09:34:19:  Scanline: OnSlaveStartingJob Resetting task ID
2017-10-22 09:34:19:  Scheduler Thread - Synchronization time for job files: 46.800 ms
2017-10-22 09:34:19:  Syncronizing Plugin 3dsmax from \\inferno2.scanlinevfxla.com\deadline\repository8\plugins\3dsmax
2017-10-22 09:34:20:  Syncronization complete. Took: 0 seconds
2017-10-22 09:34:20:  0: Got task!
2017-10-22 09:34:20:  0: Plugin will be reloaded: Auxiliary files for the current job have been modified.
2017-10-22 09:34:21:  0: Loaded plugin 3dsmax (\\inferno2.scanlinevfxla.com\deadline\repository8\plugins\3dsmax)
2017-10-22 09:34:22:  0: Executing plugin command of type 'Initialize Plugin'
2017-10-22 09:34:22:  0: INFO: Executing plugin script 'C:\Users\scanlinevfx\AppData\Local\Thinkbox\Deadline8\slave\lapro1715a\plugins\59ecc872029499cadcc33c26\3dsmax.py'
2017-10-22 09:34:23:  0: INFO: About: 3dsmax Plugin for Deadline
2017-10-22 09:34:23:  0: INFO: The job's environment will be merged with the current environment before rendering
2017-10-22 09:34:23:  0: Done executing plugin command of type 'Initialize Plugin'
...
...
...
2017-10-22 09:36:56:  0: Done executing plugin command of type 'Render Task'
2017-10-22 09:36:56:  0: Render time for frame(s): 2.555 m
2017-10-22 09:36:56:  0: Total time for task: 2.630 m
2017-10-22 09:36:57:  0: Saving task log...

We have seen this too, in conjunction with limits being ignored on Deadline 8 with Nuke jobs.

Yeah, something is really off with the limit handling…

(attached screenshot: Capture.PNG)

These jobs had a machine limit of 50 each, which we reduced to 5/10/5, and then requeued ~40 tasks to bring the active task count below the limits. Despite that, over the next 5-9 minutes more machines than the allocated limit picked them up again.
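
For anyone else trying to pin this down, one way to spot the overshoot after the fact is to scan the Slave logs for the "Loaded job" lines shown above and count how many distinct machines loaded the same job. A minimal, stdlib-only sketch (the log line format is taken from the excerpts above; the job IDs and limits in MACHINE_LIMITS are placeholders you would fill in yourself):

import os
import re
import sys
from collections import defaultdict

# Matches lines like the ones in the excerpts above, e.g.
# "2017-10-22 09:34:06:  0: Loaded job: [MEG] ... (59ecc872029499cadcc33c26)"
LOADED_JOB = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}):.*Loaded job: .*\(([0-9a-f]{24})\)"
)

# Placeholder table: job ID -> configured machine limit.
MACHINE_LIMITS = {"59ecc872029499cadcc33c26": 1}

def scan(log_paths):
    """Return job ID -> set of (slave, timestamp) pairs that loaded it,
    assuming one log file per Slave, named after the Slave."""
    loads = defaultdict(set)
    for path in log_paths:
        slave = os.path.basename(path).split(".")[0].lower()
        with open(path, errors="replace") as fh:
            for line in fh:
                m = LOADED_JOB.match(line)
                if m:
                    loads[m.group(2)].add((slave, m.group(1)))
    return loads

if __name__ == "__main__":
    for job_id, hits in scan(sys.argv[1:]).items():
        limit = MACHINE_LIMITS.get(job_id)
        slaves = sorted({s for s, _ in hits})
        if limit is not None and len(slaves) > limit:
            print("job %s: machine limit %d, but %d slaves loaded it: %s"
                  % (job_id, limit, len(slaves), ", ".join(slaves)))

Run over the two logs above, this would flag job 59ecc872029499cadcc33c26 as having been loaded by both lapro0962 and lapro1715a within about 20 seconds of each other.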

We fixed an issue in 10 recently where the Machine Limit didn’t allow multiple Slaves on one machine to render at the same time. So if you had, say, eight Slaves and the first took a token, the rest would sit idle (a problem for Redshift folks). It should be in SP5 if I remember right, but I’m not seeing it in the changelog.

That box in the job should limit by Slaves, so as long as concurrent tasks aren’t enabled we should be good to go. To reproduce: create a job with a machine limit of one, run multiple Slaves (I’ll do this all on my machine), and see if more than one Slave picks it up. The version is 10, I assume?
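
If it helps, here is roughly how I would script that repro: submit a trivial job with a machine limit of 1 and enough tasks that several Slaves have work to grab, then start a handful of Slaves and watch whether more than one picks the job up at once. This is an untested sketch; the CommandLine plugin name, the job info keys (MachineLimit, ConcurrentTasks, ChunkSize) and the deadlinecommand path are written from memory, so check them against your repository first:

import os
import subprocess
import tempfile

# Adjust to your install; this path is just an assumption.
DEADLINE_COMMAND = r"C:\Program Files\Thinkbox\Deadline8\bin\deadlinecommand.exe"

job_info = "\n".join([
    "Plugin=CommandLine",      # assumed stock plugin name
    "Name=machine_limit_repro",
    "Frames=0-19",             # 20 one-frame tasks
    "ChunkSize=1",
    "MachineLimit=1",          # the setting under test
    "ConcurrentTasks=1",
])

plugin_info = "\n".join([
    # Something slow enough that tasks from several Slaves would overlap.
    r"Executable=C:\Windows\System32\ping.exe",
    "Arguments=-n 60 127.0.0.1",
])

def write_tmp(text, suffix):
    fd, path = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(fd, "w") as fh:
        fh.write(text + "\n")
    return path

# deadlinecommand <jobInfoFile> <pluginInfoFile> submits the job.
subprocess.check_call([
    DEADLINE_COMMAND,
    write_tmp(job_info, ".job"),
    write_tmp(plugin_info, ".plugin"),
])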

For Ricardo: I’ll go look up the thread, but I thought there were issues with Nuke being open.

In our case, all machines are running a single Slave, yet machine A and machine B end up rendering the same job even though the job has a machine limit of 1.

(We do have machines with multiple Slaves, but they don’t render the same type of jobs; the primary Slave is for 3D, the secondary is Nuke only.)

Confirmed that on 10 things are working as expected, at least. 8 is going to be another problem entirely, since things have been fixed along the way in 9 and 10.

There’s no possibility of moving to 10 for these sim jobs, I’m assuming? The only temporary workaround I can think of is locking each job to a specific machine, but that’s hard to automate.
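
Until this is sorted out, one way to make the lock-to-one-machine workaround less manual might be to pick a target machine at submission time and write it into the job’s whitelist along with the machine limit. Another untested sketch; Whitelist is a job info key as far as I remember, and the machine pool below is just a placeholder:

import random

# Placeholder pool of sim-capable machines; in practice you would query
# your own machine list or pin specific hosts per show.
SIM_MACHINES = ["lapro0962", "lapro1715a"]

def lock_to_one_machine(job_info_lines):
    """Append a single-machine whitelist (plus MachineLimit=1) to the
    job info key/value lines before submission."""
    machine = random.choice(SIM_MACHINES)
    return list(job_info_lines) + [
        "MachineLimit=1",
        "Whitelist=%s" % machine,   # assumed key for restricting Slaves
    ]

print("\n".join(lock_to_one_machine(["Plugin=3dsmax", "Name=flowline_sim"])))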

You mentioned the multiple-Slaves-on-one-machine issue being fixed in Deadline 10, but that’s not the issue we are having.

We are still unable to switch to Deadline 10 because of the PYTHONPATH behavior change :-\
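
Slightly off topic, but if the PYTHONPATH behaviour change is the main blocker for 10, pinning the environment per job at submission might sidestep it: the job info format has EnvironmentKeyValueN entries, and (per the "environment will be merged" lines in the logs above) they get merged into the render environment. Untested sketch; the key name is from memory and the path is a placeholder:

def add_job_environment(job_info_lines, env):
    """Append EnvironmentKeyValue entries so the job carries its own
    PYTHONPATH instead of relying on whatever the Slave inherits."""
    lines = list(job_info_lines)
    for i, (key, value) in enumerate(sorted(env.items())):
        lines.append("EnvironmentKeyValue%d=%s=%s" % (i, key, value))
    return lines

print("\n".join(add_job_environment(
    ["Plugin=3dsmax", "Name=flowline_sim"],
    {"PYTHONPATH": r"\\server\tools\python"},   # placeholder path
)))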

@Eamsler

Yeah, our issue is probably a combination of circumstances, but I can confirm that I have seen Deadline 8 sometimes (not always) ignore the machine limit.
