[Deadline8] Enforce sequential fail?

Hi there,

As far as I understand, the ‘enforce sequential rendering’ option on jobs has an undocumented side effect: a machine rendering such a job will not leave it, even if something with a higher priority/weight pops up on the farm. Did this behavior change? We see sim machines hop around quite a bit, even though this setting is enabled…

[screenshot: Capture2.PNG]

This adds minute-long delays to the sim process. And when it’s a spatially parallel, clustered simulation, that ruins the parallelism, since the machines are synchronized: as one machine drops off its job and another takes it over, all machines in the other clusters have to wait until the newcomer starts Max, loads the file, etc.

Maybe I misunderstand, but with the ‘rendering task buffer’ set to 2, single-machine sim jobs should never lose their Slave, right?
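
For context, here is roughly how one of these single-machine sim jobs gets submitted. This is only a hedged sketch, not our actual submission pipeline: the key names follow Deadline’s manual job submission docs as I understand them, and every value below is a placeholder.

import os
import subprocess
import tempfile

# Sketch of a single-machine sequential sim submission (Deadline 8 era).
# Key names per the manual-submission docs; all values are placeholders.
job_info = {
    "Plugin": "3dsmax",
    "Name": "example_sim_job",     # placeholder
    "Pool": "sim",                 # placeholder primary pool
    "SecondaryPool": "3d",         # placeholder secondary pool
    "Group": "mx_flowlinesim",
    "Priority": 90,
    "MachineLimit": 1,             # one Slave per cluster member
    "SequentialJob": "true",       # the option in question
    "Frames": "0-100",
    "ChunkSize": 1,
}

# Write the job info file that deadlinecommand expects.
fd, job_file = tempfile.mkstemp(suffix="_job_info.job")
with os.fdopen(fd, "w") as f:
    for key, value in job_info.items():
        f.write("{0}={1}\n".format(key, value))

# plugin_info.job (scene path, Max version, etc.) is omitted here and
# assumed to exist; deadlinecommand is assumed to be on the PATH.
subprocess.call(["deadlinecommand", job_file, "plugin_info.job"])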

The log from lapro1422 when it moves to another job (the job’s ID is 59a6d3997a3a9fea20dda02a):

2017-08-30 08:04:48:  0: INFO: sVrayLogFile: %TEMP%\VRayLog.txt
2017-08-30 08:04:48:  0: INFO: Network log file does not exist: %TEMP%\VRayLog.txt
2017-08-30 08:04:48:  0: Done executing plugin command of type 'Render Task'
2017-08-30 08:04:48:  0: Render time for frame(s): 1.570 m
2017-08-30 08:04:48:  0: Total time for task: 1.772 m
2017-08-30 08:04:49:  0: Saving task log...
2017-08-30 08:04:50:  Scheduler Thread - Render Thread 0 completed its task
2017-08-30 08:04:50:  Scheduler Thread - Seconds before next job scan: 1
2017-08-30 08:04:51:  Scheduler - Performing Job scan on Primary Pools with scheduling order Pool, Weighted, Balanced
2017-08-30 08:04:51:  Scheduler - Using enhanced scheduler balancing
2017-08-30 08:04:51:  Scheduler - Preliminary check: Slave is not whitelisted for 598ba6c03e137238b4db3ff5 limit.
2017-08-30 08:04:51:  Scheduler - Preliminary check: Slave is not whitelisted for 59724e9dcc361910288c444d limit.
2017-08-30 08:04:52:  Scheduler - Preliminary check: Slave is not whitelisted for 5994e1113e13722aa43f93b0 limit.
2017-08-30 08:04:52:  Scheduler - Preliminary check: Slave is not whitelisted for mx_workstation_gui limit.
2017-08-30 08:04:52:  Scheduler - Returning 59a49ec6bb74cfb220e619f0
2017-08-30 08:04:52:  Scheduler - Successfully dequeued 1 task(s).  Returning.
2017-08-30 08:04:52:  0: Shutdown
2017-08-30 08:04:52:  0: Exited ThreadMain(), cleaning up...
2017-08-30 08:04:52:  0: Executing plugin command of type 'End Job'
2017-08-30 08:04:52:  0: INFO: End Job called - shutting down 3dsmax plugin
2017-08-30 08:04:57:  0: WARNING: Timed out waiting for the renderer to close.
2017-08-30 08:04:57:  0: WARNING: Did not receive a success message in response to EndJob: 
2017-08-30 08:04:57:  0: INFO: Disconnecting socket connection to 3dsmax
2017-08-30 08:04:57:  0: INFO: Waiting for 3dsmax to shut down
2017-08-30 08:04:57:  0: INFO: 3dsmax has shut down
2017-08-30 08:04:57:  0: Done executing plugin command of type 'End Job'
2017-08-30 08:04:57:  0: Stopped job: [TEST] TST_000_0001_v0009_lse_packet10_cache_flowline_Element_0 [MAXSCRIPT]
2017-08-30 08:04:57:  0: Unloading plugin: 3dsmax
2017-08-30 08:04:58:  Scheduler - Returning limit stubs not in use.
2017-08-30 08:04:58:  Scheduler -   returning rendernodes
2017-08-30 08:04:58:  Scheduler -   returning 59a6d3997a3a9fea20dda02a
2017-08-30 08:04:58:  Scheduler -   returning win7
2017-08-30 08:05:00:  Scheduler Thread - Synchronization time for job files: 62.398 ms
2017-08-30 08:05:00:  Syncronizing Plugin 3dsmax from \\inferno2\deadline\repository8\plugins\3dsmax
2017-08-30 08:05:01:  Syncronization complete. Took: 1 seconds
2017-08-30 08:05:01:  0: Got task!
2017-08-30 08:05:01:  0: Plugin will be reloaded: Auxiliary files for the current job have been modified.
2017-08-30 08:05:02:  0: Loaded plugin 3dsmax (\\inferno2\deadline\repository8\plugins\3dsmax)
2017-08-30 08:05:03:  0: Executing plugin command of type 'Initialize Plugin'
2017-08-30 08:05:03:  0: INFO: Executing plugin script 'S:\deadlinedata\lapro1422\plugins\59a6d17bbb649b96488ffd89\3dsmax.py'

As far as I know it should still work this way. I’ve pinged the core lead to see if something was purposely changed.

Mmmm, yeah, there definitely weren’t any intentional changes made to the behaviour of Sequential Jobs. This sounds like a regression; we’ll look into it right away.

Cheers,
Jon

Thanks Jon! It’s using 8.0.17.1, for your reference.

The log snippet you provided gives us some hints as to what might be going on. It’s hard to know for sure without getting a little more detail, though. Can you post the full log from the Slave that picked up the sequential job and then erroneously moved to the other job?

In particular, I’m interested in the scheduler logic that would have occurred around when the Slave originally picked up the sequential job.

The properties of the job that it moved to would be helpful as well (including pool, secondary pool, group – the same screenshot as you posted for the first job).

Just an update here. Looks like the guys pushed code last night to fix this up. It’ll be in the next build of 10. Checking to see if it was a known problem in 9 and whether we need to back-port.

Hi Edwin,

Thanks for the feedback! I just noticed Eric’s response; do you guys still need those logs?

cheers
laszlo

The previous jobs mentioned in this thread have rolled off the queue, but here are some logs and props from an active job that shows the same symptoms:

Lapro1961A picks up the job, then drops it for another. Original job:
[screenshot: Capture.PNG]

Its props:
[screenshot: Capture2.PNG]

The job it moved to (has no group…):
[screenshot: Capture3.PNG]

The original pickup log from lapro1961A (odd exception there…):

2017-09-12 15:48:57:  0: Done executing plugin command of type 'Render Task'
2017-09-12 15:48:57:  0: Render time for frame(s): 1.484 m
2017-09-12 15:48:57:  0: Total time for task: 1.500 m
2017-09-12 15:48:58:  0: Saving task log...
2017-09-12 15:48:59:  Scheduler Thread - Render Thread 0 completed its task
2017-09-12 15:48:59:  Scheduler Thread - Seconds before next job scan: 1
2017-09-12 15:49:00:  Scanline_NukeCacheCleanupEventListener.OnCall sSlaveName LAPRO1961A
2017-09-12 15:49:00:  Calling cleanupCache
2017-09-12 15:49:00:  cleanupCache called, Local Cache Root: S:/nc/
2017-09-12 15:49:00:  Calling cleanupSaveCache
2017-09-12 15:49:00:  cleanupSaveCache called
2017-09-12 15:49:00:  Info file name: S:/ns/_info
2017-09-12 15:49:00:  Info file found
2017-09-12 15:49:00:  Skipping save cache check, waitin 23.9165311111 hours...
2017-09-12 15:49:01:  Scheduler - Performing Job scan on Primary Pools with scheduling order Pool, Weighted, Balanced
2017-09-12 15:49:01:  Scheduler - Using enhanced scheduler balancing
2017-09-12 15:49:01:  Scheduler - Returning 59b861a67007cfbb68c6c27b
2017-09-12 15:49:01:  Scheduler - Returning 59b85e8ff91e5a1be4b9a054
2017-09-12 15:49:01:  Scheduler - Preliminary check: The 59b8633c2f733e8c60e4c826 limit is maxed out.
2017-09-12 15:49:01:  Scheduler - Returning 59b82e9cd9984954b8958b0b
2017-09-12 15:49:01:  Scheduler - Job chooser found no jobs.
2017-09-12 15:49:01:  Scheduler - Performing Job scan on Secondary Pools with scheduling order Pool, Weighted, Balanced
2017-09-12 15:49:01:  Scheduler - Using enhanced scheduler balancing
2017-09-12 15:49:01:  Scheduler - Successfully dequeued 1 task(s).  Returning.
2017-09-12 15:49:01:  0: Shutdown
2017-09-12 15:49:01:  0: Exited ThreadMain(), cleaning up...
2017-09-12 15:49:02:  0: Executing plugin command of type 'End Job'
2017-09-12 15:49:02:  0: INFO: End Job called - shutting down 3dsmax plugin
2017-09-12 15:49:04:  0: WARNING: Trapped SEH Exception in EndJob(): Access Violation
2017-09-12 15:49:04:  0: Process:  C:\3ds Max 2016\3dsmax.exe
2017-09-12 15:49:04:  0: Module:  C:\3ds Max 2016\stdplugs\ParticleView.dlo
2017-09-12 15:49:04:  0: Date Modified:  09/14/2016
2017-09-12 15:49:04:  0: Exception Code:  C0000005
2017-09-12 15:49:04:  0: Read Address:  00000000
2017-09-12 15:49:04:  0: Instruction:  48 8B 01 48 8B D9 FF 90 78 08 00 00 48 8B F8 48
2017-09-12 15:49:04:  0: Call Stack:
2017-09-12 15:49:04:  0: 30DC0000  C:\3ds Max 2016\stdplugs\ParticleView.dlo
2017-09-12 15:49:04:  0:   +00044CBA  Exception Offset
2017-09-12 15:49:04:  0: WARNING: Did not receive a success message in response to EndJob: ERROR: Trapped SEH Exception in EndJob(): Access Violation
2017-09-12 15:49:04:  0: Process:  C:\3ds Max 2016\3dsmax.exe
2017-09-12 15:49:04:  0: Module:  C:\3ds Max 2016\stdplugs\ParticleView.dlo
2017-09-12 15:49:04:  0: Date Modified:  09/14/2016
2017-09-12 15:49:04:  0: Exception Code:  C0000005
2017-09-12 15:49:04:  0: Read Address:  00000000
2017-09-12 15:49:04:  0: Instruction:  48 8B 01 48 8B D9 FF 90 78 08 00 00 48 8B F8 48
2017-09-12 15:49:04:  0: Call Stack:
2017-09-12 15:49:04:  0: 30DC0000  C:\3ds Max 2016\stdplugs\ParticleView.dlo
2017-09-12 15:49:04:  0:   +00044CBA  Exception Offset
2017-09-12 15:49:04:  0: INFO: Disconnecting socket connection to 3dsmax
2017-09-12 15:49:04:  0: INFO: Waiting for 3dsmax to shut down
2017-09-12 15:49:04:  0: INFO: 3dsmax has shut down
2017-09-12 15:49:04:  0: Done executing plugin command of type 'End Job'
2017-09-12 15:49:04:  0: Stopped job: [DRIFT] UF_095_0115_v0014_npf_test_images_render3d_FL-ShockwaveMist_0
2017-09-12 15:49:04:  0: Unloading plugin: 3dsmax
2017-09-12 15:49:05:  Scheduler - Returning limit stubs not in use.
2017-09-12 15:49:05:  Scheduler -   returning mx_render
2017-09-12 15:49:05:  Scheduler -   returning 59b824bf72f5ac4b500e7182
2017-09-12 15:49:05:  Scheduler -   returning vray3
2017-09-12 15:49:05:  Scheduler -   returning mx_flowlinerender
2017-09-12 15:49:05:  Scheduler -   returning vrayformaya
2017-09-12 15:49:05:  Scheduler -   returning win7
2017-09-12 15:49:06:  Scanline_NukeCacheCleanupEventListener.OnCall sSlaveName LAPRO1961A
2017-09-12 15:49:06:  Calling cleanupCache
2017-09-12 15:49:06:  cleanupCache called, Local Cache Root: S:/nc/
2017-09-12 15:49:06:  Calling cleanupSaveCache
2017-09-12 15:49:06:  cleanupSaveCache called
2017-09-12 15:49:06:  Info file name: S:/ns/_info
2017-09-12 15:49:06:  Info file found
2017-09-12 15:49:06:  Skipping save cache check, waitin 23.9150586111 hours...
2017-09-12 15:49:06:  Scanline: OnSlaveStartingJob sSlaveName LAPRO1961A
2017-09-12 15:49:06:  Scanline: OnSlaveStartingJob sJobID 59b8645dbb649b124430f6e1
2017-09-12 15:49:06:  Scanline: RemoveJobFolderContents event called LAPRO1961A
2017-09-12 15:49:06:  Scanline: cleaning sJobsDataFolder: C:\Users\scanlinevfx\AppData\Local\Thinkbox\Deadline8\slave\LAPRO1961A\jobsData
2017-09-12 15:49:06:  Scanline: removing : C:\Users\scanlinevfx\AppData\Local\Thinkbox\Deadline8\slave\LAPRO1961A\jobsData\59b824bf72f5ac4b500e7182
2017-09-12 15:49:06:  Scanline: RenderTasks sFileNameForDataIni : C:\ProgramData\Thinkbox\Deadline8\logs/LAPRO1961A.currentjobinfo
2017-09-12 15:49:06:  Scanline: Previous task didn't error, resetting counter
2017-09-12 15:49:06:  Scanline: OnSlaveStartingJob Resetting task ID
2017-09-12 15:49:06:  Scheduler Thread - Synchronization time for job files: 84.000 ms
2017-09-12 15:49:06:  Syncronizing Plugin 3dsmax from \\inferno2\deadline\repository8\plugins\3dsmax
2017-09-12 15:49:07:  Syncronization complete. Took: 0 seconds
2017-09-12 15:49:07:  0: Got task!
2017-09-12 15:49:07:  0: Plugin will be reloaded: Auxiliary files for the current job have been modified.
2017-09-12 15:49:08:  0: Loaded plugin 3dsmax (\\inferno2\deadline\repository8\plugins\3dsmax)
2017-09-12 15:49:09:  0: Executing plugin command of type 'Initialize Plugin'
2017-09-12 15:49:10:  0: INFO: Executing plugin script 'C:\Users\scanlinevfx\AppData\Local\Thinkbox\Deadline8\slave\lapro1961a\plugins\59b8645dbb649b124430f6e1\3dsmax.py'
2017-09-12 15:49:10:  0: INFO: About: 3dsmax Plugin for Deadline
2017-09-12 15:49:10:  0: INFO: The job's environment will be merged with the current environment before rendering
2017-09-12 15:49:10:  0: Done executing plugin command of type 'Initialize Plugin'
2017-09-12 15:49:10:  0: Start Job timeout is disabled.
2017-09-12 15:49:10:  0: Task timeout is 24000 seconds (Regular Task Timeout)

Then when it drops the job:

2017-09-12 16:06:58:  0: Render time for frame(s): 2.530 m
2017-09-12 16:06:58:  0: Total time for task: 2.545 m
2017-09-12 16:06:59:  0: Saving task log...
2017-09-12 16:07:00:  Scheduler Thread - Render Thread 0 completed its task
2017-09-12 16:07:00:  Scheduler Thread - Seconds before next job scan: 1
2017-09-12 16:07:01:  Scanline_NukeCacheCleanupEventListener.OnCall sSlaveName LAPRO1961A
2017-09-12 16:07:01:  Calling cleanupCache
2017-09-12 16:07:01:  cleanupCache called, Local Cache Root: S:/nc/
2017-09-12 16:07:01:  Calling cleanupSaveCache
2017-09-12 16:07:01:  cleanupSaveCache called
2017-09-12 16:07:01:  Info file name: S:/ns/_info
2017-09-12 16:07:01:  Info file found
2017-09-12 16:07:01:  Skip any size checks as long as cygwin is missing on VMs.
2017-09-12 16:07:01:  removing cache with commands: [u'del /f/s/q "S:\\ns" > nul', u'rmdir /s/q "S:\\ns"']
2017-09-12 16:07:01:  Running: del /f/s/q "S:\ns" > nul
2017-09-12 16:07:01:  Running: rmdir /s/q "S:\ns"
2017-09-12 16:07:01:  cmd completed with exit status 0
2017-09-12 16:07:02:  Scheduler - Performing Job scan on Primary Pools with scheduling order Pool, Weighted, Balanced
2017-09-12 16:07:02:  Scheduler - Using enhanced scheduler balancing
2017-09-12 16:07:02:  Scheduler - Preliminary check: Slave is not whitelisted for mx_workstation_gui limit.
2017-09-12 16:07:02:  Scheduler - Successfully dequeued 1 task(s).  Returning.
2017-09-12 16:07:02:  0: Shutdown
2017-09-12 16:07:02:  0: Exited ThreadMain(), cleaning up...
2017-09-12 16:07:02:  0: Executing plugin command of type 'End Job'
2017-09-12 16:07:02:  0: INFO: End Job called - shutting down 3dsmax plugin
2017-09-12 16:07:02:  0: INFO: scLog: info versionizeAPI.fnPostRendererChanged
2017-09-12 16:07:02:  0: INFO: scLog: info versionizeAPI.setVFBLUT
2017-09-12 16:07:03:  0: INFO: scLog: info setRayCasterMemLimit Result: Vray Dynamic Memory Limit automatically increased to 40GB!
2017-09-12 16:07:03:  0: INFO: scLog: info setHDRITexMemLimit Result: VrayHDRIVray Texture Cache Size set to 32GB! Renders may take too long, consider increasing.
2017-09-12 16:07:03:  0: INFO: scLog: info versionizeAPI.fnPostRendererChanged
2017-09-12 16:07:03:  0: INFO: scLog: info versionizeAPI.setVFBLUT
2017-09-12 16:07:03:  0: INFO: scLog: info setRayCasterMemLimit Result: Vray Dynamic Memory Limit below 64GB, render may take too long.
2017-09-12 16:07:03:  0: INFO: scLog: info setHDRITexMemLimit Result: VrayHDRIVray Texture Cache Size below 64GB, render may take too long.
2017-09-12 16:07:03:  0: INFO: scLog: info versionizeAPI.fnPostRendererChanged
2017-09-12 16:07:03:  0: INFO: scLog: info versionizeAPI.setVFBLUT
2017-09-12 16:07:03:  0: INFO: scLog: info setRayCasterMemLimit Result: Vray Dynamic Memory Limit below 64GB, render may take too long.
2017-09-12 16:07:03:  0: INFO: scLog: info setHDRITexMemLimit Result: VrayHDRIVray Texture Cache Size below 64GB, render may take too long.
2017-09-12 16:07:03:  0: INFO: scLog: info versionizeAPI.fnPostRendererChanged
2017-09-12 16:07:03:  0: INFO: scLog: info versionizeAPI.setVFBLUT
2017-09-12 16:07:03:  0: INFO: scLog: info setRayCasterMemLimit Result: Vray Dynamic Memory Limit below 64GB, render may take too long.
2017-09-12 16:07:03:  0: INFO: scLog: info setHDRITexMemLimit Result: VrayHDRIVray Texture Cache Size below 64GB, render may take too long.
2017-09-12 16:07:07:  0: WARNING: Timed out waiting for the renderer to close.
2017-09-12 16:07:07:  0: WARNING: Did not receive a success message in response to EndJob: 
2017-09-12 16:07:07:  0: INFO: Disconnecting socket connection to 3dsmax
2017-09-12 16:07:07:  0: INFO: Waiting for 3dsmax to shut down
2017-09-12 16:07:07:  0: INFO: 3dsmax has shut down
2017-09-12 16:07:07:  0: Done executing plugin command of type 'End Job'
2017-09-12 16:07:07:  0: Stopped job: [FALL] KCA_000_0560_v0303_lle_F0_01_cache_flowline_WFHappyS14_1
2017-09-12 16:07:07:  0: Unloading plugin: 3dsmax
2017-09-12 16:07:08:  Scheduler - Returning limit stubs not in use.
2017-09-12 16:07:08:  Scheduler -   returning rendernodes
2017-09-12 16:07:08:  Scheduler -   returning mx_flowlinesim
2017-09-12 16:07:08:  Scheduler -   returning mx_28cores
2017-09-12 16:07:08:  Scheduler -   returning 59b8645dbb649b124430f6e1
2017-09-12 16:07:08:  Scanline_NukeCacheCleanupEventListener.OnCall sSlaveName LAPRO1961A
2017-09-12 16:07:08:  Calling cleanupCache
2017-09-12 16:07:08:  cleanupCache called, Local Cache Root: S:/nc/
2017-09-12 16:07:08:  Calling cleanupSaveCache
2017-09-12 16:07:08:  cleanupSaveCache called
2017-09-12 16:07:08:  Info file name: S:/ns/_info
2017-09-12 16:07:08:  Info file found
2017-09-12 16:07:08:  Skipping save cache check, waitin 23.9979741667 hours...
2017-09-12 16:07:09:  Scanline: OnSlaveStartingJob sSlaveName LAPRO1961A
2017-09-12 16:07:09:  Scanline: OnSlaveStartingJob sJobID 59b86895af3138203c0078c2
2017-09-12 16:07:09:  Scanline: RemoveJobFolderContents event called LAPRO1961A
2017-09-12 16:07:09:  Scanline: cleaning sJobsDataFolder: C:\Users\scanlinevfx\AppData\Local\Thinkbox\Deadline8\slave\LAPRO1961A\jobsData
2017-09-12 16:07:09:  Scanline: removing : C:\Users\scanlinevfx\AppData\Local\Thinkbox\Deadline8\slave\LAPRO1961A\jobsData\59b8645dbb649b124430f6e1
2017-09-12 16:07:09:  Scanline: RenderTasks sFileNameForDataIni : C:\ProgramData\Thinkbox\Deadline8\logs/LAPRO1961A.currentjobinfo
2017-09-12 16:07:09:  Scanline: Previous task didn't error, resetting counter
2017-09-12 16:07:09:  Scanline: OnSlaveStartingJob Resetting task ID
2017-09-12 16:07:09:  Scheduler Thread - Synchronization time for job files: 56.000 ms
2017-09-12 16:07:09:  Syncronizing Plugin MayaBatch from \\inferno2\deadline\repository8\plugins\MayaBatch
2017-09-12 16:07:09:  Syncronization complete. Took: 0 seconds
2017-09-12 16:07:09:  0: Got task!

I’ll find a few other examples as well, to see whether it’s a common pattern that the second job the Slave moves to has no group.

Another one I found also had the Slave move over to a ‘groupless’ job…

Thanks for the new data. That was exactly what I was looking for.

What is happening is that the sequential job is being picked up during the secondary pool scan (note the log output, and that the secondary pool for that first job is ‘3d’). When the Slave finishes a task and performs another primary scan, it finds the second job there, because the job it moved to has ‘3d’ as its primary pool.

Normally, the current job is moved to the front of the scheduling queue if sequential rendering is on. However, since that first job does not show up in the primary scan (‘3d’ is not its primary pool), it is not there to be moved to the front. That is what leads the scheduler to hop onto that second job.

This corner case is a bug in our scheduling logic and we’ll get that fixed up.
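
To paraphrase the failure mode in pseudocode (illustrative only; this is not our actual scheduler source, and every name in it is invented):

# Illustrative sketch of the scheduling flaw described above; not
# Deadline source. All classes, attributes, and names are invented.

def choose_next_task(slave, jobs):
    # Primary scan: only jobs whose *primary* pool matches the Slave.
    candidates = [j for j in jobs if j.pool in slave.pools]

    # Intended sequential behaviour: promote the Slave's current
    # sequential job to the front so it never gets preempted.
    current = slave.current_job
    if current is not None and current.sequential and current in candidates:
        candidates.remove(current)
        candidates.insert(0, current)
    # BUG: if `current` was dequeued via its *secondary* pool ('3d' in
    # this thread), it is absent from `candidates`, so it never gets
    # promoted, and any other job whose *primary* pool is '3d' can win.

    for job in candidates:
        task = job.dequeue_task(slave)
        if task is not None:
            return task

    # Secondary scan: the same walk over secondary pools; this is the
    # scan that originally picked up the sequential job.
    for job in (j for j in jobs if j.secondary_pool in slave.pools):
        task = job.dequeue_task(slave)
        if task is not None:
            return task
    return None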

Awesome, thank you!
So even if the machine picks up a job via its secondary pool (which may be a lower-priority pool on that box), would it still stick to its current job post-fix if another job from a higher-priority pool is sent to the farm? (It would be preferred if it stuck.)

The intended behaviour is that no matter how the job was picked up and no matter what other jobs are available, the preferred job to “stick” to would be the sequential one that the Slave is currently rendering.
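
In terms of the sketch above, the shape of the fix is to check the current job before either pool scan (again, invented names; a sketch of the idea, not the actual change):

def choose_next_task_fixed(slave, jobs):
    # Check the current sequential job first, independent of which pool
    # scan originally dequeued it, so it always wins.
    current = slave.current_job
    if current is not None and current.sequential:
        task = current.dequeue_task(slave)
        if task is not None:
            return task
    # Otherwise fall back to the normal primary/secondary scans.
    return choose_next_task(slave, jobs)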

Awesome, thanks! That’s exactly what we are after.
