job abandoned

Hi there,

Testing the 8.0.1.0 build on our staging server: a very simple setup, 1 job and a total of 1 slave.
I noticed, however, that the slave abandoned the job at frame 1007, only to rejoin it right away (at the cost of a 3ds Max restart and scene reload):

[screenshot attachment: Capture.PNG]

Here is the log from around that time. It starts with the last lines of the previous frame for this job (1006), then it says “Job chooser found no jobs”… err, why? The job is still there, so much so that the slave picks it right back up moments later. There are no other active slaves in this repository, and the job has no machine limits.

2016-05-18 17:44:11:  0: Done executing plugin command of type 'Render Task'
2016-05-18 17:44:11:  0: Render time for frame(s): 15.656 s
2016-05-18 17:44:11:  0: Total time for task: 19.708 s
2016-05-18 17:44:11:  0: Saving task log...
2016-05-18 17:44:12:  Scheduler Thread - Render Thread 0 completed its task
2016-05-18 17:44:12:  Scheduler Thread - Seconds before next job scan: 2
2016-05-18 17:44:14:  Scheduler Thread - Performing pending job scan...
2016-05-18 17:44:16:  Scheduler Thread - Performing repository repair...
2016-05-18 17:44:18:  Scheduler Thread - Performing house cleaning...
2016-05-18 17:44:20:  Scheduler - Performing Job scan on Primary Pools with scheduling order Pool, Weighted, Balanced
2016-05-18 17:44:20:  Scheduler - Using enhanced scheduler balancing
2016-05-18 17:44:20:  Scheduler - Limit Stub for '573d08187a3a9f1c40735b8a' (held by 'lapro3067') was removed.
2016-05-18 17:44:20:  Scheduler - Job chooser found no jobs.
2016-05-18 17:44:20:  Scheduler - Performing Job scan on Secondary Pools with scheduling order Pool, Weighted, Balanced
2016-05-18 17:44:20:  Scheduler - Using enhanced scheduler balancing
2016-05-18 17:44:20:  Scheduler - Job chooser found no jobs.
2016-05-18 17:44:20:  0: Shutdown
2016-05-18 17:44:21:  0: Exited ThreadMain(), cleaning up...
2016-05-18 17:44:21:  0: Executing plugin command of type 'End Job'
2016-05-18 17:44:21:  0: INFO: End Job called - shutting down 3dsmax plugin
2016-05-18 17:44:22:  0: INFO: scLog: info versionizeAPI.fnPostRendererChanged
2016-05-18 17:44:22:  0: INFO: scLog: info versionizeAPI.setVFBLUT
2016-05-18 17:44:23:  0: INFO: scLog: info versionizeAPI.fnPostRendererChanged
2016-05-18 17:44:23:  0: INFO: scLog: info versionizeAPI.setVFBLUT
2016-05-18 17:44:24:  0: INFO: scLog: info versionizeAPI.fnPostRendererChanged
2016-05-18 17:44:24:  0: INFO: scLog: info versionizeAPI.setVFBLUT
2016-05-18 17:44:24:  0: INFO: scLog: info versionizeAPI.fnPostRendererChanged
2016-05-18 17:44:24:  0: INFO: scLog: info versionizeAPI.setVFBLUT
2016-05-18 17:44:24:  0: INFO: Disconnecting socket connection to 3dsmax
2016-05-18 17:44:24:  0: INFO: Waiting for 3dsmax to shut down
2016-05-18 17:44:24:  0: INFO: 3dsmax has shut down
2016-05-18 17:44:24:  0: Done executing plugin command of type 'End Job'
2016-05-18 17:44:24:  0: Stopped job: [TEST] TST_000_0000_v0265_lse_test_images_render3d_render_0
2016-05-18 17:44:24:  0: Unloading plugin: 3dsmax
2016-05-18 17:44:26:  Scheduler Thread - Seconds before next job scan: 5
2016-05-18 17:44:33:  Scheduler - Performing Job scan on Primary Pools with scheduling order Pool, Weighted, Balanced
2016-05-18 17:44:33:  Scheduler - Using enhanced scheduler balancing
2016-05-18 17:44:33:  Scheduler - Successfully dequeued 1 task(s).  Returning.
2016-05-18 17:44:33:  Scheduler - Returning limit stubs not in use.
2016-05-18 17:44:33:  Scheduler Thread - All job files are already synchronized
2016-05-18 17:44:33:  Syncronizing Plugin 3dsmax from \\inferno2\deadline\test_repository8\plugins\3dsmax
2016-05-18 17:44:34:  Syncronization complete. Took: 0 seconds
2016-05-18 17:44:34:  0: Got task!
2016-05-18 17:44:37:  0: Loaded plugin 3dsmax (\\inferno2\deadline\test_repository8\plugins\3dsmax)
2016-05-18 17:44:37:  0: Executing plugin command of type 'Initialize Plugin'
2016-05-18 17:44:38:  0: INFO: Executing plugin script 'C:\Users\laszlo.sebo\AppData\Local\Thinkbox\Deadline8\slave\lapro3067\plugins\573d08187a3a9f1c40735b8a\3dsmax.py'
2016-05-18 17:44:39:  0: INFO: About: 3dsmax Plugin for Deadline
2016-05-18 17:44:39:  0: INFO: The job's environment will be merged with the current environment before rendering
2016-05-18 17:44:39:  0: Done executing plugin command of type 'Initialize Plugin'
2016-05-18 17:44:39:  0: Start Job timeout is disabled.
2016-05-18 17:44:39:  0: Task timeout is 21600 seconds (Regular Task Timeout)
2016-05-18 17:44:39:  0: Loaded job: [TEST] TST_000_0000_v0265_lse_test_images_render3d_render_0 (573d08187a3a9f1c40735b8a)
2016-05-18 17:44:39:  0: Executing plugin command of type 'Start Job'

Assuming ‘573d08187a3a9f1c40735b8a’ was the ID of the job currently being rendered, something else cleaned up that machine’s limit stub, causing it to try to move on. I thought we had fixed that bug, but I guess there’s more than one place where a stub can get returned prematurely…

Can you find the Repository Repair log that returned that stub?

I suppose either way, though, when a Slave detects that its current stub is no longer valid, it should at least try to re-acquire it before moving on.
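
Something along these lines is what I have in mind, purely as an illustrative sketch (none of these names are real Deadline calls; the acquire function is just a stand-in for whatever repository call hands out stubs):

import time

def reacquire_stub_or_give_up(limit_id, slave_name, try_acquire, max_attempts=3):
    # try_acquire(limit_id, slave_name) -> bool stands in for the internal
    # repository call that grants limit stubs; it is hypothetical here.
    for attempt in range(max_attempts):
        if try_acquire(limit_id, slave_name):
            return True           # stub regained: keep rendering, no scene reload needed
        time.sleep(2 ** attempt)  # short back-off before retrying
    return False                  # fall back to today's behaviour: end the job and requeue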

This farm has no Pulse, so the repo repair was actually running on the same machine. It’s odd that it would dequeue itself…

The repo log from the same box:

2016-05-18 17:44:18:  Performing Orphaned Limit Stub Scan...
2016-05-18 17:44:18:      Orphaned Limit Stub Scan - Loading limits
2016-05-18 17:44:18:      Orphaned Limit Stub Scan - Loaded 1 limits in 6.001 ms
2016-05-18 17:44:18:      Orphaned Limit Stub Scan - Loading slave states
2016-05-18 17:44:18:      Orphaned Limit Stub Scan - Loaded 3 slave states in 41.004 ms
2016-05-18 17:44:18:      Orphaned Limit Stub Scan - Returning orphaned limit stub '573d08187a3a9f1c40735b8a' for lapro3067: the slave is no longer running
2016-05-18 17:44:18:      Orphaned Limit Stub Scan - Returned 1 orphaned limit stubs in 59.006 ms
2016-05-18 17:44:18:      Orphaned Limit Stub Scan - Done.

And here is another, a few frames later:

2016-05-18 17:50:10:  Performing Orphaned Limit Stub Scan...
2016-05-18 17:50:10:      Orphaned Limit Stub Scan - Loading limits
2016-05-18 17:50:10:      Orphaned Limit Stub Scan - Loaded 2 limits in 5.000 ms
2016-05-18 17:50:10:      Orphaned Limit Stub Scan - Loading slave states
2016-05-18 17:50:10:      Orphaned Limit Stub Scan - Loaded 3 slave states in 38.004 ms
2016-05-18 17:50:10:      Orphaned Limit Stub Scan - Returning orphaned limit stub '573d08187a3a9f1c40735b8a' for lapro3067: the slave is no longer running
2016-05-18 17:50:10:      Orphaned Limit Stub Scan - Returned 1 orphaned limit stubs in 55.006 ms
2016-05-18 17:50:10:      Orphaned Limit Stub Scan - Done.

Interesting. I bet it returned the stub because the slave was no longer rendering… since it was busy housecleaning (in spite of the obviously inaccurate ‘no longer running’ message). Obviously it shouldn’t be doing that.

I think we changed the point where Slaves do housecleaning/repo repair/etc. to prevent starving those processes. We might have to switch it back, or at least prevent Slaves from reclaiming their own stubs in this case.
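
Roughly what I mean by that guard, as a hypothetical sketch (the attribute and setting names are made up for illustration; this is not the actual housecleaning code):

STALE_SECONDS = 600  # assumed threshold, not a real Deadline setting

def is_orphaned(stub, slave_states, scanning_slave):
    # Never let a Slave reclaim its own stub while it is busy housecleaning.
    if stub.holder == scanning_slave:
        return False
    state = slave_states.get(stub.holder)
    if state is None:
        return True   # holder unknown to the repository: safe to reclaim
    # A holder that is rendering, or that updated its state recently, is still alive.
    return not (state.is_rendering or state.seconds_since_update < STALE_SECONDS)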

Well, good question… on this staging repository these processes take just a few seconds. But against our production database they can take minutes, sometimes up to 4-5 minutes each (housecleaning, repo repair, pending job scan). So maybe returning the stubs is good behavior after all, just in case a Slave gets stuck in one of those passes for a while?
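
Maybe a time-based lease on the stub would cover both cases: it would survive a few minutes of housecleaning but still get reclaimed if the holder really disappears. Just sketching the idea, nothing here is real Deadline behaviour:

import time

LEASE_SECONDS = 10 * 60  # generous, since our production housecleaning can run 4-5 minutes

def stub_lease_expired(last_update_time, now=None):
    # last_update_time: epoch seconds of the holder's last state update (illustrative)
    now = time.time() if now is None else now
    return (now - last_update_time) > LEASE_SECONDS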

I’ll try to repro this with Pulse, and if I can’t, then I don’t think this is a major issue.
