
job finishing tasks multiple times?

This is really odd… I noticed that a job is taking way longer to render than it should, and looking at the logs, it seems like each task was rendered multiple times:

The tasks were not requeued by us. They keep going back to the queued state after completing properly, and then they get rendered again…

The last lines of the log from a completed task that then simply went back to being queued:

2014-12-31 10:47:51:  0: STDOUT: [2014/Dec/31|10:47:51] V-Ray: Total sequence time  0h  0m  8.0s (8.0 s)
2014-12-31 10:47:51:  0: STDOUT: [2014/Dec/31|10:47:51] V-Ray: Render complete
2014-12-31 10:47:51:  0: STDOUT: [2014/Dec/31|10:47:51] V-Ray: ========================
2014-12-31 10:47:51:  0: STDOUT: [2014/Dec/31|10:47:51] V-Ray: Clearing exporter memory...
2014-12-31 10:47:51:  0: STDOUT: [2014/Dec/31|10:47:51] V-Ray: Total time clearing exporter memory  0h  0m  0.0s (0.0 s)
2014-12-31 10:47:51:  0: STDOUT: Render completed.
2014-12-31 10:47:51:  0: STDOUT: mel: READY FOR INPUT
2014-12-31 10:47:51:  0: INFO: Reading Extra Info before copy to network.
2014-12-31 10:47:51:  0: INFO: Scanline :: Moving output files and folders from C:/Users/scanlinevfx/AppData/Local/Thinkbox/Deadline7/slave/LAPRO0603/jobsData/5494bb6a969b590e98a82e54/mayaOutput_tempKCrhe0 to //inferno2/projects/gold/scenes/TST_000_0000/images/render3d/test/v0050_jha_testing_metadata/beauty_1920x1080x1_linear
2014-12-31 10:47:51:  0: INFO: Running Scanline renamer
2014-12-31 10:47:52:  0: INFO: Running Scanline 1004:1004
2014-12-31 10:47:52:  0: PYTHON: sSourceDir C:/Users/scanlinevfx/AppData/Local/Thinkbox/Deadline7/slave/LAPRO0603/jobsData/5494bb6a969b590e98a82e54/mayaOutput_tempKCrhe0
2014-12-31 10:47:52:  0: PYTHON: sDestinationDir //inferno2/projects/gold/scenes/TST_000_0000/images/render3d/test/v0050_jha_testing_metadata/beauty_1920x1080x1_linear
2014-12-31 10:47:52:  0: PYTHON: sRenderer vray
2014-12-31 10:47:52:  0: PYTHON: sVersion v0050
2014-12-31 10:47:52:  0: PYTHON: sResolutionColor 1920x1080x1_linear
2014-12-31 10:47:52:  0: PYTHON: sMakeJPG 
2014-12-31 10:47:52:  0: PYTHON: sIsStereo 
2014-12-31 10:47:52:  0: PYTHON: renameAndMoveRenders.Moving vray render.
2014-12-31 10:47:52:  0: PYTHON: moveVRay: Looking for images in C:/Users/scanlinevfx/AppData/Local/Thinkbox/Deadline7/slave/LAPRO0603/jobsData/5494bb6a969b590e98a82e54/mayaOutput_tempKCrhe0
2014-12-31 10:47:52:  0: PYTHON: moveVRay: Found Proxy folder C:/Users/scanlinevfx/AppData/Local/Thinkbox/Deadline7/slave/LAPRO0603/jobsData/5494bb6a969b590e98a82e54/mayaOutput_tempKCrhe0\proxy_v0050
2014-12-31 10:47:52:  0: PYTHON: moveVRay: The files in the proxy folder are exr.
2014-12-31 10:47:52:  0: PYTHON: moveVRay:  Converting C:/Users/scanlinevfx/AppData/Local/Thinkbox/Deadline7/slave/LAPRO0603/jobsData/5494bb6a969b590e98a82e54/mayaOutput_tempKCrhe0\proxy_v0050 to jpg
2014-12-31 10:47:52:  0: PYTHON: Files: [u'TST_000_0000_test_proxy_v0050.1004.exr']
2014-12-31 10:47:52:  0: PYTHON: Convert 001/1 : TST_000_0000_test_proxy_v0050.1004.exr
2014-12-31 10:47:52:  0: PYTHON: Checking to see if C:/Users/scanlinevfx/AppData/Local/Thinkbox/Deadline7/slave/LAPRO0603/jobsData/5494bb6a969b590e98a82e54/mayaOutput_tempKCrhe0/proxy_v0050/TST_000_0000_test_proxy_v0050.1004.jpg exists
2014-12-31 10:47:52:  0: PYTHON: False
2014-12-31 10:47:52:  0: PYTHON: Checking to see if C:/Users/scanlinevfx/AppData/Local/Thinkbox/Deadline7/slave/LAPRO0603/jobsData/5494bb6a969b590e98a82e54/mayaOutput_tempKCrhe0\proxy_v0050/TST_000_0000_test_proxy_v0050.1004.exr exists
2014-12-31 10:47:52:  0: PYTHON: True
2014-12-31 10:47:52:  0: PYTHON: Info:  convertImage:: OIIOCommand: //s2.scanlinevfxla.com/exchange/software/managed/tools/OpenImageIO/1.4.9/win64_vs2008/oiiotool.exe C:/Users/scanlinevfx/AppData/Local/Thinkbox/Deadline7/slave/LAPRO0603/jobsData/5494bb6a969b590e98a82e54/mayaOutput_tempKCrhe0\proxy_v0050/TST_000_0000_test_proxy_v0050.1004.exr  -ch R,G,B  -o C:/Users/scanlinevfx/AppData/Local/Thinkbox/Deadline7/slave/LAPRO0603/jobsData/5494bb6a969b590e98a82e54/mayaOutput_tempKCrhe0/proxy_v0050/TST_000_0000_test_proxy_v0050.1004.jpg
2014-12-31 10:47:54:  0: PYTHON: Info:  convertImage:: Return Code: 0
2014-12-31 10:47:54:  0: PYTHON: Info:  convertImage:: Output: 
2014-12-31 10:47:54:  0: PYTHON: Info:  convertImage:: Errors: 
2014-12-31 10:47:54:  0: PYTHON: moveVRay: Moving C:/Users/scanlinevfx/AppData/Local/Thinkbox/Deadline7/slave/LAPRO0603/jobsData/5494bb6a969b590e98a82e54/mayaOutput_tempKCrhe0\proxy_v0050\TST_000_0000_test_proxy_v0050.1004.jpg to //inferno2/projects/gold/scenes/TST_000_0000/images/render3d/test/v0050_jha_testing_metadata\proxy_1920x1080x1_look\TST_000_0000_test_proxy_v0050.1004.jpg
2014-12-31 10:47:54:  0: PYTHON: moveVRay: Found file! Going to move : TST_000_0000_test.1004.exr
2014-12-31 10:47:54:  0: PYTHON: moveVRay: Moving file C:/Users/scanlinevfx/AppData/Local/Thinkbox/Deadline7/slave/LAPRO0603/jobsData/5494bb6a969b590e98a82e54/mayaOutput_tempKCrhe0\TST_000_0000_test.1004.exr to //inferno2/projects/gold/scenes/TST_000_0000/images/render3d/test/v0050_jha_testing_metadata/beauty_1920x1080x1_linear\TST_000_0000_test_beauty_v0050.1004.exr
2014-12-31 10:47:54:  0: INFO: Rename ran successfully
2014-12-31 10:47:54:  0: Render time for frame(s): 5.329 m
2014-12-31 10:47:54:  0: Total time for task: 5.349 m
2014-12-31 10:47:54:  0: Saving task log...

=======================================================
Details
=======================================================
Date: 12/31/2014 10:47:54
Frames: 1004
Job Submit Date: 12/19/2014 15:57:30
Job User: jeff.harkness
Average RAM Usage: 379169920 (1%)
Peak RAM Usage: 1246924800 (2%)
Average CPU Usage: 2%
Peak CPU Usage: 5%
Used CPU Clocks: 148876
Total CPU Clocks: 7443751

=======================================================
Slave Information
=======================================================
Slave Name: LAPRO0603
Version: v7.0.1.1 R  (4468a9403)
Operating System: Windows 7 Professional (SP1)
Machine User: scanlinevfx
IP Address: ****
MAC Address: ****
CPU Architecture: x64
CPUs: 20
CPU Usage: 4%
Memory Usage: 5.0 GB / 64.0 GB (7%)
Free Disk Space: 745.078 GB (13.121 GB on C:\, 731.958 GB on S:\)
Video Card: Standard VGA Graphics Adapter
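
(For context, the convertImage step in that log just shells out to oiiotool; boiled down, it's roughly the following sketch, with placeholder paths, assuming oiiotool is on PATH rather than the pinned 1.4.9 build the log uses.)

# Roughly what the convertImage step above does: strip the EXR down to
# R,G,B and write a JPG with oiiotool. Paths are placeholders.
import subprocess

src = "TST_000_0000_test_proxy_v0050.1004.exr"
dst = "TST_000_0000_test_proxy_v0050.1004.jpg"

subprocess.check_call(["oiiotool", src, "-ch", "R,G,B", "-o", dst])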

I noticed this in the full slave log:

2014-12-31 10:53:51: 0: Total time for task: 5.403 m
2014-12-31 10:53:51: 0: Saving task log...
2014-12-31 10:53:52: Scheduler Thread - Render Thread 0's task "3_1004-1004" could not be found because task has been modified:
2014-12-31 10:53:52: current status = Rendering, new status = Queued
2014-12-31 10:53:52: current slave = LAPRO0601, new slave =
2014-12-31 10:53:52: current frames = 1004-1004, new frames = 1004-1004
2014-12-31 10:53:52: Scheduler Thread - Seconds before next job scan: 2
2014-12-31 10:53:54: Scheduler - Performing Job scan on Primary Pools with scheduling order Pool, Weighted, Balanced
2014-12-31 10:53:54: Scheduler - Successfully dequeued 1 task(s). Returning.
2014-12-31 10:53:54: 0: Shutdown
2014-12-31 10:53:54: 0: Shutdown
2014-12-31 10:53:54: 0: Exited ThreadMain(), cleaning up...
2014-12-31 10:53:54: 0: INFO: Ending Maya Job
2014-12-31 10:53:55: 0: Shutdown
2014-12-31 10:53:55: 0: INFO: Waiting for Maya to shut down
2014-12-31 10:53:55: 0: INFO: Maya has shut down
2014-12-31 10:53:55: 0: Stopped job: [GOLD] VRay Render: TST_000_0000_maya_animation_renderTest.ma resource: test version: v0050 user: jha
2014-12-31 10:53:55: 0: Unloading plugin: MayaBatch
2014-12-31 10:53:56: Scheduler - Returning limit stubs not in use.
2014-12-31 10:53:56: Scheduler - returning 5494bb6a969b590e98a82e54

Nothing is touching these jobs… Any ideas?
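
For reference, a quick way to count how many of those "task has been modified" notices each task gets across the slave logs (a rough sketch; the log path assumes a default Deadline 7 install on Windows, and the message text is taken from the excerpt above):

# Count how many times each task bounced back to queued, per slave logs.
import glob
import re
from collections import Counter

LOG_GLOB = r"C:\ProgramData\Thinkbox\Deadline7\logs\*.log"  # assumption: default log location

pattern = re.compile(r'task "([^"]+)" could not be found because task has been modified')
notices = Counter()

for path in glob.glob(LOG_GLOB):
    with open(path, errors="replace") as f:
        for line in f:
            m = pattern.search(line)
            if m:
                notices[m.group(1)] += 1

for task, count in notices.most_common():
    print(f"{task}: bounced back to queued {count} time(s)")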

Hmmm, I don’t think we changed anything in our requeued task detection…

The slaves definitely check one last time whether the task they've been working on has been requeued before they complete it; a rough sketch of what that check amounts to is below. What's your stalled-slave detection window, and your Slave info update interval?
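
(Purely to illustrate the idea; this is a conceptual sketch, not Deadline's actual code, and all the names are made up.)

# Before completing, the render thread re-reads the task from the repository
# and discards its result if anything (a requeue, stalled-slave detection,
# another slave) has modified the task in the meantime.
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    status: str
    slave: str

# Stand-in for the task record stored in the repository.
repository = {"3_1004-1004": Task("3_1004-1004", "Queued", "")}

def try_complete(local: Task) -> bool:
    current = repository[local.name]  # what the repository says right now
    if current.status != "Rendering" or current.slave != local.slave:
        print(f'task "{local.name}" could not be found because task has been modified:')
        print(f"current status = {local.status}, new status = {current.status}")
        print(f"current slave = {local.slave}, new slave = {current.slave}")
        return False  # result discarded; the task stays queued and renders again
    current.status = "Completed"
    return True

# The render thread still thinks it owns the task on LAPRO0601...
try_complete(Task("3_1004-1004", "Rendering", "LAPRO0601"))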

Given that this happened three times on a single job, I don’t think it would have been stalled slave detection, but you never know… And there were no requeue reports, right?

Also, was everything on the farm running 7.0.1.1?

The machines taking part in the rendering are all running 7.0.1.1. The entire farm isn't updated yet (~99% of it is), but all the other slaves either have no pools assigned or are disabled.

No requeue reports at all. For this job it happened on every task 2-4 times:

[Screenshot attachment: Capture.PNG]

Stalled slave detection is set to 15 minutes (these frames finish in around 5-6 minutes on average).
Slave info update is set to the default 20 seconds.
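
Just to sanity-check those numbers (assuming a slave is only flagged as stalled once it has gone the whole detection window without updating its info):

# Back-of-the-envelope check on the settings above.
from datetime import timedelta

stalled_window = timedelta(minutes=15)   # Stalled Slave Detection
info_interval  = timedelta(seconds=20)   # Slave info update interval
task_time      = timedelta(minutes=6)    # typical task length on this job

print(f"Missed updates before a slave is flagged: ~{stalled_window / info_interval:.0f}")
print(f"Task length {task_time} is well inside the {stalled_window} window,")
print("so stalled-slave detection shouldn't be requeueing these tasks.")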

There was a bug in 7.0.1.1 that caused this behavior, and it's been fixed in 7.0.1.3 (the current release, which was made available on December 30th). Upgrading to this newer version should solve the problem.

I know it’s a PITA, especially since you’ve already upgraded 99% of the farm, but once they’re all running this new version (including Pulse), you should be good.

Cheers,
Ryan

Thanks, will grab .3 ASAP!

What's the best way to get notified of new builds, btw? I didn't find anything on the forums or website about this new build.

We have our twitter feed:
twitter.com/thinkboxsoft

However, we should be making these announcements on the forums as well. We’ll make sure to do that going forward.

Cheers,
Ryan
