Hi there,
I think the output from the script dependency handling is not printed to the Pulse log as it progresses; instead, the whole result is written there only at the very end.
We have a couple dozen jobs with a few hundred frames each, and handling the script dependencies seems to take about 10 minutes or so (boo, so much sadness; this is still the testing phase, not even a full production load). BUT in the log, all the dependency checks are printed with the same timestamp. So it looks like Pulse hangs for 10 minutes, and then I get 20,000 lines of dependency printouts stamped within the same 1-2 seconds.
Any ideas on how to improve this?
Also, would it be possible to separate the dependency handling from the housecleaning? We might want to do the dependency checking fairly often.
cheers,
laszlo
Basically, if I trigger a manual housecleaning, it takes about 30 seconds before I see anything in the log, along the lines of this:
2013-10-03 15:30:37: Connecting to Pulse log   <-- I connected to the log here to see what's happening
Somewhere around here, I triggered a manual cleanup.
2013-10-03 15:31:40: Power Management - Thermal Shutdown: Skipping zone “Laszlo” because it is disabled
2013-10-03 15:31:40: Power Management - Thermal Shutdown: Skipping zone “AnimatorWorkstations” because it is disabled
2013-10-03 15:31:40: Power Management - Thermal Shutdown: Skipping zone “Slaves” because it is disabled
2013-10-03 15:31:40: Power Management - Idle Shutdown: Skipping idle shutdown group “Laszlo” because it is disabled
2013-10-03 15:31:40: Power Management - Idle Shutdown: Skipping idle shutdown group “AnimatorWorkstations” because it is disabled
2013-10-03 15:31:40: Power Management - Idle Shutdown: Skipping idle shutdown group “Slaves” because it is disabled
2013-10-03 15:31:40: Power Management - Machine Startup: There are no slaves that need to be woken up at this time
2013-10-03 15:31:40: Power Management - Machine Restart: Skipping machine group “Laszlo” because it is disabled
2013-10-03 15:31:40: Power Management - Machine Restart: Skipping machine group “AnimatorWorkstations” because it is disabled
2013-10-03 15:31:40: Power Management - Machine Restart: Skipping machine group “Slaves” because it is disabled
2013-10-03 15:31:40: Power Management - Slave Scheduling: Skipping scheduling group “Laszlo” because it is disabled
2013-10-03 15:31:40: Power Management - Slave Scheduling: Skipping scheduling group “AnimatorWorkstations” because it is disabled
2013-10-03 15:31:40: Power Management - Slave Scheduling: Skipping scheduling group “Slaves” because it is disabled
2013-10-03 15:33:13: Server Thread - Auto Config: Received packet on autoconfig port
2013-10-03 15:33:13: Server Thread - Auto Config: Picking configuration based on:
2013-10-03 15:33:13: Server Thread - lapro2090
2013-10-03 15:33:13: Server Thread - ::ffff:172.18.8.2
2013-10-03 15:33:13: Server Thread - Auto Config: No config worth sending
2013-10-03 15:34:00: Clean Up Thread - Performing house cleaning   <-- It finally started!
2013-10-03 15:36:48: Power Management - Thermal Shutdown: Skipping zone “Laszlo” because it is disabled
2013-10-03 15:36:48: Power Management - Thermal Shutdown: Skipping zone “AnimatorWorkstations” because it is disabled
2013-10-03 15:36:48: Power Management - Thermal Shutdown: Skipping zone “Slaves” because it is disabled
2013-10-03 15:36:48: Power Management - Idle Shutdown: Skipping idle shutdown group “Laszlo” because it is disabled
2013-10-03 15:36:48: Power Management - Idle Shutdown: Skipping idle shutdown group “AnimatorWorkstations” because it is disabled
2013-10-03 15:36:48: Power Management - Idle Shutdown: Skipping idle shutdown group “Slaves” because it is disabled
2013-10-03 15:36:48: Power Management - Machine Startup: There are no slaves that need to be woken up at this time
2013-10-03 15:36:48: Power Management - Machine Restart: Skipping machine group “Laszlo” because it is disabled
2013-10-03 15:36:48: Power Management - Machine Restart: Skipping machine group “AnimatorWorkstations” because it is disabled
2013-10-03 15:36:48: Power Management - Machine Restart: Skipping machine group “Slaves” because it is disabled
2013-10-03 15:36:48: Power Management - Slave Scheduling: Skipping scheduling group “Laszlo” because it is disabled
2013-10-03 15:36:48: Power Management - Slave Scheduling: Skipping scheduling group “AnimatorWorkstations” because it is disabled
2013-10-03 15:36:48: Power Management - Slave Scheduling: Skipping scheduling group “Slaves” because it is disabled
2013-10-03 15:37:13: Performing Job Repository Scan…   <-- 4 minutes later, it's still only scanning stuff
2013-10-03 15:37:13: Loading jobs
2013-10-03 15:37:13: Scanning jobs
2013-10-03 15:37:13: Archived completed job “[VULCAN] Nuke Render: CE_100_0175_pcomp_001_v0001_mla.nk” because Auto Job Cleanup is enabled and this job has been complete for more than 3 days.
And now, 7 minutes after I triggered the cleanup, the dependency check started:
2013-10-03 15:37:13: CheckPathMapping: Swapped “//s2/exchange/software/managed/deadline/scriptDependency/scriptDependency.py” with “/mnt/s2/exchange/software/managed/deadline/scriptDependency/scriptDependency.py”
2013-10-03 15:37:13: Executing dependency script: /mnt/s2/exchange/software/managed/deadline/scriptDependency/scriptDependency.py
2013-10-03 15:37:13: The current PYTHONHOME environment variable will be modified to avoid potential Python conflicts
2013-10-03 15:37:13: The current PYTHONPATH environment variable will be modified to avoid potential Python conflicts
2013-10-03 15:37:13: Running Scanline scriptDependency checker v0.1 for JobID: 524b5a9e7a3a9e12ac2306ba TaskID: 0
2013-10-03 15:37:13: TaskList count: 2
2013-10-03 15:37:13: Task Frames: 1030-1034
2013-10-03 15:37:13: DeadlineDependencyCheck called /mnt/isila/deadline/repository6/jobs/524b5a9e7a3a9e12ac2306ba/dependencies.xml
2013-10-03 15:37:13: Dep: name: Flow path: /mnt/flspace/CL/vulcan/PP_089_0020/cache/flowline/Flow/v0015_lse_test/Flow type: Flowline start: 1031 end: 1035 nth: 1
2013-10-03 15:37:13: Flowline dependency handling
2013-10-03 15:37:13: checkFileDependencies globbing: /mnt/flspace/CL/vulcan/PP_089_0020/cache/flowline/Flow/v0015_lse_test/Flow/*/01031.00f/
2013-10-03 15:37:13: Dependency script returned False: /mnt/s2/exchange/software/managed/deadline/scriptDependency/scriptDependency.py
… here come about 20,000 lines of log output related to the dependency checking …
2013-10-03 15:37:13: Flowline dependency handling
2013-10-03 15:37:13: checkFileDependencies globbing: /mnt/flspace/CL/nemo/HB_118_0950/cache/flowline/Mist/v0021_str_deadline/Mist/*/01126.00f/
2013-10-03 15:37:13: Dependency script returned False: /mnt/s2/exchange/software/managed/deadline/scriptDependency/scriptDependency.py
2013-10-03 15:37:13: Released pending task 3 with frame(s) 954-954 for job “HB_118_0950_v0021_str_deadline_images_render3d_FL-Beauty3dFluid_0” because the frames it depends on have finished and/or its required assets are available.
2013-10-03 15:37:13: Released pending task 4 with frame(s) 955-955 for job “HB_118_0950_v0021_str_deadline_images_render3d_FL-Beauty3dFluid_0” because the frames it depends on have finished and/or its required assets are available.
2013-10-03 15:37:13: Released pending task 5 with frame(s) 956-956 for job “HB_118_0950_v0021_str_deadline_images_render3d_FL-Beauty3dFluid_0” because the frames it depends on have finished and/or its required assets are available.
2013-10-03 15:37:13: Cleaning up orphaned tasks
2013-10-03 15:37:13: Done.
All in all, a single sweep took about 7-8 minutes. Maybe I'm wrong and the dependency check itself takes very little time, but it's odd that all the log lines are stamped with the exact same second; I would expect it to be a little slower. If it is indeed that fast, then separating it into its own thread with a separate interval parameter would be best: the slow housecleaning could run as long as it wants, and the super fast dependency checking could be triggered every minute.
Yup, this was a result of us moving housecleaning to a separate process and not streaming the stdout from it properly. We'll get this fixed for beta 7. We'll also change the Monitor to run it in a background process so that you can still use the Monitor while housecleaning is in progress.
I’m wondering if the slowdown is a result of handling all this output at the end, rather than printing it out on the fly. You’ll have to test with beta 7 and let us know if the housecleaning speed is improved or not.
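To picture the fix: the difference is essentially streaming the child process's output as it arrives, rather than collecting it all and dumping it when the process exits. Here's a minimal sketch of the streaming side (not Deadline's actual code; the housecleaning_stub.py script name is a stand-in):

```python
import subprocess
import sys

# Spawn the child process; in the real system this would be the separate
# housecleaning process that Pulse now launches.
proc = subprocess.Popen(
    [sys.executable, "housecleaning_stub.py"],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    universal_newlines=True,
    bufsize=1,  # line-buffered text mode
)

# Log each line as soon as the child emits it, so log timestamps reflect
# when events actually happened rather than when the process exited.
for line in proc.stdout:
    print(line.rstrip())
proc.wait()
```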
One thing that might improve performance in general is to change the way script dependency checking works when a job is frame dependent. Currently, the script gets executed for each pending task of a job, so if a job has 500 pending tasks, that's 500 script executions for just one job. Right now, we pass the job ID and task ID to the script, and the script returns True or False to indicate whether the task should be released. What if we instead called the script once per job, passed it the job ID and a list of pending task IDs, and had the script return a list of task IDs for the tasks that should be released? I think this would definitely help performance, and I imagine it shouldn't complicate your current dependency scripts.
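For reference, the current per-task contract amounts to roughly this (a sketch only: the __main__ entry-point name follows the "main function" wording used below, and the glob pattern is an invented placeholder in the spirit of the checkFileDependencies lines in the log above):

```python
import glob

def __main__(jobID, taskID):
    # Current scheme: invoked once per pending task, so a job with 500
    # pending tasks means 500 separate script executions.
    # Return True to release the task, False to keep it pending.
    cachePattern = "/mnt/cache/%s/task_%s/*.00f" % (jobID, taskID)  # placeholder
    return len(glob.glob(cachePattern)) > 0
```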
Cheers,
I will try removing the debug prints, and see if that improves performance.
Yeah, I bet that would be faster! How would it be handled if Pulse goes down? Would the slaves get into race conditions over who is checking which job/task? Or is the dependency check lock-protected so that it only runs on a single machine at a time?
We’ll be making this change for beta 7. Here’s a summary of the current changes:
- Dependency scripts for jobs that aren't frame based will only accept one parameter in the main function, which represents the job ID (string). This main function should return True if the job can start, or False if it cannot.
- Dependency scripts for jobs that ARE frame based will accept two parameters in the main function: the job ID (string) and a list of task IDs (all strings). This main function should return a list of strings representing the IDs of the tasks that can start. If this function simply returns the second parameter as is, then all the tasks will be resumed. (A sketch of this form follows below.)
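Here's a minimal sketch of a frame-dependent script under the new scheme. Only the parameters and return type come from the summary above; the __main__ entry-point name follows the "main function" convention, and the marker-file check and path are invented placeholders:

```python
import os

def __main__(jobID, taskIDs):
    # New frame-dependent form: called once per job with every pending
    # task ID. Return the subset of task IDs whose dependencies are met;
    # returning taskIDs unchanged would release everything.
    released = []
    for taskID in taskIDs:
        # Placeholder check: stand-in for whatever asset each task waits on.
        markerPath = "/mnt/cache/%s/%s.done" % (jobID, taskID)
        if os.path.exists(markerPath):
            released.append(taskID)
    return released
```

The non-frame-based form would just be __main__(jobID) returning True or False, as described above.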
We tested this here and it definitely makes a difference.
There is a lock which wraps around the Job Repository Scan portion of the housecleaning, so you shouldn’t run into any race conditions.
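To illustrate the general idea (this is not Deadline's actual mechanism, and the lock path is invented): a lock like this can be as simple as an atomically created file in the repository that only one machine can hold at a time:

```python
import errno
import os
import socket

LOCK_PATH = "/mnt/repository/housecleaning.lock"  # invented path

def try_acquire_scan_lock():
    """Return True if this machine now owns the scan lock."""
    try:
        # O_CREAT | O_EXCL makes creation atomic: exactly one machine wins.
        fd = os.open(LOCK_PATH, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.write(fd, socket.gethostname().encode())
        os.close(fd)
        return True
    except OSError as e:
        if e.errno == errno.EEXIST:
            return False  # another machine is already running the scan
        raise

def release_scan_lock():
    os.remove(LOCK_PATH)
```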
Do you think it would eventually be possible to separate the dependency handling from the housecleaning, so we could run it on a different interval?
If after testing beta 7 you still think it’s necessary, we can definitely look at it.
Out of curiosity, what do you have the Repository Check Interval (under Pulse Settings in the Repository Options) set to? This can always be adjusted for quicker scans.
Repository Cleanup Interval is set to 60
Power Management Check Interval is set to 300
It doesn't seem like it's able to cycle once a minute though, as it takes 7-9 minutes just to do one check. I would almost want each job to be checked in a separate thread at the same time (maybe even from multiple machines). I'll message again after we get a chance to try beta 7 though, to see how much that improves the situation.