
tips for speeding up housecleaning?

Our housecleaning process can take excessively long, especially the “Purging Deleted Jobs” phase.

Can you guys give us any info on what exactly is being done in that step, and how we could potentially improve its speed? I’m guessing that’s when logs/reports etc. are being deleted from the repository, but is there anything else?

Have you done any tests connecting the pulse server to an accelerator node instead of a regular cluster node, and does that affect performance? Or with different network connectivity setups / mount settings?

We are throttling the deletion to a maximum of 1000 jobs at a time to make sure no other housecleaning process is starved for too long, but it still takes 1-2 hours or more:

This was relatively ‘quick’:

2018-03-07 23:36:19: Deleted Job Scan - Loaded 18294 deleted jobs in 15.860 s
2018-03-07 23:36:19: Deleted Job Scan - Purging a maximum of 1000 deleted Jobs
2018-03-07 23:36:19: Deleted Job Scan - Purging deleted job '5a954b7f34e0902a006eb92b' because it was deleted over 12 hour(s) ago
2018-03-07 23:36:20: Deleted Job Scan - Purging deleted job '5a94825d3db75a87d02f7a96' because it was deleted over 12 hour(s) ago
2018-03-07 23:36:22: Deleted Job Scan - Purging deleted job '5a9c588634754936041fa2f0' because it was deleted over 12 hour(s) ago
2018-03-07 23:36:22: Deleted Job Scan - Purging deleted job '5a8e8798eb60813a3040698c' because it was deleted over 12 hour(s) ago

2018-03-08 00:17:55: Group Name Count Average Time Total Time
2018-03-08 00:17:55: ConsoleCommandInvoke Collect Types 2 0.0235 0.047
2018-03-08 00:17:55: ConsoleCommandInvoke Find Type 2 0.0005 0.001
2018-03-08 00:17:55: ConsoleCommandInvoke Create Type 2 0 0
2018-03-08 00:17:55: ------------------ ---------------------------------------- ------------ ------------ ------------
2018-03-08 00:17:55: GROUP TOTAL 6 0.008 0.048
2018-03-08 00:17:55: House Cleaning Purge Deleted Jobs 1 2510.16 2510.16
2018-03-08 00:17:55: House Cleaning Job Cleanup Scan 1 14.356 14.356
2018-03-08 00:17:55: House Cleaning Purge Old Slave Reports 1 0.369 0.369
2018-03-08 00:17:55: House Cleaning Purge Database 1 0.177 0.177
2018-03-08 00:17:55: House Cleaning Purge Old Statistics 1 0.172 0.172
2018-03-08 00:17:55: House Cleaning Purge Timed Out Slaves In Throttle Queue 1 0.119 0.119
2018-03-08 00:17:55: House Cleaning Delete Unsubmitted Jobs 1 0.036 0.036
2018-03-08 00:17:55: House Cleaning Purge Obsolete Slaves 1 0.002 0.002
2018-03-08 00:17:55: House Cleaning Purge Expired Remote Commands 1 0.002 0.002
2018-03-08 00:17:55: ------------------ ---------------------------------------- ------------ ------------ ------------
2018-03-08 00:17:55: GROUP TOTAL 9 280.599 2525.4

The one after took around 1h40m+:

2018-03-08 00:23:28: Purging Deleted Jobs
2018-03-08 00:23:28: Deleted Job Scan - Loading deleted jobs
2018-03-08 00:23:45: Deleted Job Scan - Loaded 17396 deleted jobs in 16.648 s
2018-03-08 00:23:45: Deleted Job Scan - Purging a maximum of 1000 deleted Jobs
2018-03-08 00:23:45: Deleted Job Scan - Purging deleted job '5a99a76dac69e21dd019ad8b' because it was deleted over 12 hour(s) ago
2018-03-08 00:23:50: Deleted Job Scan - Purging deleted job '5a9b5a32b352d797acfcc6c4' because it was deleted over 12 hour(s) ago

2018-03-08 02:13:34: Deleted Job Scan - Purging deleted job '5a95f2a0be27015240f02620' because it was deleted over 12 hour(s) ago
2018-03-08 02:13:39: Deleted Job Scan - Purging deleted job '5a956d01d0e515d2281bb717' because it was deleted over 12 hour(s) ago
2018-03-08 02:13:46: Deleted Job Scan - Purging deleted job '5a95f7037a7f2428b42e8cb8' because it was deleted over 12 hour(s) ago
2018-03-08 02:13:47: Deleted Job Scan - Purging deleted job '5a9b26b1b352d758c445c5e9' because it was deleted over 12 hour(s) ago
2018-03-08 02:14:01: Deleted Job Scan - Purged 1000 deleted jobs in 1.838 hrs
2018-03-08 02:14:01: Deleted Job Scan - Done.
2018-03-08 02:14:01: Purging Obsolete Slaves


2018-03-08 02:14:02: Profiling Section:
2018-03-08 02:14:02: Group Name Count Average Time Total Time
2018-03-08 02:14:02: ConsoleCommandInvoke Collect Types 2 0.0245 0.049
2018-03-08 02:14:02: ConsoleCommandInvoke Find Type 2 0 0
2018-03-08 02:14:02: ConsoleCommandInvoke Create Type 2 0 0
2018-03-08 02:14:02: ------------------ ---------------------------------------- ------------ ------------ ------------
2018-03-08 02:14:02: GROUP TOTAL 6 0.00816667 0.049
2018-03-08 02:14:02: House Cleaning Purge Deleted Jobs 1 6632.58 6632.58
2018-03-08 02:14:02: House Cleaning Job Cleanup Scan 1 15.188 15.188
2018-03-08 02:14:02: House Cleaning Purge Old Statistics 1 0.574 0.574
2018-03-08 02:14:02: House Cleaning Purge Old Slave Reports 1 0.341 0.341
2018-03-08 02:14:02: House Cleaning Purge Timed Out Slaves In Throttle Queue 1 0.225 0.225
2018-03-08 02:14:02: House Cleaning Purge Database 1 0.186 0.186
2018-03-08 02:14:02: House Cleaning Delete Unsubmitted Jobs 1 0.036 0.036
2018-03-08 02:14:02: House Cleaning Purge Obsolete Slaves 1 0.001 0.001
2018-03-08 02:14:02: House Cleaning Purge Expired Remote Commands 1 0.001 0.001
2018-03-08 02:14:02: ------------------ ---------------------------------------- ------------ ------------ ------------
2018-03-08 02:14:02: GROUP TOTAL 9 738.792 6649.13

cheers
laszlo

We re-engineered job reports a while back because we used to do one delete call per job report, and that was hosing people’s storage arrays IIRC.

Is this being slow with Deadline 8 or Deadline 10?

I did some digging here and it looks like this is the algorithm:

For each job in deleted jobs:
1. Run PrePurgeCleanup
2. Add it to a list of jobs to purge (this is where your slowdown is happening)
3. If our list has hit the max to clean out, short-circuit the loop
Then mass-call OnJobsPurged and mass-delete the jobs from the DB (roughly as sketched below).
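
In rough Python-ish terms it's something like this (just a sketch; the helper names are stand-ins, not the actual Deadline internals, and the 1000 matches the per-pass throttle mentioned above):

MAX_JOBS_PER_PASS = 1000  # per-pass purge throttle

def purge_deleted_jobs(deleted_jobs, pre_purge_cleanup, on_jobs_purged, delete_jobs_from_db):
    to_purge = []
    for job in deleted_jobs:
        pre_purge_cleanup(job)           # 1. per-job cleanup (reports, aux files on the repository share)
        to_purge.append(job)             # 2. queue it for the bulk purge below
        if len(to_purge) >= MAX_JOBS_PER_PASS:
            break                        # 3. short-circuit at the per-pass cap
    # After the loop: one bulk callback and one bulk database delete for the whole batch.
    on_jobs_purged(to_purge)
    delete_jobs_from_db(to_purge)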

It looks like PrePurgeCleanup does delete the job reports and aux files, so that’s likely what you’re waiting for. I’m curious about the Deadline version you’re on, but can you see if we’re still calling individual deletes on the bz2 files in the reports folder? We should be doing single directory deletes.
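
To illustrate the difference being asked about, in rough Python terms (the path layout is hypothetical and Deadline itself isn't written in Python; this is only to show the two access patterns):

import os
import shutil

def purge_reports_per_file(job_reports_dir):
    # Old behaviour: the application issues one delete call per .bz2 report file.
    for name in os.listdir(job_reports_dir):
        if name.endswith(".bz2"):
            os.remove(os.path.join(job_reports_dir, name))

def purge_reports_as_directory(job_reports_dir):
    # Single directory delete from the application's side (the filesystem still
    # unlinks each entry underneath, but the app makes only one call).
    shutil.rmtree(job_reports_dir, ignore_errors=True)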

Also, I wonder how bad an idea it would be to have a thread pool call these delete operations. On a storage cluster I assume it would be alright, but on file servers with single logical disks / arrays it may not be a good plan.
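
Purely as a sketch of that idea (not anything Deadline does today; the pool size and the per-job-folder rmtree are my assumptions):

import shutil
from concurrent.futures import ThreadPoolExecutor

def purge_report_dirs(report_dirs, max_workers=8):
    # Delete job report folders in parallel. Probably fine on a storage cluster,
    # but on a single logical disk/array it may just cause seek thrash.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(shutil.rmtree, d, ignore_errors=True) for d in report_dirs]
        for f in futures:
            f.result()  # re-raise any unexpected errors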

Most of the farm is on 10.0.11.1, but the pulse boxes are using 10.0.12.0.

Well, at least the folder deletes are happening then.

Is there a way to view performance metrics on delete calls on the storage side? I’d hate to think we changed around how we do the log report storage and it had no effect.
