The DeletedJobReportEntries collection keeps track of the job reports that need to be purged from the database. The purge is handled by housecleaning, but it looks like it's getting backed up. Do you have a maximum set for the number of job reports that can be cleaned up per session (see the Housecleaning page in the Repository Options)?
The rate at which statistics are purged can be controlled on the Statistics Gathering page in the Repository Options.
By the way, I randomly checked a couple of log files from the log folders that are 1-2+ months old, and they weren't listed in this deleted-entries collection. Is that expected, or at least possible?
It's definitely possible because of the random nature of how housecleaning used to work (we've removed the randomness in 6.2.1). Maybe throttle it at 2000 for now and see how well it catches up.
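To illustrate what that per-session cap does, here is a minimal sketch of a throttled purge (this is not Deadline's actual implementation; the function and parameter names are hypothetical):

```python
def purge_deleted_report_entries(pending_ids, delete_fn, max_per_session=2000):
    """Delete at most max_per_session queued report entries per housecleaning pass.

    pending_ids: entry ids queued in the deleted-entries collection.
    delete_fn:   callback that performs the actual delete for one entry.
    Returns the ids left over for the next housecleaning pass.
    """
    batch = pending_ids[:max_per_session]
    for entry_id in batch:
        delete_fn(entry_id)
    return pending_ids[max_per_session:]
```

The point of the cap is that each housecleaning pass does a bounded amount of work and leaves the remainder for the next pass, so one huge backlog can't stall the rest of housecleaning.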
You might need to do something like disable the threshold temporarily, manually launch deadlinecommand -dohousecleaning on a random machine, and then re-enable the threshold once it starts processing the reports. Note that the machine you run it on will need 6.2.1 installed so that it is guaranteed to do the reports. That way, the one machine can keep chugging away at cleaning up those reports without blocking your regular housecleaning.
I am getting a lot of errors in the cleanup (I noticed errors like this in the regular Pulse log as well over the past couple of days):
Job Cleanup Scan - Warning: completed job "[EXO] RS_190_2060_v0811_str_TurbAllDirect_cache_flowline_LeadSpray_5 " could not be archived because: An unexpected error occurred while interacting with the database (deadline01.scanlinevfxla.com:27017,deadline.scanlinevfxla.com:27017,deadline03.scanlinevfxla.com:27017):
Value cannot be null.
Parameter name: document (FranticX.Database.DatabaseConnectionException)
Job Cleanup Scan - Warning: completed job "[EXO] RS_190_2060_v0796_str_MoreCascade_images_render3d_FL-SprayMistOnly_L_0 " could not be archived because: An unexpected error occurred while interacting with the database (deadline01.scanlinevfxla.com:27017,deadline.scanlinevfxla.com:27017,deadline03.scanlinevfxla.com:27017):
Value cannot be null.
Parameter name: document (FranticX.Database.DatabaseConnectionException)
Job Cleanup Scan - Warning: completed job "[EXO] RS_190_2060_v0780_str_Test_images_render3d_FL-LeadMist_L_0 " could not be archived because: An unexpected error occurred while interacting with the database (deadline01.scanlinevfxla.com:27017,deadline.scanlinevfxla.com:27017,deadline03.scanlinevfxla.com:27017):
Value cannot be null.
Parameter name: document (FranticX.Database.DatabaseConnectionException)
Job Cleanup Scan - Warning: completed job "[EXO] RS_190_2060_v0781_str_NoLarge_cache_flowline_LeadMist_0 " could not be archived because: The file '\\inferno2\deadline\repository6\jobsArchived\stephan.trojansky__3dsmax__[EXO] RS_190_2060_v0781_str_NoLarge_cache_flowline_LeadMist_0 __540cd87b02949841a89c8807.zip' already exists. (System.IO.IOException)
Job Cleanup Scan - Warning: completed job "[EXO] RS_190_2060_v0786_str_FullSmallBucket_cache_flowline_LeadSpray_14 " could not be archived because: An unexpected error occurred while interacting with the database (deadline01.scanlinevfxla.com:27017,deadline.scanlinevfxla.com:27017,deadline03.scanlinevfxla.com:27017):
Value cannot be null.
Parameter name: document (FranticX.Database.DatabaseConnectionException)
Job Cleanup Scan - Archived completed job "[EXO] RS_190_7030_v0014_lle_Add0_2_cache_flowline_OceanDif_0 " because Auto Job Cleanup is enabled and this job has been complete for more than 10 days.
Job Cleanup Scan - Warning: completed job "[EXO] RS_190_2060_v0798_str_ImportInstead_cache_flowline_LeadHappy_0 " could not be archived because: An unexpected error occurred while interacting with the database (deadline01.scanlinevfxla.com:27017,deadline.scanlinevfxla.com:27017,deadline03.scanlinevfxla.com:27017):
Value cannot be null.
Parameter name: document (FranticX.Database.DatabaseConnectionException)
Job Cleanup Scan - Warning: completed job "[EXO] RS_190_2060_v0786_str_FullSmallBucket_cache_flowline_LeadSpray_4 " could not be archived because: An unexpected error occurred while interacting with the database (deadline01.scanlinevfxla.com:27017,deadline.scanlinevfxla.com:27017,deadline03.scanlinevfxla.com:27017):
Value cannot be null.
Parameter name: document (FranticX.Database.DatabaseConnectionException)
Job Cleanup Scan - Warning: completed job "[EXO] RS_190_2060_v0781_str_NoLarge_cache_flowline_LeadSpray_0 " could not be archived because: An unexpected error occurred while interacting with the database (deadline01.scanlinevfxla.com:27017,deadline.scanlinevfxla.com:27017,deadline03.scanlinevfxla.com:27017):
Value cannot be null.
Parameter name: document (FranticX.Database.DatabaseConnectionException)
Job Cleanup Scan - Warning: completed job "[EXO] RS_190_2060_v0786_str_FullSmallBucket_cache_flowline_LeadSpray_3 " could not be archived because: The file '\\inferno2\deadline\repository6\jobsArchived\stephan.trojansky__3dsmax__[EXO] RS_190_2060_v0786_str_FullSmallBucket_cache_flowline_LeadSpray_3 __540ce8db0294983f10e5fe5b.zip' already exists. (System.IO.IOException)
Job Cleanup Scan - Archived completed job "[EXO] RS_190_2060_v0770_str_FixedWaveDisp_images_render3d_FL-FoamCaps_L_0 " because Auto Job Cleanup is enabled and this job has been complete for more than 10 days.
Job Cleanup Scan - Warning: completed job "[EXO] RS_190_7030_v0013_lle_Add0_5_images_render3d_FL-OceanDif_0 " could not be archived because: An unexpected error occurred while interacting with the database (deadline01.scanlinevfxla.com:27017,deadline.scanlinevfxla.com:27017,deadline03.scanlinevfxla.com:27017):
Value cannot be null.
Parameter name: document (FranticX.Database.DatabaseConnectionException)
Job Cleanup Scan - Warning: completed job "[EXO] RS_190_2060_v0791_str_NewAttempt3_cache_flowline_MistFine_0 " could not be archived because: The process cannot access the file '\\inferno2\deadline\repository6\jobsArchived\stephan.trojansky__3dsmax__[EXO] RS_190_2060_v0791_str_NewAttempt3_cache_flowline_MistFine_0 __540d10190294984c34811045.zip' because it is being used by another process. (System.IO.IOException)
Job Cleanup Scan - Archived completed job "[EXO] RS_190_2060_v0792_str_Safety_cache_flowline_LeadHappy_1 " because Auto Job Cleanup is enabled and this job has been complete for more than 10 days.
Job Cleanup Scan - Warning: completed job "[EXO] RS_190_2060_v0785_str_WithCollision_images_render3d_FL-LeadSpray_L_0 " could not be archived because: An unexpected error occurred while interacting with the database (deadline01.scanlinevfxla.com:27017,deadline.scanlinevfxla.com:27017,deadline03.scanlinevfxla.com:27017):
Value cannot be null.
Parameter name: document (FranticX.Database.DatabaseConnectionException)
Job Cleanup Scan - Warning: completed job "[EXO] RS_190_2060_v0786_str_FullSmallBucket_cache_flowline_LeadSpray_2 " could not be archived because: The file '\\inferno2\deadline\repository6\jobsArchived\stephan.trojansky__3dsmax__[EXO] RS_190_2060_v0786_str_FullSmallBucket_cache_flowline_LeadSpray_2 __540ce8d90294981f481e0bb9.zip' already exists. (System.IO.IOException)
Job Cleanup Scan - Warning: completed job "[EXO] RS_190_2060_v0791_str_NewAttempt3_cache_flowline_Spray_3 " could not be archived because: The file '\\inferno2\deadline\repository6\jobsArchived\stephan.trojansky__3dsmax__[EXO] RS_190_2060_v0791_str_NewAttempt3_cache_flowline_Spray_3 __540d100b02949816a85609f3.zip' already exists. (System.IO.IOException)
Job Cleanup Scan - Warning: completed job "[EXO] RS_190_2060_v0813_str_Update_cache_flowline_WaveFoam_0 " could not be archived because: The file '\\inferno2\deadline\repository6\jobsArchived\stephan.trojansky__3dsmax__[EXO] RS_190_2060_v0813_str_Update_cache_flowline_WaveFoam_0 __540d60020294983d10fa16f3.zip' already exists. (System.IO.IOException)
Job Cleanup Scan - Warning: completed job "[EXO] RS_190_2060_v0786_str_FullSmallBucket_cache_flowline_LeadSpray_1 " could not be archived because: An unexpected error occurred while interacting with the database (deadline01.scanlinevfxla.com:27017,deadline.scanlinevfxla.com:27017,deadline03.scanlinevfxla.com:27017):
Value cannot be null.
Parameter name: document (FranticX.Database.DatabaseConnectionException)
Job Cleanup Scan - Archived completed job "[EXO] RS_190_2060_v0784_str_BetterSplash_cache_flowline_LeadMist_0 " because Auto Job Cleanup is enabled and this job has been complete for more than 10 days.
Job Cleanup Scan - Warning: completed job "[EXO] RS_190_2060_v0784_str_BetterSplash_cache_flowline_LeadSpray_0 " could not be archived because: An unexpected error occurred while interacting with the database (deadline01.scanlinevfxla.com:27017,deadline.scanlinevfxla.com:27017,deadline03.scanlinevfxla.com:27017):
Value cannot be null.
Parameter name: document (FranticX.Database.DatabaseConnectionException)
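For what it's worth, the warnings in that excerpt fall into three buckets: database "Value cannot be null" errors, archive zips that already exist, and a file locked by another process. A quick sketch like this could tally them (the classifier just keys on phrases from the excerpt above, so it's only as good as that log format):

```python
from collections import Counter

def tally_cleanup_errors(log_text):
    """Count 'Job Cleanup Scan' warnings by failure type."""
    counts = Counter()
    for line in log_text.splitlines():
        if "could not be archived because" not in line:
            continue  # skip archived-OK lines and exception detail lines
        if "interacting with the database" in line or "DatabaseConnectionException" in line:
            counts["database"] += 1
        elif "already exists" in line:
            counts["zip already exists"] += 1
        elif "being used by another process" in line:
            counts["file in use"] += 1
        else:
            counts["other"] += 1
    return counts
```

Running it over a full Pulse log would show whether the database errors cluster in time (suggesting connection trouble) or are spread evenly (suggesting per-job corruption).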
Could you maybe archive one or two of the jobs you're getting those errors for and post them here (or email them to us directly if they contain private information)? I'm guessing there is some corruption in the job object, and a database repair might resolve it, but it would be good to get them over here so we can try to reproduce the error and get a better understanding of what the problem is.
We'll be tweaking that error message in the next build so that where it currently says "unexpected exception", it shows the stack trace and exception type. At the very least, that can point to the code that is causing the error in the first place.
It seems that running the cleanup on another machine with no limits doesn't really work. The command appears to re-query the limits from the repository as it goes, so after running for 1-2+ hours, when it finally gets to the report file deletion, it picks up the (by then) reduced limit instead of the unlimited value that was set when the executable was started.
Here’s a deadline.dll that you can drop into the bin folder on the machine you’re running the housecleaning command from. Keep in mind that this has only been tested against 6.2.1.39 (beta 2). We’ve updated the DoHouseCleaning command to accept a third optional argument, which indicates the mode you want to use. Here are the new usage instructions:
DoHouseCleaning
Performs house cleaning operations
[Random Multiplier] Optional (deprecated, just specify 0).
[Verbose] Optional. If logging should be enabled
(true/false).
[Mode] Optional. If not specified, all housecleaning
operations will be performed. Available modes
are: CleanupCompletedJobs,
DeleteUnsubmittedJobs, PurgeDeletedJobs,
PurgeOldAuxFiles, PurgeOldJobReports,
FindOrphanedTasks, CleanupObsoleteSlaves,
PurgeOldSlaveReports, PurgeJobMachineLimits,
FindOrphanedLimitStubs, CleanupRepoValidations,
PurgeOldStatistics, CleanupDatabase,
FindStalledSlaves
Also, this build will now show a stack trace for unexpected MongoDB exceptions, like the ones you were seeing during the job cleanup operation. So if you are still seeing those errors, post the new message, and if possible, could you export the JSON for the job directly from the database and send it to us? Then we could import it directly into our database and see if we can reproduce the error.
Thanks Ryan, I'll give this a go right now! Does it get stopped by the housecleaning lock file? I had to stop Pulse, kill the processes, and delete the lock file to be able to run it from another machine, then restart Pulse.
Doing it in batches of 5000, I can see the number slowly decreasing. It takes between 2-4 minutes to clean out 5000 reports, so hopefully it gets cleared out within the next 2-3 days.
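A rough back-of-the-envelope for that rate (the backlog figure below is a made-up placeholder, and this deliberately ignores new reports generated while the purge runs, which is exactly the complication described next):

```python
def hours_to_clear(backlog, batch_size=5000, minutes_per_batch=(2, 4)):
    """Estimate best/worst-case hours to purge a report backlog in fixed-size batches."""
    batches = -(-backlog // batch_size)  # ceiling division
    lo, hi = minutes_per_batch
    return batches * lo / 60, batches * hi / 60

# e.g. a purely hypothetical backlog of 1,000,000 reports:
low, high = hours_to_clear(1_000_000)  # 200 batches at 2-4 min each
```

At 2-4 minutes per 5000-report batch, a million reports is on the order of 7-14 hours of continuous purging, so a multi-day clear implies either a much larger backlog or new reports arriving faster than they are removed.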
I think the problem is that a housecleaning cycle can take 20-30 minutes, and in that time we generate tens of thousands of reports. But if we set the cleanup threshold too high, housecleaning takes that much longer, and it becomes slower to notice stalled slaves, etc. (which is already a problem).
In 7, you split these into separate threads, right? The repair, the housecleaning, and the pending job scan?