
document count in mongo

Could you guys explain what these entries are?

DeletedJobReportEntries
namespace: deadline6db.DeletedJobReportEntries
objects: 5438375
average object size: 750.0 Bytes
data size: 3.8 GB
storage size: 10.2 GB
extents: 25
last extent size: 2.0 GB
padding factor: 1
system flags: HasIdIndex
user flags: None
indexes: 1
total index size: 350.2 MB
index sizes:
_id_: 350.159 MB

JobStatistics
namespace: deadline6db.JobStatistics
objects: 1127424
average object size: 1.4 KB
data size: 1.5 GB
storage size: 2.0 GB
extents: 20
last extent size: 534.5 MB
padding factor: 1
system flags: HasIdIndex
user flags: None
indexes: 2
total index size: 95.7 MB
index sizes:
_id_: 65.840 MB
EntryTime_1: 29.895 MB

SlaveStatistics
namespace: deadline6db.SlaveStatistics
objects: 7494674
average object size: 167.0 Bytes
data size: 1.2 GB
storage size: 2.0 GB
extents: 20
last extent size: 534.5 MB
padding factor: 1
system flags: HasIdIndex
user flags: None
indexes: 2
total index size: 781.5 MB
index sizes:
_id_: 482.501 MB
EntryTime_1: 298.970 MB

The DeletedJobReportEntries collection keeps track of the job reports that need to be purged from the database. This is handled by housecleaning, but it looks like it’s getting backed up. Do you have a maximum number of job reports that can be cleaned up per session set (see the Housecleaning page in the Repository Options)?
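If you want to sanity-check the backlog directly, a listing like the one you posted looks like mongo-shell stats() output, so something along these lines should reproduce it and let you re-check the count as housecleaning works through it (a rough sketch; the host name is just one of your database servers from later in this thread, and the db name assumes a default Deadline 6 setup):

    # Dump the collection stats / backlog count from the mongo shell.
    mongo --host deadline01.scanlinevfxla.com deadline6db --eval "printjson(db.DeletedJobReportEntries.stats())"
    mongo --host deadline01.scanlinevfxla.com deadline6db --eval "print(db.DeletedJobReportEntries.count())"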

The rate at which the statistics are purged can be controlled in the Statistics Gathering page in the repository options.

Cheers,
Ryan

Yes, we had it throttled at 200. I set it to 50000 yesterday, which made housecleaning take ~2-3 hours per cycle…

Correction: it completely broke housecleaning… the last process has been running since yesterday:

root 21420 0.4 0.6 1775208 207428 ? Sl Sep17 4:35 /opt/mono-2.10.9/bin/mono /opt/Thinkbox/Deadline6/bin/deadlinecommand.exe -DoHouseCleaning 10 True

We lost a bunch of sims because of this, as stalled machines weren’t detected :\

Btw, I spot-checked a couple of log files from the log folders that are 1-2+ months old, and they weren’t listed in this deleted-entries collection. Is that expected/possible?

It’s definitely possible because of the random nature of how housecleaning used to work (we’ve removed the randomness in 6.2.1). Maybe throttle it at 2000 for now and see how it does at catching up.

If anything, it’s slowly creeping up… by now it’s at 5.6 million:

DeletedJobReportEntries
namespace: deadline6db.DeletedJobReportEntries
objects: 5640976
average object size: 746.0 Bytes

Might need to do something like disable the threshold temporarily, manually launch deadlinecommand -dohousecleaning on a random machine (see the sketch below), and then once it starts doing the reports, re-enable the threshold again. Note that on the machine you run it from, you’ll want 6.2.1 installed so that it’s guaranteed to do the reports. This way, that one machine can keep chugging away at cleaning up those reports without blocking your regular housecleaning.
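For example, mirroring the process from your ps output earlier (adjust the mono and Deadline paths to your install):

    # Run a full housecleaning pass by hand from one 6.2.1 machine.
    /opt/mono-2.10.9/bin/mono /opt/Thinkbox/Deadline6/bin/deadlinecommand.exe -DoHouseCleaning 10 True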

Good idea, I’ll do this.

I am getting a lot of errors in the cleanup (I noticed errors like this in the regular Pulse log as well over the past couple of days):

    Job Cleanup Scan - Warning: completed job "[EXO] RS_190_2060_v0811_str_TurbAllDirect_cache_flowline_LeadSpray_5 " could not be archived because: An unexpected error occurred while interacting with the database (deadline01.scanlinevfxla.com:27017,deadline.s
canlinevfxla.com:27017,deadline03.scanlinevfxla.com:27017):
Value cannot be null.
Parameter name: document (FranticX.Database.DatabaseConnectionException)
    Job Cleanup Scan - Warning: completed job "[EXO] RS_190_2060_v0796_str_MoreCascade_images_render3d_FL-SprayMistOnly_L_0 " could not be archived because: An unexpected error occurred while interacting with the database (deadline01.scanlinevfxla.com:27017,de
adline.scanlinevfxla.com:27017,deadline03.scanlinevfxla.com:27017):
Value cannot be null.
Parameter name: document (FranticX.Database.DatabaseConnectionException)
    Job Cleanup Scan - Warning: completed job "[EXO] RS_190_2060_v0780_str_Test_images_render3d_FL-LeadMist_L_0 " could not be archived because: An unexpected error occurred while interacting with the database (deadline01.scanlinevfxla.com:27017,deadline.scanl
inevfxla.com:27017,deadline03.scanlinevfxla.com:27017):
Value cannot be null.
Parameter name: document (FranticX.Database.DatabaseConnectionException)
    Job Cleanup Scan - Warning: completed job "[EXO] RS_190_2060_v0781_str_NoLarge_cache_flowline_LeadMist_0 " could not be archived because: The file '\\inferno2\deadline\repository6\jobsArchived\stephan.trojansky__3dsmax__[EXO] RS_190_2060_v0781_str_NoLarge_
cache_flowline_LeadMist_0 __540cd87b02949841a89c8807.zip' already exists. (System.IO.IOException)
    Job Cleanup Scan - Warning: completed job "[EXO] RS_190_2060_v0786_str_FullSmallBucket_cache_flowline_LeadSpray_14 " could not be archived because: An unexpected error occurred while interacting with the database (deadline01.scanlinevfxla.com:27017,deadlin
e.scanlinevfxla.com:27017,deadline03.scanlinevfxla.com:27017):
Value cannot be null.
Parameter name: document (FranticX.Database.DatabaseConnectionException)
    Job Cleanup Scan - Archived completed job "[EXO] RS_190_7030_v0014_lle_Add0_2_cache_flowline_OceanDif_0 " because Auto Job Cleanup is enabled and this job has been complete for more than 10 days.
    Job Cleanup Scan - Warning: completed job "[EXO] RS_190_2060_v0798_str_ImportInstead_cache_flowline_LeadHappy_0 " could not be archived because: An unexpected error occurred while interacting with the database (deadline01.scanlinevfxla.com:27017,deadline.s
canlinevfxla.com:27017,deadline03.scanlinevfxla.com:27017):
Value cannot be null.
Parameter name: document (FranticX.Database.DatabaseConnectionException)
    Job Cleanup Scan - Warning: completed job "[EXO] RS_190_2060_v0786_str_FullSmallBucket_cache_flowline_LeadSpray_4 " could not be archived because: An unexpected error occurred while interacting with the database (deadline01.scanlinevfxla.com:27017,deadline
.scanlinevfxla.com:27017,deadline03.scanlinevfxla.com:27017):
Value cannot be null.
Parameter name: document (FranticX.Database.DatabaseConnectionException)
    Job Cleanup Scan - Warning: completed job "[EXO] RS_190_2060_v0781_str_NoLarge_cache_flowline_LeadSpray_0 " could not be archived because: An unexpected error occurred while interacting with the database (deadline01.scanlinevfxla.com:27017,deadline.scanlin
evfxla.com:27017,deadline03.scanlinevfxla.com:27017):
Value cannot be null.
Parameter name: document (FranticX.Database.DatabaseConnectionException)
    Job Cleanup Scan - Warning: completed job "[EXO] RS_190_2060_v0786_str_FullSmallBucket_cache_flowline_LeadSpray_3 " could not be archived because: The file '\\inferno2\deadline\repository6\jobsArchived\stephan.trojansky__3dsmax__[EXO] RS_190_2060_v0786_str
_FullSmallBucket_cache_flowline_LeadSpray_3 __540ce8db0294983f10e5fe5b.zip' already exists. (System.IO.IOException)
    Job Cleanup Scan - Archived completed job "[EXO] RS_190_2060_v0770_str_FixedWaveDisp_images_render3d_FL-FoamCaps_L_0 " because Auto Job Cleanup is enabled and this job has been complete for more than 10 days.
    Job Cleanup Scan - Warning: completed job "[EXO] RS_190_7030_v0013_lle_Add0_5_images_render3d_FL-OceanDif_0 " could not be archived because: An unexpected error occurred while interacting with the database (deadline01.scanlinevfxla.com:27017,deadline.scanl
inevfxla.com:27017,deadline03.scanlinevfxla.com:27017):
Value cannot be null.
Parameter name: document (FranticX.Database.DatabaseConnectionException)
    Job Cleanup Scan - Warning: completed job "[EXO] RS_190_2060_v0791_str_NewAttempt3_cache_flowline_MistFine_0 " could not be archived because: The process cannot access the file '\\inferno2\deadline\repository6\jobsArchived\stephan.trojansky__3dsmax__[EXO]
RS_190_2060_v0791_str_NewAttempt3_cache_flowline_MistFine_0 __540d10190294984c34811045.zip' because it is being used by another process. (System.IO.IOException)
    Job Cleanup Scan - Archived completed job "[EXO] RS_190_2060_v0792_str_Safety_cache_flowline_LeadHappy_1 " because Auto Job Cleanup is enabled and this job has been complete for more than 10 days.
    Job Cleanup Scan - Warning: completed job "[EXO] RS_190_2060_v0785_str_WithCollision_images_render3d_FL-LeadSpray_L_0 " could not be archived because: An unexpected error occurred while interacting with the database (deadline01.scanlinevfxla.com:27017,dead
line.scanlinevfxla.com:27017,deadline03.scanlinevfxla.com:27017):
Value cannot be null.
Parameter name: document (FranticX.Database.DatabaseConnectionException)
    Job Cleanup Scan - Warning: completed job "[EXO] RS_190_2060_v0786_str_FullSmallBucket_cache_flowline_LeadSpray_2 " could not be archived because: The file '\\inferno2\deadline\repository6\jobsArchived\stephan.trojansky__3dsmax__[EXO] RS_190_2060_v0786_str
_FullSmallBucket_cache_flowline_LeadSpray_2 __540ce8d90294981f481e0bb9.zip' already exists. (System.IO.IOException)
    Job Cleanup Scan - Warning: completed job "[EXO] RS_190_2060_v0791_str_NewAttempt3_cache_flowline_Spray_3 " could not be archived because: The file '\\inferno2\deadline\repository6\jobsArchived\stephan.trojansky__3dsmax__[EXO] RS_190_2060_v0791_str_NewAtte
mpt3_cache_flowline_Spray_3 __540d100b02949816a85609f3.zip' already exists. (System.IO.IOException)
    Job Cleanup Scan - Warning: completed job "[EXO] RS_190_2060_v0813_str_Update_cache_flowline_WaveFoam_0 " could not be archived because: The file '\\inferno2\deadline\repository6\jobsArchived\stephan.trojansky__3dsmax__[EXO] RS_190_2060_v0813_str_Update_ca
che_flowline_WaveFoam_0 __540d60020294983d10fa16f3.zip' already exists. (System.IO.IOException)
    Job Cleanup Scan - Warning: completed job "[EXO] RS_190_2060_v0786_str_FullSmallBucket_cache_flowline_LeadSpray_1 " could not be archived because: An unexpected error occurred while interacting with the database (deadline01.scanlinevfxla.com:27017,deadline
.scanlinevfxla.com:27017,deadline03.scanlinevfxla.com:27017):
Value cannot be null.
Parameter name: document (FranticX.Database.DatabaseConnectionException)
    Job Cleanup Scan - Archived completed job "[EXO] RS_190_2060_v0784_str_BetterSplash_cache_flowline_LeadMist_0 " because Auto Job Cleanup is enabled and this job has been complete for more than 10 days.
    Job Cleanup Scan - Warning: completed job "[EXO] RS_190_2060_v0784_str_BetterSplash_cache_flowline_LeadSpray_0 " could not be archived because: An unexpected error occurred while interacting with the database (deadline01.scanlinevfxla.com:27017,deadline.sc
anlinevfxla.com:27017,deadline03.scanlinevfxla.com:27017):
Value cannot be null.
Parameter name: document (FranticX.Database.DatabaseConnectionException)

Could you maybe archive one or two of the jobs you’re getting those errors for and post them (or email them to us directly if they contain private information)? I’m guessing that there is some corruption in the job object, and a database repair might resolve that, but it would be good to get them over here so we can see if we can reproduce and get a better understanding of what the problem is.

Thanks!
Ryan

I can’t actually see these jobs when filtering by job ID in the Monitor… maybe they’re leftover db entries from a half-finished cleanup, or from competing cleanups?

I’ll let a full housecleaning go through, then see if I get these errors in the next one I run.

Could be…

We’ll be tweaking that “unexpected error” message in the next build to show the stack trace and exception type. At the very least, that can point to the code that is causing the error in the first place.

Seems like running the cleanup on another machine with no limits doesn’t really work. It re-queries the limits from the repo as it goes… so it runs for 1-2+ hours before it gets to the report file deletion, and by then it picks up the (once again) reduced limit instead of the unlimited value that was set when the executable was started.

So I’m not sure how to clean things up :)

Here’s a deadline.dll that you can drop into the bin folder on the machine you’re running the housecleaning command from. Keep in mind that this has only been tested against 6.2.1.39 (beta 2). We’ve updated the DoHouseCleaning command to accept a third optional argument, which indicates the mode you want to use. Here are the new usage instructions:

DoHouseCleaning
  Performs house cleaning operations
    [Random Multiplier]      Optional (deprecated, just specify 0).
    [Verbose]                Optional. If logging should be enabled
                             (true/false).
    [Mode]                   Optional. If not specified, all housecleaning
                             operations will be performed. Available modes
                             are: CleanupCompletedJobs,
                             DeleteUnsubmittedJobs, PurgeDeletedJobs,
                             PurgeOldAuxFiles, PurgeOldJobReports,
                             FindOrphanedTasks, CleanupObsoleteSlaves,
                             PurgeOldSlaveReports, PurgeJobMachineLimits,
                             FindOrphanedLimitStubs, CleanupRepoValidations,
                             PurgeOldStatistics, CleanupDatabase,
                             FindStalledSlaves

So to purge reports, just run this:

deadlinecommand.exe -dohousecleaning 0 true purgeoldjobreports

It will still respect the caps on the number of items to purge, but you can just run this in a loop over and over.
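For example, a rough wrapper like this would keep it chugging until the backlog is gone (paths as in your environment; stop it with Ctrl+C):

    # Keep purging old job reports in batches; each pass still respects the
    # purge cap configured in the Repository Options.
    while true; do
        /opt/mono-2.10.9/bin/mono /opt/Thinkbox/Deadline6/bin/deadlinecommand.exe -dohousecleaning 0 true purgeoldjobreports
    done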

Hope this helps!
Ryan
deadline.zip (465 KB)

Also, this build will now show a stack trace for unexpected MongoDB exceptions, like the ones you were seeing during the job cleanup operation. So if you are still seeing those errors, post the new message, and if possible, could you export the JSON for the job directly from the database and send it to us? Then we could import it directly into our database and see if we can reproduce the error.
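Something like mongoexport should do it. A sketch only: the collection name “Jobs” is an assumption here, so check with “show collections” in the mongo shell first, and if the _id is stored as an ObjectId rather than a string you’ll need to wrap it as {"_id": {"$oid": "…"}}:

    # Export one job document as JSON. The _id below is just the one taken
    # from the archive file name in your log, as an example.
    mongoexport --host deadline01.scanlinevfxla.com --db deadline6db --collection Jobs \
        --query '{"_id": "540cd87b02949841a89c8807"}' --out job.json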

Thanks Ryan, I’ll give this a go right now! Does it get stopped by the housecleaning lock file? I had to stop Pulse, kill the processes, and delete the lock file to be able to run it from another machine, then restart Pulse.

No, it doesn’t check for the lock file when a mode is specified.

Doing it in batches of 5000, I see the number slowly decreasing. It takes between 2-4 minutes to clean out 5000 reports, so hopefully it gets cleared out within the next 2-3 days.
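Something like this makes it easy to watch the drain rate while the batches run (a sketch; the host is one of our db servers, adjust as needed):

    # Print a timestamped backlog count every minute.
    while true; do
        mongo --host deadline01.scanlinevfxla.com deadline6db --quiet \
            --eval "print(Date() + '  ' + db.DeletedJobReportEntries.count())"
        sleep 60
    done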

I think the problem is that the cycle time of the housecleaning can be 20-30 minutes. In that time, we generate tens of thousands of reports. But if we set that cleanup threshold too high, housecleaning takes that much longer and doesn’t notice stalled slaves etc. (which is already a problem).

In 7, you guys split these into separate threads, right? The repair, housecleaning, pending scan?
