Hey guys,
We are running into symptoms that tell me certain DB operations are not done in a transactional manner. For example, the job status gets out of sync with the actual status of the tasks, blocking jobs from ever picking up. Our wranglers are playing whack-a-mole with these jobs…
Attached is a job that was submitted with a script / per-frame dependency. All frames are either queued or rendered and complete, but the job itself says it's pending… It will never pick up until someone notices this and resubmits the job. With ~10,000 jobs in the queue, you can imagine that having this happen randomly to 10-20 jobs every day is not desirable…
cheers
laszlo
edit: suspend/resume doesn't always fix it (if at all), changed to ‘resubmit’
json__53c8a96ce74b540ef835e4ac.tar (210 KB)
Hey Laszlo,
We’re aware of this issue, and this tends to pop up when the database is getting hammered. The performance improvements in v7 will help address this, but we will also be looking at the possibility of adding an option to easily get a job back out of this inconsistent state. We might be able to handle this with housecleaning as well, but that would be an expensive housecleaning operation since every job and every job’s tasks will have to be loaded (unless maybe we just focus on non-complete jobs).
Cheers,
Ryan
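To make the housecleaning idea above concrete, here is a minimal sketch of a consistency pass that only looks at non-complete jobs. Everything in it is an assumption made for illustration: the status names, the rule for deriving a job status from its task statuses, and the in-memory dictionaries standing in for the database. It is not how Deadline actually stores or repairs jobs.

```python
# Hypothetical sketch: find jobs whose stored status disagrees with their tasks.
# Status names and the derivation rule are assumptions, not Deadline's real logic.

def expected_job_status(task_statuses):
    """Derive the status a job should have from the statuses of its tasks."""
    statuses = set(task_statuses)
    if statuses == {"completed"}:
        return "completed"
    if "pending" in statuses:
        return "pending"
    return "queued"


def find_inconsistent_jobs(jobs, tasks_by_job):
    """Return {job_id: expected_status} for non-complete jobs that look wrong.

    Completed jobs are skipped so the pass stays cheap on a large queue.
    """
    repairs = {}
    for job_id, job in jobs.items():
        if job["status"] == "completed":
            continue
        task_statuses = tasks_by_job.get(job_id, [])
        if not task_statuses:
            continue  # nothing to compare against
        expected = expected_job_status(task_statuses)
        if expected != job["status"]:
            repairs[job_id] = expected
    return repairs


jobs = {
    "a": {"status": "pending"},    # stuck: no task is actually pending
    "b": {"status": "queued"},     # consistent
    "c": {"status": "completed"},  # skipped by the pass
}
tasks_by_job = {
    "a": ["queued", "completed"],
    "b": ["queued"],
    "c": ["completed"],
}
print(find_inconsistent_jobs(jobs, tasks_by_job))
```

In this example job "a" matches the symptom from the first post: all of its frames are queued or complete, yet the job still says pending.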
Is there no way to do transactions with Mongo? I guess in 7 it would be even harder with split databases… The main problem, I guess, is partial updates of the information, leading to inconsistencies.
No, unfortunately transactions are not supported in mongo.
I think some sort of protection scheme is required: either checking results after updates, or giving slaves the ability to mark a job as “to be fixed” (as in “I’m pretty sure this is bad, because I lost connection midway through updating”) so that it can be fixed later by Pulse.
Right now, the impression is that Deadline is flaky and unreliable, because random jobs get into states where they never pick up anymore.
Less load on the DB will just make this problem rarer, or make it reappear only when we next upgrade our farm with more procs; it is not really a fix, in my opinion.
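A rough sketch of the "mark it to be fixed" idea, under stated assumptions: the FlakyStore class is a hypothetical in-memory stand-in for the database, and the repair list is kept locally on the slave so it survives the lost connection and can be reported later. None of these names come from Deadline.

```python
# Hypothetical sketch: a slave remembers jobs it may have left half-updated.

class FlakyStore:
    """In-memory stand-in for the database that can fail on the Nth write."""

    def __init__(self, fail_on_call=None):
        self.docs = {}
        self.calls = 0
        self.fail_on_call = fail_on_call

    def write(self, key, value):
        self.calls += 1
        if self.calls == self.fail_on_call:
            raise ConnectionError("lost connection")
        self.docs[key] = value


def complete_task(store, needs_repair, job_id, task_id):
    """Update the task and then the job; on failure, flag the job locally."""
    try:
        store.write(("task", job_id, task_id), "completed")
        store.write(("job", job_id), "completed")
    except ConnectionError:
        # The two writes may be out of sync now, so remember the job.
        needs_repair.add(job_id)
        return False
    return True


store = FlakyStore(fail_on_call=2)  # the job write fails
needs_repair = set()
ok = complete_task(store, needs_repair, "j1", 0)
print(ok, needs_repair, store.docs)
```

Here the task write lands but the job write does not, which is exactly the partial update described above; the slave ends up with "j1" in its local repair set.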
I’m not sure how reliable an additional check would be, given that the slave’s connection could still be lost at the time of the check.
I found this section in the MongoDB manual:
docs.mongodb.org/manual/tutorial … e-commits/
That might be a good lead, since if we have a record of any attempted transactions, we can roll back or repair the job as part of the regular housecleaning operations. That would be much more efficient than combing over all jobs. Of course, this adds additional write operations, but for the sake of reliability, it’s worth it. We’ll definitely be having some internal discussions on the best way to solve this issue.
Cheers,
Ryan
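For readers following along, the "record of attempted transactions" idea could look roughly like the sketch below. It is a simplified, roll-forward variant of the two-phase-commit pattern, written against plain dictionaries rather than MongoDB, and every name in it (begin, apply_updates, recover, the record fields) is hypothetical. The key assumption is that the updates are idempotent, so housecleaning can safely re-apply them.

```python
# Hypothetical sketch: journal each multi-document update so that
# housecleaning only has to look at unfinished journal records.

def begin(journal, txn_id, job_id, task_id, task_status, job_status):
    """Record the intended update before touching any task or job."""
    journal[txn_id] = {
        "state": "pending",
        "job_id": job_id,
        "task_id": task_id,
        "task_status": task_status,
        "job_status": job_status,
    }


def apply_updates(txn, tasks, jobs):
    """Apply both writes; safe to repeat because they just set values."""
    tasks[(txn["job_id"], txn["task_id"])] = txn["task_status"]
    jobs[txn["job_id"]] = txn["job_status"]


def commit(journal, txn_id, tasks, jobs):
    """Normal path: apply the updates, then mark the record as done."""
    txn = journal[txn_id]
    apply_updates(txn, tasks, jobs)
    txn["state"] = "done"


def recover(journal, tasks, jobs):
    """Housecleaning: finish every record that was never marked done."""
    repaired = []
    for txn_id, txn in journal.items():
        if txn["state"] == "pending":
            apply_updates(txn, tasks, jobs)
            txn["state"] = "done"
            repaired.append(txn_id)
    return repaired


journal, tasks, jobs = {}, {}, {"j1": "pending"}
begin(journal, "t1", "j1", 0, "completed", "completed")
tasks[("j1", 0)] = "completed"  # simulate a crash after the first write only
print(recover(journal, tasks, jobs), jobs)
```

The extra journal write is the cost mentioned above; in exchange, the repair pass scales with the number of unfinished records instead of the number of jobs.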