We've gotten a lot of these in the last couple of days. You know about our db overload issues; I'm guessing it's somehow related. The only way to fix jobs like this is to requeue them and then manually mark them complete.
We have about 200 jobs like this right now. Sadly, this means that jobs that depend on these will never trigger.
Randomly spot-checking a couple of these jobs:
{ "_id" : "5424449bdcad6c0538d484cc", "LastWriteTime" : { "$date" : 1411697837690 }, "Props" : { "Limit" : 500, "RelPer" : -1, "Slaves" : [], "White" : false, "SlavesEx" : [] }, "Name" : "5424449bdcad6c0538d484cc", "Stubs" : [ { "Holder" : "lapro0607-secondary", "Time" : { "$date" : 1411697837690 } } ], "StubCount" : 1, "StubLevel" : 0, "Type" : 1 }
These don't seem to have any phantom holders:
{ "_id" : "54248f44162dfe20641cd1f4", "LastWriteTime" : { "$date" : 1411697915918 }, "Props" : { "Limit" : 1, "RelPer" : -1, "Slaves" : [], "White" : false, "SlavesEx" : [] }, "Name" : "54248f44162dfe20641cd1f4", "Stubs" : [], "StubCount" : 0, "StubLevel" : 0, "Type" : 1 }
{ "_id" : "54249e9e9154b52dd0ccbd8c", "LastWriteTime" : { "$date" : 1411697915903 }, "Props" : { "Limit" : 1, "RelPer" : -1, "Slaves" : [], "White" : false, "SlavesEx" : [] }, "Name" : "54249e9e9154b52dd0ccbd8c", "Stubs" : [], "StubCount" : 0, "StubLevel" : 0, "Type" : 1 }
Uploaded the full job JSONs for these 3:
json__5424449bdcad6c0538d484cc.tar (120 KB)
json__54248f44162dfe20641cd1f4.tar (120 KB)
json__54249e9e9154b52dd0ccbd8c.tar (80 KB)
It's weird that you can have a negative queued chunk count:
"Tasks" : 115, "CompletedChunks" : 115, "QueuedChunks" : -1, "SuspendedChunks" : 0, "RenderingChunks" : 1, "FailedChunks" : 0, "PendingChunks" : 0
Another one, where all tasks have finished:
"Tasks" : 202, "CompletedChunks" : 201, "QueuedChunks" : 0, "SuspendedChunks" : 0, "RenderingChunks" : 1, "FailedChunks" : 0, "PendingChunks" : 0
json__5421d2dd14529e0aa8436466.tar (130 KB)
It's definitely related. We use increment/decrement queries to set the chunk counts in the job object as tasks change state, and if one of those connections times out, the counts can drift like this. It seems like your db load has stabilized, which should mean you'll see this issue much less often. We are still looking into ways to prevent it from happening at all.
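To make the failure mode concrete, here is a minimal sketch (the field names come from the job JSON above, but the retry logic is hypothetical, not our actual code): if the server applies a MongoDB-style $inc but the acknowledgement times out, a naive client retry applies the same decrement twice, which is exactly how QueuedChunks ends up at -1.

```python
# Simulates a retried increment/decrement after a lost ack.
# Hypothetical sketch; field names match the job documents above.

job = {"Tasks": 115, "QueuedChunks": 1, "CompletedChunks": 114}

def apply_inc(doc, deltas):
    """Server-side effect of a MongoDB-style $inc update."""
    for field, delta in deltas.items():
        doc[field] += delta

def complete_task(doc, ack_lost=False):
    """Client marks one chunk complete; retries if the ack times out."""
    deltas = {"QueuedChunks": -1, "CompletedChunks": 1}
    apply_inc(doc, deltas)          # server applies the update...
    if ack_lost:                    # ...but the ack never arrives,
        apply_inc(doc, deltas)      # so the client blindly retries the same $inc

complete_task(job, ack_lost=True)   # one real completion, applied twice
print(job)  # {'Tasks': 115, 'QueuedChunks': -1, 'CompletedChunks': 116}
```

One common mitigation is to make the decrement conditional, e.g. filtering the update on {"QueuedChunks": {"$gt": 0}}, so a duplicate retry matches no document instead of pushing the count negative.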
Cheers,
Ryan
We are still seeing this, even after the db load stabilized. New jobs are getting it in a relatively calm db environment.
We have jobs submitted around 10pm last night (already in the calm period) that exhibit this behavior.
We have several hundred of these.