
Inconsistent job states

We've been getting a lot of these over the last couple of days. You know about our DB overload issues; I'm guessing it's somehow related. The only way to fix jobs like this is to requeue them and then manually mark them complete.

We have about 200 jobs like this right now; sadly, this means that jobs that depend on them will never trigger.
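
For anyone scripting the workaround, here is a minimal sketch of the requeue-then-complete fix; it assumes the affected job IDs have been collected into a text file and that your Deadline version exposes RequeueJob and CompleteJob through deadlinecommand (the file name is hypothetical):

# Sketch: apply the requeue-then-mark-complete workaround to a list of job IDs.
# Assumes deadlinecommand is on the PATH and supports RequeueJob / CompleteJob;
# "stuck_jobs.txt" (one job ID per line) is a hypothetical input file.
import subprocess

with open("stuck_jobs.txt") as f:
    job_ids = [line.strip() for line in f if line.strip()]

for job_id in job_ids:
    subprocess.run(["deadlinecommand", "RequeueJob", job_id], check=True)
    subprocess.run(["deadlinecommand", "CompleteJob", job_id], check=True)
    print("fixed", job_id)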

Randomly spot-checking a couple of these jobs:

{ "_id" : "5424449bdcad6c0538d484cc", "LastWriteTime" : { "$date" : 1411697837690 }, "Props" : { "Limit" : 500, "RelPer" : -1, "Slaves" : [], "White" : false, "SlavesEx" : [] }, "Name" : "5424449bdcad6c0538d484cc", "Stubs" : [ { "Holder" : "lapro0607-secondary", "Time" : { "$date" : 1411697837690 } } ], "StubCount" : 1, "StubLevel" : 0, "Type" : 1 }

These don't seem to have any phantom holders:
{ "_id" : "54248f44162dfe20641cd1f4", "LastWriteTime" : { "$date" : 1411697915918 }, "Props" : { "Limit" : 1, "RelPer" : -1, "Slaves" : [], "White" : false, "SlavesEx" : [] }, "Name" : "54248f44162dfe20641cd1f4", "Stubs" : [], "StubCount" : 0, "StubLevel" : 0, "Type" : 1 }

{ "_id" : "54249e9e9154b52dd0ccbd8c", "LastWriteTime" : { "$date" : 1411697915903 }, "Props" : { "Limit" : 1, "RelPer" : -1, "Slaves" : [], "White" : false, "SlavesEx" : [] }, "Name" : "54249e9e9154b52dd0ccbd8c", "Stubs" : [], "StubCount" : 0, "StubLevel" : 0, "Type" : 1 }

Uploaded the full job JSONs for these three:
json__5424449bdcad6c0538d484cc.tar (120 KB)
json__54248f44162dfe20641cd1f4.tar (120 KB)
json__54249e9e9154b52dd0ccbd8c.tar (80 KB)
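
For what it's worth, a minimal sketch of how the stub state could be cross-checked against the jobs that are supposed to hold the stubs, assuming direct pymongo access; the database and collection names (deadlinedb, LimitGroups, Jobs) are assumptions, and the job-specific limits above use the job ID as both _id and Name:

# Sketch: flag job-specific limits whose stub state disagrees with the job's
# rendering state (e.g. stubs still held while nothing is rendering, or the
# reverse). Database/collection names are assumptions, not the real schema.
from pymongo import MongoClient

client = MongoClient("mongodb://deadline-db:27017")   # hypothetical host
db = client["deadlinedb"]

for limit in db["LimitGroups"].find({"Type": 1}):      # the documents above all have Type 1
    job = db["Jobs"].find_one({"_id": limit["_id"]}, ["RenderingChunks"])
    if job is None:
        continue
    rendering = job.get("RenderingChunks", 0)
    if limit.get("StubCount", 0) != rendering:
        print(limit["_id"], "stubs held:", limit.get("StubCount", 0),
              "rendering chunks:", rendering)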

It's weird that you can have a negative queued chunk count:

"Tasks" : 115, "CompletedChunks" : 115, "QueuedChunks" : -1, "SuspendedChunks" : 0, "RenderingChunks" : 1, "FailedChunks" : 0, "PendingChunks" : 0

Another one, with all tasks finished:

"Tasks" : 202, "CompletedChunks" : 201, "QueuedChunks" : 0, "SuspendedChunks" : 0, "RenderingChunks" : 1, "FailedChunks" : 0, "PendingChunks" : 0,
json__5421d2dd14529e0aa8436466.tar (130 KB)
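
A minimal sketch of a query that could flag jobs in this state, assuming direct pymongo access to the Deadline database and that the counters live on the job documents under the field names shown above (the database and collection names are assumptions):

# Sketch: find job documents whose per-state chunk counts are negative or no
# longer sum to the task count. Database/collection names are assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb://deadline-db:27017")   # hypothetical host
jobs = client["deadlinedb"]["Jobs"]

count_fields = ["CompletedChunks", "QueuedChunks", "SuspendedChunks",
                "RenderingChunks", "FailedChunks", "PendingChunks"]

for job in jobs.find({}, ["Tasks"] + count_fields):
    counts = [job.get(f, 0) for f in count_fields]
    if min(counts) < 0 or sum(counts) != job.get("Tasks", 0):
        print(job["_id"], dict(zip(count_fields, counts)))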

It's definitely related. We use increment/decrement queries to set the chunk counts in the job object as tasks change state, and if one of those connections times out, this can happen. It seems like your DB load has stabilized, which should mean you'll see this issue much less often. We are still looking into ways to prevent this from happening.

Cheers,
Ryan
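
A minimal sketch of the increment/decrement pattern Ryan describes, plus one way the counters could be reconciled afterwards, assuming pymongo and assumed collection and field names (Jobs, JobTasks, JobID, Status); this is illustrative, not the actual Deadline implementation:

# Sketch: counters maintained with $inc can drift when an update is lost to a
# connection timeout (or applied twice on retry); recomputing them from the
# per-task records restores consistency. Names here are assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb://deadline-db:27017")   # hypothetical host
db = client["deadlinedb"]

def on_task_completed(job_id):
    # The drift-prone pattern: two counters changed relative to their current
    # values, with no record of whether the change actually landed.
    db["Jobs"].update_one({"_id": job_id},
                          {"$inc": {"RenderingChunks": -1, "CompletedChunks": 1}})

def reconcile_job(job_id):
    # Recompute the counters from the authoritative per-task state.
    counts = {s: 0 for s in ("Queued", "Rendering", "Completed",
                             "Suspended", "Failed", "Pending")}
    for task in db["JobTasks"].find({"JobID": job_id}, ["Status"]):
        status = task.get("Status", "Queued")
        counts[status] = counts.get(status, 0) + 1
    db["Jobs"].update_one({"_id": job_id}, {"$set": {
        "QueuedChunks": counts["Queued"],
        "RenderingChunks": counts["Rendering"],
        "CompletedChunks": counts["Completed"],
        "SuspendedChunks": counts["Suspended"],
        "FailedChunks": counts["Failed"],
        "PendingChunks": counts["Pending"],
    }})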

We are still seeing this, even after the DB load stabilized. New jobs are getting into this state in a relatively calm DB environment.

We have jobs submitted around 10 pm last night (already in the calm period) that exhibit this behavior.

We have several hundred of these.
