
Inconsistent job states

We've been getting a lot of these over the last couple of days. You know about our DB overload issues; I'm guessing it's somehow related. The only way to fix jobs like this is to requeue them and then manually mark them complete.

We have about 200 jobs like this right now; sadly, this means that jobs that depend on them will never trigger.
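
For anyone scripting the workaround, here is a minimal sketch of the requeue-then-complete fix; it assumes the affected job IDs have been collected into a text file and that your Deadline version exposes RequeueJob and CompleteJob through deadlinecommand (the file name is hypothetical):

# Sketch: apply the requeue-then-mark-complete workaround to a list of job IDs.
# Assumes deadlinecommand is on the PATH and supports RequeueJob / CompleteJob;
# "stuck_jobs.txt" (one job ID per line) is a hypothetical input file.
import subprocess

with open("stuck_jobs.txt") as f:
    job_ids = [line.strip() for line in f if line.strip()]

for job_id in job_ids:
    subprocess.run(["deadlinecommand", "RequeueJob", job_id], check=True)
    subprocess.run(["deadlinecommand", "CompleteJob", job_id], check=True)
    print("fixed", job_id)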

Randomly spot-checking a couple of these jobs:

{ "_id" : "5424449bdcad6c0538d484cc", "LastWriteTime" : { "$date" : 1411697837690 }, "Props" : { "Limit" : 500, "RelPer" : -1, "Slaves" : [], "White" : false, "SlavesEx" : [] }, "Name" : "5424449bdcad6c0538d484cc", "Stubs" : [ { "Holder" : "lapro0607-secondary", "Time" : { "$date" : 1411697837690 } } ], "StubCount" : 1, "StubLevel" : 0, "Type" : 1 }

These don't seem to have any phantom holders:
{ "_id" : "54248f44162dfe20641cd1f4", "LastWriteTime" : { "$date" : 1411697915918 }, "Props" : { "Limit" : 1, "RelPer" : -1, "Slaves" : [], "White" : false, "SlavesEx" : [] }, "Name" : "54248f44162dfe20641cd1f4", "Stubs" : [], "StubCount" : 0, "StubLevel" : 0, "Type" : 1 }

{ "_id" : "54249e9e9154b52dd0ccbd8c", "LastWriteTime" : { "$date" : 1411697915903 }, "Props" : { "Limit" : 1, "RelPer" : -1, "Slaves" : [], "White" : false, "SlavesEx" : [] }, "Name" : "54249e9e9154b52dd0ccbd8c", "Stubs" : [], "StubCount" : 0, "StubLevel" : 0, "Type" : 1 }

Uploaded the full job JSONs for these three:
json__5424449bdcad6c0538d484cc.tar (120 KB)
json__54248f44162dfe20641cd1f4.tar (120 KB)
json__54249e9e9154b52dd0ccbd8c.tar (80 KB)
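
For what it's worth, a minimal sketch of how the stub state could be cross-checked against the jobs that are supposed to hold the stubs, assuming direct pymongo access; the database and collection names (deadlinedb, LimitGroups, Jobs) are assumptions, and the job-specific limits above use the job ID as both _id and Name:

# Sketch: flag job-specific limits whose stub state disagrees with the job's
# rendering state (e.g. stubs still held while nothing is rendering, or the
# reverse). Database/collection names are assumptions, not the real schema.
from pymongo import MongoClient

client = MongoClient("mongodb://deadline-db:27017")   # hypothetical host
db = client["deadlinedb"]

for limit in db["LimitGroups"].find({"Type": 1}):      # the documents above all have Type 1
    job = db["Jobs"].find_one({"_id": limit["_id"]}, ["RenderingChunks"])
    if job is None:
        continue
    rendering = job.get("RenderingChunks", 0)
    if limit.get("StubCount", 0) != rendering:
        print(limit["_id"], "stubs held:", limit.get("StubCount", 0),
              "rendering chunks:", rendering)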

It's weird that you can have a negative queued chunk count:

"Tasks" : 115, "CompletedChunks" : 115, "QueuedChunks" : -1, "SuspendedChunks" : 0, "RenderingChunks" : 1, "FailedChunks" : 0, "PendingChunks" : 0

Another one, with all tasks finished:

"Tasks" : 202, "CompletedChunks" : 201, "QueuedChunks" : 0, "SuspendedChunks" : 0, "RenderingChunks" : 1, "FailedChunks" : 0, "PendingChunks" : 0,
json__5421d2dd14529e0aa8436466.tar (130 KB)
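
A minimal sketch of a query that could flag jobs in this state, assuming direct pymongo access to the Deadline database and that the counters live on the job documents under the field names shown above (the database and collection names are assumptions):

# Sketch: find job documents whose per-state chunk counts are negative or no
# longer sum to the task count. Database/collection names are assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb://deadline-db:27017")   # hypothetical host
jobs = client["deadlinedb"]["Jobs"]

count_fields = ["CompletedChunks", "QueuedChunks", "SuspendedChunks",
                "RenderingChunks", "FailedChunks", "PendingChunks"]

for job in jobs.find({}, ["Tasks"] + count_fields):
    counts = [job.get(f, 0) for f in count_fields]
    if min(counts) < 0 or sum(counts) != job.get("Tasks", 0):
        print(job["_id"], dict(zip(count_fields, counts)))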

It's definitely related. We use increment/decrement queries to set the chunk counts in the job object as tasks change state, and if one of those connections times out, this can happen. It seems like your DB load has stabilized, which should mean you'll see this issue much less often. We are still looking into ways to prevent this from happening.

Cheers,
Ryan
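
A minimal sketch of the increment/decrement pattern Ryan describes, plus one way the counters could be reconciled afterwards, assuming pymongo and assumed collection and field names (Jobs, JobTasks, JobID, Status); this is illustrative, not the actual Deadline implementation:

# Sketch: counters maintained with $inc can drift when an update is lost to a
# connection timeout (or applied twice on retry); recomputing them from the
# per-task records restores consistency. Names here are assumptions.
from pymongo import MongoClient

client = MongoClient("mongodb://deadline-db:27017")   # hypothetical host
db = client["deadlinedb"]

def on_task_completed(job_id):
    # The drift-prone pattern: two counters changed relative to their current
    # values, with no record of whether the change actually landed.
    db["Jobs"].update_one({"_id": job_id},
                          {"$inc": {"RenderingChunks": -1, "CompletedChunks": 1}})

def reconcile_job(job_id):
    # Recompute the counters from the authoritative per-task state.
    counts = {s: 0 for s in ("Queued", "Rendering", "Completed",
                             "Suspended", "Failed", "Pending")}
    for task in db["JobTasks"].find({"JobID": job_id}, ["Status"]):
        status = task.get("Status", "Queued")
        counts[status] = counts.get(status, 0) + 1
    db["Jobs"].update_one({"_id": job_id}, {"$set": {
        "QueuedChunks": counts["Queued"],
        "RenderingChunks": counts["Rendering"],
        "CompletedChunks": counts["Completed"],
        "SuspendedChunks": counts["Suspended"],
        "FailedChunks": counts["Failed"],
        "PendingChunks": counts["Pending"],
    }})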

We are still seeing this, even after the DB load stabilized. New jobs are getting into this state in a relatively calm DB environment.

We have jobs submitted around 10 pm last night (already in the calm period) that exhibit this behavior.

We have several hundred of these.
