db errors

LaszloSebo · July 8, 2014, 7:08pm

Probably due to the outage we experienced yesterday, some entries in the db got corrupted. Attached are 2 jobs.

One (53bb1d27c5f29c26d86e2dd1) said it was ‘pending’, even though most of its tasks were complete and the remaining 2 tasks were queued. Those 2 queued tasks would never get picked up, however… I was able to mark the job suspended and then requeue the 2 queued tasks. The job now shows the proper state (queued), but the tasks are still not being picked up.

The other one (53bae9edc5f29c207c42ebe7) says it’s rendering on one machine, when in fact all of its tasks are complete. There is no way to ‘force’ the job to complete: I can’t mark it suspended or complete, nothing. We will just delete it…
53bb1d27c5f29c26d86e2dd1.zip (15 KB)
53bae9edc5f29c207c42ebe7.zip (16.1 KB)

LaszloSebo · July 8, 2014, 7:10pm

It seems that even after manually requeuing those frames, the db still shows 3 pending tasks and -1 queued… even though there are actually 2 queued. The only way to render them was to resubmit those tasks as a separate job.

jgaudet · July 8, 2014, 9:47pm

Yeah, this is largely due to the Tasks and the Job info being stored in separate objects. If the DB connection dies after one of them has been updated (and the other hasn’t), it can lead to weirdness like this. We considered adding a catch-all “Repair Job” button that would reset the Job’s metadata based on the current status of its Tasks, but that could honestly cause as many problems as it solves if it’s used while Slaves are in the process of updating that Job’s Tasks.

The easiest way to work around this problem when it comes up is definitely to re-submit the incomplete Tasks as a new Job (as you mentioned doing), at least in 6.2. We’ll see if we can come up with a smarter fix going forward. I think we should be able to figure out a way to Suspend/Resume the task while forcing the Job’s metadata to ‘rebuild’ itself from the actual state of the Tasks.
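
To make the ‘rebuild’ idea concrete, here is a rough Python sketch of recomputing a Job’s cached counts and status purely from its Tasks. Everything in it is an assumption for illustration: the `Job` and `Task` classes, the status strings, and `rebuild_job_metadata` are hypothetical and are not Deadline’s actual schema or API. It also ignores the concurrency concern above, so it would only be safe while nothing else is updating the Job’s Tasks (for example, while the Job is suspended).

```python
from collections import Counter
from dataclasses import dataclass, field


# Hypothetical stand-ins for the separately stored Job and Task objects.
@dataclass
class Task:
    task_id: int
    status: str  # assumed values: "queued", "rendering", "completed"


@dataclass
class Job:
    job_id: str
    status: str
    counts: dict = field(default_factory=dict)  # cached per-status task counts


def rebuild_job_metadata(job, tasks):
    """Overwrite the job's cached counts and status using only the task states."""
    counts = Counter(t.status for t in tasks)
    job.counts = dict(counts)
    if tasks and counts["completed"] == len(tasks):
        job.status = "completed"
    elif counts["rendering"] > 0:
        job.status = "rendering"
    else:
        job.status = "queued"
    return job


# Made-up example: the job record claims one task is still rendering,
# but every task is actually complete.
job = Job("example-job", "rendering", {"completed": 9, "rendering": 1})
tasks = [Task(i, "completed") for i in range(10)]
rebuild_job_metadata(job, tasks)
print(job.status, job.counts)  # completed {'completed': 10}
```

The point of the sketch is that the Tasks are treated as the source of truth and the Job’s counters are only a cache, so a stale value such as -1 queued simply gets overwritten rather than adjusted.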